You aren’t looking in the right place - there is an “openmpi” directory underneath that one, and the mca_xxx libraries are down there
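The layout being described here can be sketched with a throwaway directory tree. This is purely illustrative — the `prefix/` path and the exact component file names below are stand-ins, not the actual installation from this thread:

```shell
# Fake the relevant part of an Open MPI install tree (illustrative paths only):
mkdir -p prefix/lib/openmpi
touch prefix/lib/libmpi.so prefix/lib/libopen-rte.so prefix/lib/libopen-pal.so
touch prefix/lib/openmpi/mca_plm_tm.so prefix/lib/openmpi/mca_ras_tm.so

# The dlopen-able mca_* plugins live one level *below* lib/, so listing
# lib/ itself will not show them:
find prefix/lib -name 'mca_*.so' | sort
```

On a real installation, running ldd on lib/openmpi/mca_plm_tm.so (as suggested below in the thread) then shows whether the tm plugin actually links against libtorque.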
> On Sep 7, 2016, at 7:43 AM, Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>
> Hi Gilles,
>
> I do not have this library. Maybe this helps already...
>
> libmca_common_sm.so   libmpi_mpifh.so   libmpi_usempif08.so   libompitrace.so   libopen-rte.so
> libmpi_cxx.so   libmpi.so   libmpi_usempi_ignore_tkr.so   libopen-pal.so   liboshmem.so
>
> and mpirun only links to libopen-pal/libopen-rte (aside from the standard stuff).
>
> But still it is telling me that it has support for tm? libtorque is there, the headers are also there, and I have enabled tm... *sigh*
>
> Thanks again!
>
> Oswin
>
> On 2016-09-07 16:21, Gilles Gouaillardet wrote:
>> Note the torque library will only show up if you configure'd with --disable-dlopen.
>> Otherwise, you can ldd /.../lib/openmpi/mca_plm_tm.so
>>
>> Cheers,
>>
>> Gilles
>>
>> Bennet Fauber <ben...@umich.edu> wrote:
>>> Oswin,
>>>
>>> Does the torque library show up if you run
>>>
>>> $ ldd mpirun
>>>
>>> That would indicate that Torque support is compiled in.
>>>
>>> Also, what happens if you use the same hostfile, or some hostfile, as an explicit argument when you run mpirun from within the torque job?
>>>
>>> -- bennet
>>>
>>> On Wed, Sep 7, 2016 at 9:25 AM, Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>> Hi Gilles,
>>>>
>>>> Thanks for the hint with the machinefile. I know it is not equivalent and I do not intend to use that approach. I just wanted to know whether I could start the program successfully at all.
>>>> Outside torque (4.2), rsh seems to be used, which works fine, querying for a password if no kerberos ticket is there.
>>>>
>>>> Here is the output:
>>>>
>>>> [zbh251@a00551 ~]$ mpirun -V
>>>> mpirun (Open MPI) 2.0.1
>>>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>>>     MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>     MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>     MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>     MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
>>>> [a00551.science.domain:04104] mca: base: components_register: registering framework plm components
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component isolated
>>>> [a00551.science.domain:04104] mca: base: components_register: component isolated has no register or open function
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component rsh
>>>> [a00551.science.domain:04104] mca: base: components_register: component rsh register function successful
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component slurm
>>>> [a00551.science.domain:04104] mca: base: components_register: component slurm register function successful
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component tm
>>>> [a00551.science.domain:04104] mca: base: components_register: component tm register function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: opening plm components
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component isolated
>>>> [a00551.science.domain:04104] mca: base: components_open: component isolated open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component rsh
>>>> [a00551.science.domain:04104] mca: base: components_open: component rsh open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component slurm
>>>> [a00551.science.domain:04104] mca: base: components_open: component slurm open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component tm
>>>> [a00551.science.domain:04104] mca: base: components_open: component tm open function successful
>>>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [isolated]
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [rsh]
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [slurm]
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [tm]
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [tm] set priority to 75
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Selected component [tm]
>>>> [a00551.science.domain:04104] mca: base: close: component isolated closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component isolated
>>>> [a00551.science.domain:04104] mca: base: close: component rsh closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component rsh
>>>> [a00551.science.domain:04104] mca: base: close: component slurm closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component slurm
>>>> [a00551.science.domain:04109] mca: base: components_register: registering framework plm components
>>>> [a00551.science.domain:04109] mca: base: components_register: found loaded component rsh
>>>> [a00551.science.domain:04109] mca: base: components_register: component rsh register function successful
>>>> [a00551.science.domain:04109] mca: base: components_open: opening plm components
>>>> [a00551.science.domain:04109] mca: base: components_open: found loaded component rsh
>>>> [a00551.science.domain:04109] mca: base: components_open: component rsh open function successful
>>>> [a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
>>>> [a00551.science.domain:04109] mca:base:select:( plm) Querying component [rsh]
>>>> [a00551.science.domain:04109] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>> [a00551.science.domain:04109] mca:base:select:( plm) Selected component [rsh]
>>>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address already in use (98)
>>>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>> Data for JOB [53688,1] offset 0
>>>> ======================== JOB MAP ========================
>>>> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>> Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
>>>>     Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>> =============================================================
>>>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>> [1,1]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:04109] mca: base: close: component rsh closed
>>>> [a00551.science.domain:04109] mca: base: close: unloading component rsh
>>>> [a00551.science.domain:04104] mca: base: close: component tm closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component tm
>>>>
>>>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>>>> Hi,
>>>>>
>>>>> Which version of Open MPI are you running?
>>>>>
>>>>> I noted that though you are asking for three nodes and one task per node, you have been allocated 2 nodes only. I do not know if this is related to this issue.
>>>>>
>>>>> Note if you use the machinefile, a00551 has two slots (since it appears twice in the machinefile) but a00553 has 20 slots (since it appears once in the machinefile, the number of slots is automatically detected).
>>>>>
>>>>> Can you run
>>>>> mpirun --mca plm_base_verbose 10 ...
>>>>> so we can confirm tm is used?
>>>>>
>>>>> Before invoking mpirun, you might want to clean up the ompi directory in /tmp.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am currently trying to set up OpenMPI in torque. OpenMPI is built with tm support. Torque is correctly assigning nodes and I can run MPI programs on single nodes just fine. The problem starts when processes are split between nodes.
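Gilles' slot-counting rule can be checked with plain text tools: each line of the Torque-generated machinefile is one slot, so a host listed twice contributes two slots, while a host listed once gets its slot count auto-detected. A minimal sketch, using a made-up local copy of the nodefile shown later in this thread (the file name is invented for illustration):

```shell
# A sample nodefile shaped like the one Torque hands out in this thread
# (file name and contents reproduced locally for the sketch):
cat > nodefile.sample <<'EOF'
a00551.science.domain
a00553.science.domain
a00551.science.domain
EOF

# One line = one slot, so counting occurrences per host shows why
# a00551 is treated as having 2 slots when this file is used as a hostfile:
sort nodefile.sample | uniq -c
```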
>>>>>> For example, I create an interactive session with torque and start a program by
>>>>>>
>>>>>> qsub -I -n -l nodes=3:ppn=1
>>>>>> mpirun --tag-output -display-map hostname
>>>>>>
>>>>>> which leads to
>>>>>>
>>>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error Address already in use (98)
>>>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>>>> Data for JOB [65415,1] offset 0
>>>>>> ======================== JOB MAP ========================
>>>>>> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>> Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
>>>>>>     Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>> =============================================================
>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>> [1,2]<stdout>:a00551.science.domain
>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>
>>>>>> If I log in on a00551 and start using the hostfile generated by the PBS_NODEFILE, everything works:
>>>>>>
>>>>>> (from within the interactive session)
>>>>>> echo $PBS_NODEFILE
>>>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>>>> cat $PBS_NODEFILE
>>>>>> a00551.science.domain
>>>>>> a00553.science.domain
>>>>>> a00551.science.domain
>>>>>>
>>>>>> (from within the separate login)
>>>>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3 --tag-output -display-map hostname
>>>>>> Data for JOB [65445,1] offset 0
>>>>>> ======================== JOB MAP ========================
>>>>>> Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>> Data for node: a00553.science.domain  Num slots: 20  Max slots: 0  Num procs: 1
>>>>>>     Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>> =============================================================
>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>> [1,2]<stdout>:a00553.science.domain
>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>
>>>>>> I am kind of lost as to what's going on here. Anyone have an idea? I am seriously considering this to be a problem with the kerberos authentication that we have to work with, but I fail to see how this should affect the sockets.
>>>>>> Best,
>>>>>> Oswin

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
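One concrete follow-up to Gilles' advice about cleaning up the ompi directory in /tmp: stale ORTE session files are one plausible source of the usock bind() "Address already in use" failure seen above. The directory name below is an assumption about what an Open MPI 2.x run might leave behind, and the sketch operates on a local sandbox directory rather than the real /tmp, so it is safe to run anywhere:

```shell
# Simulate a leftover Open MPI session directory in a sandbox "tmp/"
# (real leftovers live under /tmp; the name here is purely illustrative):
mkdir -p tmp/ompi.a00551.12345
touch tmp/ompi.a00551.12345/usock

# Gilles' cleanup suggestion, scoped to the ompi leftovers only:
rm -rf tmp/ompi.*
ls tmp
```

On a shared login node, scope any such cleanup to your own user's directories rather than blindly deleting everything under /tmp.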