You aren’t looking in the right place - there is an “openmpi” directory 
underneath that one, and the mca_xxx libraries are down there
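
For example (the prefix below is just a placeholder for wherever your Open MPI
install lives; untested sketch):

   ls <prefix>/lib/openmpi/ | grep tm
   ldd <prefix>/lib/openmpi/mca_plm_tm.so | grep torque

If tm support was built as a plugin, mca_plm_tm.so (and mca_ras_tm.so) should be
listed there, and the ldd output should reference libtorque.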

> On Sep 7, 2016, at 7:43 AM, Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> 
> wrote:
> 
> Hi Gilles,
> 
> I do not have this library. Maybe this helps already...
> 
> libmca_common_sm.so  libmpi_mpifh.so  libmpi_usempif08.so          libompitrace.so  libopen-rte.so
> libmpi_cxx.so        libmpi.so        libmpi_usempi_ignore_tkr.so  libopen-pal.so   liboshmem.so
> 
> and mpirun only links to libopen-pal/libopen-rte (aside from the standard 
> stuff).
> 
> But it is still telling me that it has support for tm? libtorque is there, the 
> headers are there, and I have enabled tm... *sigh*
> 
> Thanks again!
> 
> Oswin
> 
> On 2016-09-07 16:21, Gilles Gouaillardet wrote:
>> Note the Torque library will only show up if you configured with
>> --disable-dlopen. Otherwise, you can ldd
>> /.../lib/openmpi/mca_plm_tm.so
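>> For reference, a build that explicitly enables tm support uses a configure
>> line along these lines (the Torque prefix here is only an assumed example):
>> ./configure --with-tm=/opt/torque [--disable-dlopen]
>> With --disable-dlopen the components are linked into the core libraries
>> instead of being built as separate mca_*.so plugins.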
>> Cheers,
>> Gilles
>> Bennet Fauber <ben...@umich.edu> wrote:
>>> Oswin,
>>> Does the torque library show up if you run
>>> $ ldd mpirun
>>> That would indicate that Torque support is compiled in.
>>> Also, what happens if you use the same hostfile (or any hostfile) as an
>>> explicit argument when you run mpirun from within the Torque job?
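>>> For example, from inside the Torque job, something along these lines
>>> (untested; it reuses the same options as your earlier command):
>>> $ mpirun --hostfile $PBS_NODEFILE -np 3 --tag-output -display-map hostname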
>>> -- bennet
>>> On Wed, Sep 7, 2016 at 9:25 AM, Oswin Krause
>>> <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>> Hi Gilles,
>>>> Thanks for the hint with the machinefile. I know it is not equivalent, and I
>>>> do not intend to use that approach; I just wanted to know whether I could
>>>> start the program successfully at all.
>>>> Outside Torque (4.2), rsh seems to be used, which works fine, prompting for a
>>>> password if no Kerberos ticket is present.
>>>> Here is the output:
>>>> [zbh251@a00551 ~]$ mpirun -V
>>>> mpirun (Open MPI) 2.0.1
>>>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>>>                 MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component
>>>> v2.0.1)
>>>>                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
>>>> v2.0.1)
>>>>                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>                 MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output
>>>> -display-map hostname
>>>> [a00551.science.domain:04104] mca: base: components_register: registering
>>>> framework plm components
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded
>>>> component isolated
>>>> [a00551.science.domain:04104] mca: base: components_register: component
>>>> isolated has no register or open function
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded
>>>> component rsh
>>>> [a00551.science.domain:04104] mca: base: components_register: component rsh
>>>> register function successful
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded
>>>> component slurm
>>>> [a00551.science.domain:04104] mca: base: components_register: component
>>>> slurm register function successful
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded
>>>> component tm
>>>> [a00551.science.domain:04104] mca: base: components_register: component tm
>>>> register function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: opening plm
>>>> components
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>>>> component isolated
>>>> [a00551.science.domain:04104] mca: base: components_open: component 
>>>> isolated
>>>> open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>>>> component rsh
>>>> [a00551.science.domain:04104] mca: base: components_open: component rsh 
>>>> open
>>>> function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>>>> component slurm
>>>> [a00551.science.domain:04104] mca: base: components_open: component slurm
>>>> open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>>>> component tm
>>>> [a00551.science.domain:04104] mca: base: components_open: component tm open
>>>> function successful
>>>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm 
>>>> components
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>>>> [isolated]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component
>>>> [isolated] set priority to 0
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>>>> [rsh]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component
>>>> [rsh] set priority to 10
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>>>> [slurm]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>>>> [tm]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component
>>>> [tm] set priority to 75
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Selected component
>>>> [tm]
>>>> [a00551.science.domain:04104] mca: base: close: component isolated closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component 
>>>> isolated
>>>> [a00551.science.domain:04104] mca: base: close: component rsh closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component rsh
>>>> [a00551.science.domain:04104] mca: base: close: component slurm closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component slurm
>>>> [a00551.science.domain:04109] mca: base: components_register: registering
>>>> framework plm components
>>>> [a00551.science.domain:04109] mca: base: components_register: found loaded
>>>> component rsh
>>>> [a00551.science.domain:04109] mca: base: components_register: component rsh
>>>> register function successful
>>>> [a00551.science.domain:04109] mca: base: components_open: opening plm
>>>> components
>>>> [a00551.science.domain:04109] mca: base: components_open: found loaded
>>>> component rsh
>>>> [a00551.science.domain:04109] mca: base: components_open: component rsh 
>>>> open
>>>> function successful
>>>> [a00551.science.domain:04109] mca:base:select: Auto-selecting plm 
>>>> components
>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Querying component
>>>> [rsh]
>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Query of component
>>>> [rsh] set priority to 10
>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Selected component
>>>> [rsh]
>>>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address
>>>> already in use (98)
>>>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file
>>>> oob_usock_component.c at line 228
>>>> Data for JOB [53688,1] offset 0
>>>> ========================   JOB MAP   ========================
>>>> Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>>>        Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]],
>>>> socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt
>>>> 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core
>>>> 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>        Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket
>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 
>>>> 0-1]],
>>>> socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 
>>>> 15[hwt
>>>> 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 
>>>> 1[core
>>>> 18[hwt 0-1]], socket 1[core 19[hwt
>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>> Data for node: a00553.science.domain   Num slots: 1    Max slots: 0    Num
>>>> procs: 1
>>>>        Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]],
>>>> socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt
>>>> 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core
>>>> 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>> =============================================================
>>>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc
>>>> state command from [[53688,0],1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got
>>>> update_proc_state for job [53688,1]
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc
>>>> state command from [[53688,0],1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got
>>>> update_proc_state for job [53688,1]
>>>> [1,1]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:04109] mca: base: close: component rsh closed
>>>> [a00551.science.domain:04109] mca: base: close: unloading component rsh
>>>> [a00551.science.domain:04104] mca: base: close: component tm closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component tm
>>>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>>>> Hi,
>>>>> Which version of Open MPI are you running?
>>>>> I noted that though you are asking for three nodes and one task per node,
>>>>> you have been allocated only two nodes.
>>>>> I do not know if this is related to this issue.
>>>>> Note that if you use the machinefile, a00551 has two slots (since it
>>>>> appears twice in the machinefile), but a00553 has 20 slots (since it
>>>>> appears only once, the number of slots is automatically detected).
>>>>> Can you run
>>>>> mpirun --mca plm_base_verbose 10 ...
>>>>> so we can confirm tm is used?
>>>>> Before invoking mpirun, you might want to clean up the ompi directory in
>>>>> /tmp.
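>>>>> For example (the session-directory names differ between Open MPI versions,
>>>>> so check with ls first; this is just a sketch):
>>>>> ls -d /tmp/ompi.* /tmp/openmpi-sessions-* 2>/dev/null
>>>>> rm -rf /tmp/ompi.$(hostname).$(id -u)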
>>>>> Cheers,
>>>>> Gilles
>>>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>>> Hi,
>>>>>> I am currently trying to set up Open MPI in Torque. Open MPI is built with
>>>>>> tm support. Torque is correctly assigning nodes, and I can run
>>>>>> MPI programs on a single node just fine. The problem starts when
>>>>>> processes are split between nodes.
>>>>>> For example, I create an interactive session with torque and start a
>>>>>> program by
>>>>>> qsub -I -n -l nodes=3:ppn=1
>>>>>> mpirun --tag-output -display-map hostname
>>>>>> which leads to
>>>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error
>>>>>> Address already in use (98)
>>>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in
>>>>>> file oob_usock_component.c at line 228
>>>>>> Data for JOB [65415,1] offset 0
>>>>>> ========================   JOB MAP   ========================
>>>>>> Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>>>>>        Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound:
>>>>>> socket
>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>        Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound:
>>>>>> socket
>>>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>> Data for node: a00553.science.domain   Num slots: 1    Max slots: 0
>>>>>> Num
>>>>>> procs: 1
>>>>>>        Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound:
>>>>>> socket
>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>> =============================================================
>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>> [1,2]<stdout>:a00551.science.domain
>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>> If I log in to a00551 and use the hostfile referenced by
>>>>>> $PBS_NODEFILE, everything works:
>>>>>> (from within the interactive session)
>>>>>> echo $PBS_NODEFILE
>>>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>>>> cat $PBS_NODEFILE
>>>>>> a00551.science.domain
>>>>>> a00553.science.domain
>>>>>> a00551.science.domain
>>>>>> (from within the separate login)
>>>>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3
>>>>>> --tag-output -display-map hostname
>>>>>> Data for JOB [65445,1] offset 0
>>>>>> ========================   JOB MAP   ========================
>>>>>> Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>>>>>        Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound:
>>>>>> socket
>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>        Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound:
>>>>>> socket
>>>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>> Data for node: a00553.science.domain   Num slots: 20   Max slots: 0
>>>>>> Num
>>>>>> procs: 1
>>>>>>        Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound:
>>>>>> socket
>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>> =============================================================
>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>> [1,2]<stdout>:a00553.science.domain
>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>> I am kind of lost as to what is going on here. Does anyone have an idea? I am
>>>>>> seriously considering this to be a problem with the Kerberos
>>>>>> authentication that we have to work with, but I fail to see how that
>>>>>> should affect the sockets.
>>>>>> Best,
>>>>>> Oswin

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
