I will double-check the name.
If you did not configure with --disable-dlopen, then mpirun only links with
opal and orte.
At run time, these libs will dlopen the plugins (from the openmpi
subdirectory; they are named mca_abc_xyz.so).
If you have support for tm, then one of the plugins will be linked with the
torque libs.
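
A quick way to check (the <prefix> below is just a placeholder for wherever
your Open MPI is installed):

$ ldd $(which mpirun) | grep -E 'open-pal|open-rte'
$ ls <prefix>/lib/openmpi/ | grep mca_plm_tm
$ ldd <prefix>/lib/openmpi/mca_plm_tm.so | grep -i torque

If the last command shows libtorque, the tm plugin was built and is linked
against your Torque libraries.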

Cheers,

Gilles

Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>Hi Gilles,
>
>I do not have this library. Maybe this helps already...
>
>libmca_common_sm.so  libmpi_mpifh.so  libmpi_usempif08.so  libompitrace.so  libopen-rte.so
>libmpi_cxx.so        libmpi.so        libmpi_usempi_ignore_tkr.so  libopen-pal.so  liboshmem.so
>
>and mpirun only links to libopen-pal/libopen-rte (aside from the standard
>stuff).
>
>But it still tells me that it has support for tm? libtorque is there, the
>headers are also there, and I have enabled tm... *sigh*
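>
>As a sanity check, I can probably also ask ompi_info about the plm framework
>directly (same kind of output as the ras lines further down):
>
>$ ompi_info | grep " plm:"
>
>which should list a "MCA plm: tm" line if tm support was really compiled in.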
>
>Thanks again!
>
>Oswin
>
>On 2016-09-07 16:21, Gilles Gouaillardet wrote:
>> Note the torque library will only show up in the ldd output of mpirun if
>> you configured with --disable-dlopen. Otherwise, you can ldd
>> /.../lib/openmpi/mca_plm_tm.so
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> Bennet Fauber <ben...@umich.edu> wrote:
>>> Oswin,
>>> 
>>> Does the torque library show up if you run
>>> 
>>> $ ldd mpirun
>>> 
>>> That would indicate that Torque support is compiled in.
>>> 
>>> Also, what happens if you pass the same hostfile (or any hostfile) as an
>>> explicit argument when you run mpirun from within the torque job?
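>>>
>>> For example (illustrative only; inside the job, $PBS_NODEFILE points at
>>> the hosts Torque allocated):
>>>
>>> $ mpirun --hostfile $PBS_NODEFILE -np 3 --tag-output hostname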
>>> 
>>> -- bennet
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Sep 7, 2016 at 9:25 AM, Oswin Krause
>>> <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>> Hi Gilles,
>>>> 
>>>> Thanks for the hint with the machinefile. I know it is not equivalent and
>>>> I do not intend to use that approach. I just wanted to know whether I
>>>> could start the program successfully at all.
>>>>
>>>> Outside torque (4.2), rsh seems to be used, which works fine and asks for
>>>> a password if no kerberos ticket is present.
>>>> 
>>>> Here is the output:
>>>> [zbh251@a00551 ~]$ mpirun -V
>>>> mpirun (Open MPI) 2.0.1
>>>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>>>                  MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>                  MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>                  MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>                  MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
>>>> [a00551.science.domain:04104] mca: base: components_register: registering framework plm components
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component isolated
>>>> [a00551.science.domain:04104] mca: base: components_register: component isolated has no register or open function
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component rsh
>>>> [a00551.science.domain:04104] mca: base: components_register: component rsh register function successful
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component slurm
>>>> [a00551.science.domain:04104] mca: base: components_register: component slurm register function successful
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component tm
>>>> [a00551.science.domain:04104] mca: base: components_register: component tm register function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: opening plm components
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component isolated
>>>> [a00551.science.domain:04104] mca: base: components_open: component isolated open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component rsh
>>>> [a00551.science.domain:04104] mca: base: components_open: component rsh open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component slurm
>>>> [a00551.science.domain:04104] mca: base: components_open: component slurm open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component tm
>>>> [a00551.science.domain:04104] mca: base: components_open: component tm open function successful
>>>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component [isolated]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component [isolated] set priority to 0
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component [rsh]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component [rsh] set priority to 10
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component [slurm]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component [tm]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component [tm] set priority to 75
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Selected component [tm]
>>>> [a00551.science.domain:04104] mca: base: close: component isolated closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component isolated
>>>> [a00551.science.domain:04104] mca: base: close: component rsh closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component rsh
>>>> [a00551.science.domain:04104] mca: base: close: component slurm closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component slurm
>>>> [a00551.science.domain:04109] mca: base: components_register: registering framework plm components
>>>> [a00551.science.domain:04109] mca: base: components_register: found loaded component rsh
>>>> [a00551.science.domain:04109] mca: base: components_register: component rsh register function successful
>>>> [a00551.science.domain:04109] mca: base: components_open: opening plm components
>>>> [a00551.science.domain:04109] mca: base: components_open: found loaded component rsh
>>>> [a00551.science.domain:04109] mca: base: components_open: component rsh open function successful
>>>> [a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Querying component [rsh]
>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Query of component [rsh] set priority to 10
>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Selected component [rsh]
>>>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address already in use (98)
>>>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>>  Data for JOB [53688,1] offset 0
>>>>
>>>>  ========================   JOB MAP   ========================
>>>>
>>>>  Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>
>>>>  Data for node: a00553.science.domain   Num slots: 1    Max slots: 0    Num procs: 1
>>>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>
>>>>  =============================================================
>>>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>> [1,1]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:04109] mca: base: close: component rsh closed
>>>> [a00551.science.domain:04109] mca: base: close: unloading component rsh
>>>> [a00551.science.domain:04104] mca: base: close: component tm closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component tm
>>>> 
>>>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Which version of Open MPI are you running ?
>>>>> 
>>>>> I noticed that although you asked for three nodes and one task per node,
>>>>> you have been allocated only 2 nodes.
>>>>> I do not know whether this is related to the issue.
>>>>> 
>>>>> Note that if you use the machinefile, a00551 has two slots (since it
>>>>> appears twice in the machinefile) but a00553 has 20 slots (since it
>>>>> appears only once, the number of slots is auto-detected from the
>>>>> hardware); see the short example below.
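>>>>>
>>>>> For example, with the machinefile taken from your PBS_NODEFILE (as
>>>>> posted below)
>>>>>
>>>>> a00551.science.domain
>>>>> a00553.science.domain
>>>>> a00551.science.domain
>>>>>
>>>>> Open MPI counts slots=2 for a00551 (listed twice), while a00553 (listed
>>>>> once, no slots= given) falls back to the number of detected cores,
>>>>> which is 20 on your nodes.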
>>>>> 
>>>>> Can you run
>>>>> mpirun --mca plm_base_verbose 10 ...
>>>>> So we can confirm tm is used.
>>>>> 
>>>>> Before invoking mpirun, you might want to clean up the ompi directory
>>>>> in /tmp.
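>>>>>
>>>>> (Assuming the usual naming, the session directories look something like
>>>>> /tmp/ompi.<hostname>.<uid>/ on 2.x, or /tmp/openmpi-sessions-*/ on older
>>>>> releases; for example
>>>>>
>>>>> $ ls -d /tmp/ompi* /tmp/openmpi-sessions-* 2>/dev/null
>>>>>
>>>>> shows what is left over, and anything of yours from dead jobs can be
>>>>> removed.)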
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Gilles
>>>>> 
>>>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I am currently trying to set up OpenMPI in torque. OpenMPI is built
>>>>>> with tm support. Torque is correctly assigning nodes and I can run MPI
>>>>>> programs on single nodes just fine. The problem starts when processes
>>>>>> are split between nodes.
>>>>>> 
>>>>>> For example, I create an interactive session with torque and start a
>>>>>> program by
>>>>>> 
>>>>>> qsub -I -n -l nodes=3:ppn=1
>>>>>> mpirun --tag-output -display-map hostname
>>>>>> 
>>>>>> which leads to
>>>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error Address already in use (98)
>>>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>>>>  Data for JOB [65415,1] offset 0
>>>>>> 
>>>>>>  ========================   JOB MAP   ========================
>>>>>> 
>>>>>>  Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>
>>>>>>  Data for node: a00553.science.domain   Num slots: 1    Max slots: 0    Num procs: 1
>>>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>> 
>>>>>>  =============================================================
>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>> [1,2]<stdout>:a00551.science.domain
>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>> 
>>>>>> 
>>>>>> If I log in to a00551 and use the hostfile generated from the
>>>>>> PBS_NODEFILE, everything works:
>>>>>> 
>>>>>> (from within the interactive session)
>>>>>> echo $PBS_NODEFILE
>>>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>>>> cat $PBS_NODEFILE
>>>>>> a00551.science.domain
>>>>>> a00553.science.domain
>>>>>> a00551.science.domain
>>>>>> 
>>>>>> (from within the separate login)
>>>>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3 --tag-output -display-map hostname
>>>>>> 
>>>>>>  Data for JOB [65445,1] offset 0
>>>>>> 
>>>>>>  ========================   JOB MAP   ========================
>>>>>> 
>>>>>>  Data for node: a00551  Num slots: 2    Max slots: 0    Num procs: 2
>>>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>
>>>>>>  Data for node: a00553.science.domain   Num slots: 20   Max slots: 0    Num procs: 1
>>>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>> 
>>>>>>  =============================================================
>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>> [1,2]<stdout>:a00553.science.domain
>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>> 
>>>>>> I am kind of lost as to what is going on here. Does anyone have an
>>>>>> idea? I seriously suspect this is a problem with the kerberos
>>>>>> authentication we have to work with, but I fail to see how that should
>>>>>> affect the sockets.
>>>>>> 
>>>>>> Best,
>>>>>> Oswin
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
