I will double check the name. If you did not configure with --disable-dlopen, then mpirun only links with opal and orte. At run time, these libs will dlopen the plugins (from the openmpi subdirectory; they are named mca_abc_xyz.so). If you have support for tm, then one of the plugins will be linked with the torque libs.
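
A quick way to check, assuming Open MPI was installed under /usr/local (adjust to your actual --prefix):

$ ls /usr/local/lib/openmpi/mca_*_tm.so
$ ldd /usr/local/lib/openmpi/mca_plm_tm.so | grep -i torque

If tm support was built, the first command should include mca_plm_tm.so in its listing, and the second should show the torque library it is linked against.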
Cheers,

Gilles

Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
> Hi Gilles,
>
> I do not have this library. Maybe this helps already...
>
> libmca_common_sm.so  libmpi_mpifh.so  libmpi_usempif08.so  libompitrace.so  libopen-rte.so
> libmpi_cxx.so  libmpi.so  libmpi_usempi_ignore_tkr.so  libopen-pal.so  liboshmem.so
>
> and mpirun only links to libopen-pal/libopen-rte (aside from the standard stuff).
>
> But still it is telling me that it has support for tm? libtorque is there and the headers are also there and since I have enabled tm... *sigh*
>
> Thanks again!
>
> Oswin
>
> On 2016-09-07 16:21, Gilles Gouaillardet wrote:
>> Note the torque library will only show up if you configure'd with --disable-dlopen. Otherwise, you can ldd /.../lib/openmpi/mca_plm_tm.so
>>
>> Cheers,
>>
>> Gilles
>>
>> Bennet Fauber <ben...@umich.edu> wrote:
>>> Oswin,
>>>
>>> Does the torque library show up if you run
>>>
>>> $ ldd mpirun
>>>
>>> That would indicate that Torque support is compiled in.
>>>
>>> Also, what happens if you use the same hostfile, or some hostfile as an explicit argument when you run mpirun from within the torque job?
>>>
>>> -- bennet
>>>
>>> On Wed, Sep 7, 2016 at 9:25 AM, Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>> Hi Gilles,
>>>>
>>>> Thanks for the hint with the machinefile. I know it is not equivalent and I do not intend to use that approach. I just wanted to know whether I could start the program successfully at all.
>>>>
>>>> Outside torque (4.2), rsh seems to be used, which works fine, asking for a password if no Kerberos ticket is there.
>>>>
>>>> Here is the output:
>>>> [zbh251@a00551 ~]$ mpirun -V
>>>> mpirun (Open MPI) 2.0.1
>>>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>>>            MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>            MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>            MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>>            MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
>>>> [a00551.science.domain:04104] mca: base: components_register: registering framework plm components
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component isolated
>>>> [a00551.science.domain:04104] mca: base: components_register: component isolated has no register or open function
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component rsh
>>>> [a00551.science.domain:04104] mca: base: components_register: component rsh register function successful
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component slurm
>>>> [a00551.science.domain:04104] mca: base: components_register: component slurm register function successful
>>>> [a00551.science.domain:04104] mca: base: components_register: found loaded component tm
>>>> [a00551.science.domain:04104] mca: base: components_register: component tm register function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: opening plm components
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component isolated
>>>> [a00551.science.domain:04104] mca: base: components_open: component isolated open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component rsh
>>>> [a00551.science.domain:04104] mca: base: components_open: component rsh open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component slurm
>>>> [a00551.science.domain:04104] mca: base: components_open: component slurm open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded component tm
>>>> [a00551.science.domain:04104] mca: base: components_open: component tm open function successful
>>>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [isolated]
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [rsh]
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [slurm]
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [tm]
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Query of component [tm] set priority to 75
>>>> [a00551.science.domain:04104] mca:base:select:( plm) Selected component [tm]
>>>> [a00551.science.domain:04104] mca: base: close: component isolated closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component isolated
>>>> [a00551.science.domain:04104] mca: base: close: component rsh closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component rsh
>>>> [a00551.science.domain:04104] mca: base: close: component slurm closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component slurm
>>>> [a00551.science.domain:04109] mca: base: components_register: registering framework plm components
>>>> [a00551.science.domain:04109] mca: base: components_register: found loaded component rsh
>>>> [a00551.science.domain:04109] mca: base: components_register: component rsh register function successful
>>>> [a00551.science.domain:04109] mca: base: components_open: opening plm components
>>>> [a00551.science.domain:04109] mca: base: components_open: found loaded component rsh
>>>> [a00551.science.domain:04109] mca: base: components_open: component rsh open function successful
>>>> [a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
>>>> [a00551.science.domain:04109] mca:base:select:( plm) Querying component [rsh]
>>>> [a00551.science.domain:04109] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>> [a00551.science.domain:04109] mca:base:select:( plm) Selected component [rsh]
>>>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address already in use (98)
>>>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>>  Data for JOB [53688,1] offset 0
>>>>
>>>>  ======================== JOB MAP ========================
>>>>
>>>>  Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>
>>>>  Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
>>>>         Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>
>>>>  =============================================================
>>>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
>>>> [1,1]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:04109] mca: base: close: component rsh closed
>>>> [a00551.science.domain:04109] mca: base: close: unloading component rsh
>>>> [a00551.science.domain:04104] mca: base: close: component tm closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component tm
>>>>
>>>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>>>> Hi,
>>>>>
>>>>> Which version of Open MPI are you running?
>>>>>
>>>>> I noted that though you are asking for three nodes and one task per node, you have been allocated 2 nodes only.
>>>>> I do not know if this is related to this issue.
>>>>>
>>>>> Note if you use the machinefile, a00551 has two slots (since it appears twice in the machinefile) but a00553 has 20 slots (since it appears once in the machinefile, the number of slots is automatically detected).
>>>>>
>>>>> Can you run
>>>>> mpirun --mca plm_base_verbose 10 ...
>>>>> so we can confirm tm is used.
>>>>>
>>>>> Before invoking mpirun, you might want to clean up the ompi directory in /tmp.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am currently trying to set up OpenMPI in torque. OpenMPI is built with tm support. Torque is correctly assigning nodes, and I can run MPI programs on single nodes just fine. The problem starts when processes are split between nodes.
>>>>>>
>>>>>> For example, I create an interactive session with torque and start a program by
>>>>>>
>>>>>> qsub -I -n -l nodes=3:ppn=1
>>>>>> mpirun --tag-output -display-map hostname
>>>>>>
>>>>>> which leads to
>>>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error Address already in use (98)
>>>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
>>>>>>  Data for JOB [65415,1] offset 0
>>>>>>
>>>>>>  ======================== JOB MAP ========================
>>>>>>
>>>>>>  Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>
>>>>>>  Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
>>>>>>         Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>
>>>>>>  =============================================================
>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>> [1,2]<stdout>:a00551.science.domain
>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>
>>>>>> If I log in on a00551 and start using the hostfile generated by the PBS_NODEFILE, everything works:
>>>>>>
>>>>>> (from within the interactive session)
>>>>>> echo $PBS_NODEFILE
>>>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>>>> cat $PBS_NODEFILE
>>>>>> a00551.science.domain
>>>>>> a00553.science.domain
>>>>>> a00551.science.domain
>>>>>>
>>>>>> (from within the separate login)
>>>>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3 --tag-output -display-map hostname
>>>>>>
>>>>>>  Data for JOB [65445,1] offset 0
>>>>>>
>>>>>>  ======================== JOB MAP ========================
>>>>>>
>>>>>>  Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
>>>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>>
>>>>>>  Data for node: a00553.science.domain  Num slots: 20  Max slots: 0  Num procs: 1
>>>>>>         Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>
>>>>>>  =============================================================
>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>> [1,2]<stdout>:a00553.science.domain
>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>>
>>>>>> I am kind of lost what's going on here. Anyone having an idea? I am seriously considering this to be a problem of the Kerberos authentication that we have to work with, but I fail to see how this should affect the sockets.
>>>>>>
>>>>>> Best,
>>>>>> Oswin

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users