Someone has done some work there since I last worked on it, but I can see the 
issue. Torque indeed always provides an ordered file - the only way you can get 
an unordered one is for someone to edit it, and that is forbidden - i.e., you 
get what you deserve because you are messing around with a system-defined file :-)

The problem is that Torque internally assigns a “launch ID” which is just the 
integer position of the nodename in the PBS_NODEFILE. So if you modify that 
position, then we get the wrong index - and everything goes down the drain from 
there. In your example, n1.cluster changed index from 3 to 2 because of your 
edit. Torque thinks that index 2 is just another reference to n0.cluster, and 
so we merrily launch a daemon onto the wrong node.
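To make the indexing concrete, here is one quick way to see the positions Torque 
uses as launch IDs (just an illustration against Gilles' example nodefile; the 
numbering is the point, not the command):

   $ cat -n $PBS_NODEFILE
        1  n0.cluster
        2  n0.cluster
        3  n1.cluster
        4  n1.cluster

Torque built its internal table from that original order, so once the file is 
reordered to n0/n1/n0/n1, line 2 reads n1.cluster while Torque still believes 
launch ID 2 means n0.cluster.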

They have a good reason for doing things this way. It allows you to launch a 
process against each launch ID, and the pattern will reflect the original qsub 
request in what we would call a map-by slot round-robin mode. This maximizes 
the use of shared memory, and is expected to provide good performance for a 
range of apps.

Lesson to be learned: never, ever muddle around with a system-generated file. 
If you want to modify where things go, then use one or more of the mpirun 
options to do so. We give you lots and lots of knobs for just that reason.
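For example (a rough sketch - the exact option names depend on your Open MPI 
version, so check "mpirun --help" first; ./a.out just stands in for your app):

   mpirun --map-by node ./a.out    # round-robin ranks across the allocated nodes
   mpirun --map-by slot ./a.out    # fill each node's slots before moving on
   mpirun -npernode 1   ./a.out    # exactly one rank per node

Any of those change the placement without touching the Torque-generated nodefile.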



> On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Ralph,
> 
> 
> There might be an issue within Open MPI.
> 
> 
> On the cluster I used, hostname returns the FQDN, and $PBS_NODEFILE uses the 
> FQDN too.
> 
> My $PBS_NODEFILE has one line per task, and it is ordered
> 
> e.g.
> 
> n0.cluster
> 
> n0.cluster
> 
> n1.cluster
> 
> n1.cluster
> 
> 
> In my Torque script, I rewrote the machinefile like this:
> 
> n0.cluster
> 
> n1.cluster
> 
> n0.cluster
> 
> n1.cluster
> 
> and updated the PBS environment variable to point to my new file.
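> in other words, roughly like this (the path is just an example):
> 
> cp $PBS_NODEFILE /tmp/my_nodefile    # then reordered the lines by hand
> export PBS_NODEFILE=/tmp/my_nodefile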
> 
> 
> Then I invoked
> 
> mpirun hostname
> 
> 
> 
> In the first case, 2 tasks run on n0 and 2 tasks run on n1.
> In the second case, 4 tasks run on n0, and none on n1.
> 
> So I am thinking we might not support an unordered $PBS_NODEFILE.
> 
> As a reminder, the submit command was
> qsub -l nodes=3:ppn=1
> but for reasons I do not know, only two nodes were allocated (two slots on 
> the first one, one on the second one),
> and if I understand correctly, $PBS_NODEFILE was not ordered
> (e.g. n0 n1 n0 and *not* n0 n0 n1).
> 
> I tried to reproduce this without hacking $PBS_NODEFILE, but my jobs hang in 
> the queue if only two nodes with 16 slots each are available and I request
> -l nodes=3:ppn=1
> I guess this is a different scheduler configuration, and I cannot change that.
> 
> Could you please have a look at this?
> 
> Cheers,
> 
> Gilles
> 
> On 9/7/2016 11:15 PM, r...@open-mpi.org wrote:
>> The usual cause of this problem is that the nodename in the machinefile is 
>> given as a00551, while Torque is assigning the node name as 
>> a00551.science.domain. Thus, mpirun thinks those are two separate nodes and 
>> winds up spawning an orted on its own node.
>> 
>> You might try ensuring that your machinefile is using the exact same name as 
>> provided in your allocation.
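>> A quick sanity check (the commands are just a suggestion, adjust to your setup):
>> 
>> hostname; hostname -f           # what the node calls itself
>> sort -u $PBS_NODEFILE           # what Torque put in the allocation
>> 
>> If those names do not match exactly, mpirun treats them as different hosts.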
>> 
>> 
>>> On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@gmail.com> wrote:
>>> 
>>> Thanks for the logs.
>>> 
>>> From what I see now, it looks like a00551 is running both mpirun and orted, 
>>> though it should only run mpirun, and orted should run only on a00553.
>>> 
>>> I will check the code and see what could be happening here
>>> 
>>> Btw, what is the output of
>>> hostname
>>> hostname -f
>>> on a00551?
>>> 
>>> Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) 
>>> installed and running correctly on your cluster?
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>> Hi Gilles,
>>>> 
>>>> Thanks for the hint with the machinefile. I know it is not equivalent
>>>> and i do not intend to use that approach. I just wanted to know whether
>>>> I could start the program successfully at all.
>>>> 
>>>> Outside torque(4.2), rsh seems to be used which works fine, querying a
>>>> password if no kerberos ticket is there
>>>> 
>>>> Here is the output:
>>>> [zbh251@a00551 ~]$ mpirun -V
>>>> mpirun (Open MPI) 2.0.1
>>>> [zbh251@a00551 ~]$ ompi_info | grep ras
>>>>                 MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component
>>>> v2.0.1)
>>>>                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
>>>> v2.0.1)
>>>>                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component
>>>> v2.0.1)
>>>>                 MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>>>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output
>>>> -display-map hostname
>>>> [a00551.science.domain:04104] mca: base: components_register:
>>>> registering framework plm components
>>>> [a00551.science.domain:04104] mca: base: components_register: found
>>>> loaded component isolated
>>>> [a00551.science.domain:04104] mca: base: components_register: component
>>>> isolated has no register or open function
>>>> [a00551.science.domain:04104] mca: base: components_register: found
>>>> loaded component rsh
>>>> [a00551.science.domain:04104] mca: base: components_register: component
>>>> rsh register function successful
>>>> [a00551.science.domain:04104] mca: base: components_register: found
>>>> loaded component slurm
>>>> [a00551.science.domain:04104] mca: base: components_register: component
>>>> slurm register function successful
>>>> [a00551.science.domain:04104] mca: base: components_register: found
>>>> loaded component tm
>>>> [a00551.science.domain:04104] mca: base: components_register: component
>>>> tm register function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: opening plm
>>>> components
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>>>> component isolated
>>>> [a00551.science.domain:04104] mca: base: components_open: component
>>>> isolated open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>>>> component rsh
>>>> [a00551.science.domain:04104] mca: base: components_open: component rsh
>>>> open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>>>> component slurm
>>>> [a00551.science.domain:04104] mca: base: components_open: component
>>>> slurm open function successful
>>>> [a00551.science.domain:04104] mca: base: components_open: found loaded
>>>> component tm
>>>> [a00551.science.domain:04104] mca: base: components_open: component tm
>>>> open function successful
>>>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm
>>>> components
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>>>> [isolated]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component
>>>> [isolated] set priority to 0
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>>>> [rsh]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component
>>>> [rsh] set priority to 10
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>>>> [slurm]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component
>>>> [tm]
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component
>>>> [tm] set priority to 75
>>>> [a00551.science.domain:04104] mca:base:select:(  plm) Selected component
>>>> [tm]
>>>> [a00551.science.domain:04104] mca: base: close: component isolated
>>>> closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component
>>>> isolated
>>>> [a00551.science.domain:04104] mca: base: close: component rsh closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component rsh
>>>> [a00551.science.domain:04104] mca: base: close: component slurm closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component
>>>> slurm
>>>> [a00551.science.domain:04109] mca: base: components_register:
>>>> registering framework plm components
>>>> [a00551.science.domain:04109] mca: base: components_register: found
>>>> loaded component rsh
>>>> [a00551.science.domain:04109] mca: base: components_register: component
>>>> rsh register function successful
>>>> [a00551.science.domain:04109] mca: base: components_open: opening plm
>>>> components
>>>> [a00551.science.domain:04109] mca: base: components_open: found loaded
>>>> component rsh
>>>> [a00551.science.domain:04109] mca: base: components_open: component rsh
>>>> open function successful
>>>> [a00551.science.domain:04109] mca:base:select: Auto-selecting plm
>>>> components
>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Querying component
>>>> [rsh]
>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Query of component
>>>> [rsh] set priority to 10
>>>> [a00551.science.domain:04109] mca:base:select:(  plm) Selected component
>>>> [rsh]
>>>> [a00551.science.domain:04109] [[53688,0],1] bind() failed on error
>>>> Address already in use (98)
>>>> [a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in
>>>> file oob_usock_component.c at line 228
>>>> Data for JOB [53688,1] offset 0
>>>> 
>>>> ========================   JOB MAP   ========================
>>>> 
>>>> Data for node: a00551      Num slots: 2    Max slots: 0    Num procs: 2
>>>>    Process OMPI jobid: [53688,1] App: 0 Process rank: 0 Bound: socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>    Process OMPI jobid: [53688,1] App: 0 Process rank: 1 Bound: socket
>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>> 
>>>> Data for node: a00553.science.domain       Num slots: 1    Max slots: 0    Num procs: 1
>>>>    Process OMPI jobid: [53688,1] App: 0 Process rank: 2 Bound: socket
>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>> 
>>>> =============================================================
>>>> [a00551.science.domain:04104] [[53688,0],0] complete_setup on job
>>>> [53688,1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc
>>>> state command from [[53688,0],1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got
>>>> update_proc_state for job [53688,1]
>>>> [1,0]<stdout>:a00551.science.domain
>>>> [1,2]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc
>>>> state command from [[53688,0],1]
>>>> [a00551.science.domain:04104] [[53688,0],0] plm:base:receive got
>>>> update_proc_state for job [53688,1]
>>>> [1,1]<stdout>:a00551.science.domain
>>>> [a00551.science.domain:04109] mca: base: close: component rsh closed
>>>> [a00551.science.domain:04109] mca: base: close: unloading component rsh
>>>> [a00551.science.domain:04104] mca: base: close: component tm closed
>>>> [a00551.science.domain:04104] mca: base: close: unloading component tm
>>>> 
>>>> On 2016-09-07 14:41, Gilles Gouaillardet wrote:
>>>>> Hi,
>>>>> 
>>>>> Which version of Open MPI are you running?
>>>>> 
>>>>> I noted that though you are asking three nodes and one task per node,
>>>>> you have been allocated 2 nodes only.
>>>>> I do not know if this is related to this issue.
>>>>> 
>>>>> Note that if you use the machinefile, a00551 has two slots (since it
>>>>> appears twice in the machinefile) but a00553 has 20 slots (since it
>>>>> appears only once in the machinefile, the number of slots is automatically
>>>>> detected).
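>>>>> If you want the slot counts to be explicit, you could also write a hostfile
>>>>> along these lines (just a sketch):
>>>>> 
>>>>> a00551.science.domain slots=2
>>>>> a00553.science.domain slots=1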
>>>>> 
>>>>> Can you run
>>>>> mpirun --mca plm_base_verbose 10 ...
>>>>> so we can confirm tm is used?
>>>>> 
>>>>> Before invoking mpirun, you might want to clean up the ompi directory in
>>>>> /tmp.
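>>>>> For example (the session directory name varies between Open MPI versions, so
>>>>> list what is actually there before removing anything):
>>>>> 
>>>>> ls -d /tmp/openmpi-sessions-* /tmp/ompi.* 2>/dev/null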
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Gilles
>>>>> 
>>>>> Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I am currently trying to set up Open MPI under Torque. Open MPI is built
>>>>>> with tm support. Torque is correctly assigning nodes and I can run MPI
>>>>>> programs on single nodes just fine. The problem starts when processes are
>>>>>> split between nodes.
>>>>>> 
>>>>>> For example, I create an interactive session with Torque and start a
>>>>>> program by
>>>>>> 
>>>>>> qsub -I -n -l nodes=3:ppn=1
>>>>>> mpirun --tag-output -display-map hostname
>>>>>> 
>>>>>> which leads to
>>>>>> [a00551.science.domain:15932] [[65415,0],1] bind() failed on error
>>>>>> Address already in use (98)
>>>>>> [a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in
>>>>>> file oob_usock_component.c at line 228
>>>>>> Data for JOB [65415,1] offset 0
>>>>>> 
>>>>>> ========================   JOB MAP   ========================
>>>>>> 
>>>>>> Data for node: a00551    Num slots: 2    Max slots: 0    Num procs: 2
>>>>>>  Process OMPI jobid: [65415,1] App: 0 Process rank: 0 Bound: socket
>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>  Process OMPI jobid: [65415,1] App: 0 Process rank: 1 Bound: socket
>>>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>> 
>>>>>> Data for node: a00553.science.domain     Num slots: 1    Max slots: 0    Num procs: 1
>>>>>>  Process OMPI jobid: [65415,1] App: 0 Process rank: 2 Bound: socket
>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>> 
>>>>>> =============================================================
>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>> [1,2]<stdout>:a00551.science.domain
>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>> 
>>>>>> 
>>>>>> If I log in to a00551 and start mpirun using the hostfile generated by
>>>>>> PBS_NODEFILE, everything works:
>>>>>> 
>>>>>> (from within the interactive session)
>>>>>> echo $PBS_NODEFILE
>>>>>> /var/lib/torque/aux//278.a00552.science.domain
>>>>>> cat $PBS_NODEFILE
>>>>>> a00551.science.domain
>>>>>> a00553.science.domain
>>>>>> a00551.science.domain
>>>>>> 
>>>>>> (from within the separate login)
>>>>>> mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3
>>>>>> --tag-output -display-map hostname
>>>>>> 
>>>>>> Data for JOB [65445,1] offset 0
>>>>>> 
>>>>>> ========================   JOB MAP   ========================
>>>>>> 
>>>>>> Data for node: a00551    Num slots: 2    Max slots: 0    Num procs: 2
>>>>>>  Process OMPI jobid: [65445,1] App: 0 Process rank: 0 Bound: socket
>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>>  Process OMPI jobid: [65445,1] App: 0 Process rank: 1 Bound: socket
>>>>>> 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt
>>>>>> 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket
>>>>>> 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt
>>>>>> 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt
>>>>>> 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
>>>>>> 
>>>>>> Data for node: a00553.science.domain     Num slots: 20   Max slots: 0    Num procs: 1
>>>>>>  Process OMPI jobid: [65445,1] App: 0 Process rank: 2 Bound: socket
>>>>>> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt
>>>>>> 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket
>>>>>> 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt
>>>>>> 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt
>>>>>> 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
>>>>>> 
>>>>>> =============================================================
>>>>>> [1,0]<stdout>:a00551.science.domain
>>>>>> [1,2]<stdout>:a00553.science.domain
>>>>>> [1,1]<stdout>:a00551.science.domain
>>>>>> 
>>>>>> I am kind of lost as to what is going on here. Does anyone have an idea? I
>>>>>> seriously suspect the Kerberos authentication that we have to work with,
>>>>>> but I fail to see how that should affect the sockets.
>>>>>> 
>>>>>> Best,
>>>>>> Oswin

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
