Found the bug - see https://github.com/open-mpi/ompi/pull/4291

Will PR the fix for the next 3.0.x release.
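
In the meantime, the help text quoted below already points at a workaround: running without binding should let the job start. A possible (untested) invocation, dropping the PE=4 modifier since it implies binding to cores (see the "SETTING BINDING TO CORE" lines below):

$ mpirun -H r1:16 -map-by ppr:2:socket -bind-to none -display-map /bin/true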

> On Oct 2, 2017, at 9:55 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
> 
> Hi,
> 
> I'm having trouble using map-by socket on remote nodes.
> 
> Running on the same node as mpirun works fine (except for that spurious 
> debugging line):
> 
> $ mpirun -H localhost:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
> [raijin7:22248] SETTING BINDING TO CORE
>  Data for JOB [11140,1] offset 0 Total slots allocated 16
> 
>  ========================   JOB MAP   ========================
> 
>  Data for node: raijin7       Num slots: 16   Max slots: 0    Num procs: 4
>       Process OMPI jobid: [11140,1] App: 0 Process rank: 0 Bound: socket 
> 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 
> 0[core 3[hwt 0]]:[B/B/B/B/./././.][./././././././.]
>       Process OMPI jobid: [11140,1] App: 0 Process rank: 1 Bound: socket 
> 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 
> 0[core 7[hwt 0]]:[././././B/B/B/B][./././././././.]
>       Process OMPI jobid: [11140,1] App: 0 Process rank: 2 Bound: socket 
> 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 
> 1[core 11[hwt 0]]:[./././././././.][B/B/B/B/./././.]
>       Process OMPI jobid: [11140,1] App: 0 Process rank: 3 Bound: socket 
> 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 
> 1[core 15[hwt 0]]:[./././././././.][././././B/B/B/B]
> 
>  =============================================================
> But the same on a remote node fails in a rather odd fashion:
> 
> $ mpirun -H r1:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
> [raijin7:22291] SETTING BINDING TO CORE
> [r1:10565] SETTING BINDING TO CORE
>  Data for JOB [10879,1] offset 0 Total slots allocated 32
> 
>  ========================   JOB MAP   ========================
> 
>  Data for node: r1    Num slots: 16   Max slots: 0    Num procs: 4
>       Process OMPI jobid: [10879,1] App: 0 Process rank: 0 Bound: N/A
>       Process OMPI jobid: [10879,1] App: 0 Process rank: 1 Bound: N/A
>       Process OMPI jobid: [10879,1] App: 0 Process rank: 2 Bound: N/A
>       Process OMPI jobid: [10879,1] App: 0 Process rank: 3 Bound: N/A
> 
>  =============================================================
> --------------------------------------------------------------------------
> The request to bind processes could not be completed due to
> an internal error - the locale of the following process was
> not set by the mapper code:
> 
>   Process:  [[10879,1],2]
> 
> Please contact the OMPI developers for assistance. Meantime,
> you will still be able to run your application without binding
> by specifying "--bind-to none" on your command line.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> ORTE has lost communication with a remote daemon.
> 
>   HNP daemon   : [[10879,0],0] on node raijin7
>   Remote daemon: [[10879,0],1] on node r1
> 
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
> 
> On the other hand, mapping by node works fine...
> 
> $ mpirun -H r1:16 -map-by ppr:4:node:PE=4 -display-map /bin/true
> [raijin7:22668] SETTING BINDING TO CORE
> [r1:10777] SETTING BINDING TO CORE
>  Data for JOB [9696,1] offset 0 Total slots allocated 32
> 
>  ========================   JOB MAP   ========================
> 
>  Data for node: r1    Num slots: 16   Max slots: 0    Num procs: 4
>       Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: N/A
>       Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: N/A
>       Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: N/A
>       Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: N/A
> 
>  =============================================================
>  Data for JOB [9696,1] offset 0 Total slots allocated 32
> 
>  ========================   JOB MAP   ========================
> 
>  Data for node: r1    Num slots: 16   Max slots: 0    Num procs: 4
>       Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: socket 
> 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], 
> socket 0[core 3[hwt 0-1]]:[BB/BB/BB/BB/../../../..][../../../../../../../..]
>       Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: socket 
> 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], 
> socket 0[core 7[hwt 0-1]]:[../../../../BB/BB/BB/BB][../../../../../../../..]
>       Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: socket 
> 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], 
> socket 1[core 11[hwt 0-1]]:[../../../../../../../..][BB/BB/BB/BB/../../../..]
>       Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: socket 
> 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], 
> socket 1[core 15[hwt 0-1]]:[../../../../../../../..][../../../../BB/BB/BB/BB]
> 
>  =============================================================
> 
> Cheers,
> Ben
> 

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
