Hi,

I'm having trouble using map-by socket on remote nodes.

Running on the same node as mpirun works fine (except for that spurious 
debugging line):

$ mpirun -H localhost:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
[raijin7:22248] SETTING BINDING TO CORE
 Data for JOB [11140,1] offset 0 Total slots allocated 16

 ========================   JOB MAP   ========================

 Data for node: raijin7 Num slots: 16   Max slots: 0    Num procs: 4
        Process OMPI jobid: [11140,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 
0[core 3[hwt 0]]:[B/B/B/B/./././.][./././././././.]
        Process OMPI jobid: [11140,1] App: 0 Process rank: 1 Bound: socket 
0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 
0[core 7[hwt 0]]:[././././B/B/B/B][./././././././.]
        Process OMPI jobid: [11140,1] App: 0 Process rank: 2 Bound: socket 
1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 
1[core 11[hwt 0]]:[./././././././.][B/B/B/B/./././.]
        Process OMPI jobid: [11140,1] App: 0 Process rank: 3 Bound: socket 
1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 
1[core 15[hwt 0]]:[./././././././.][././././B/B/B/B]

 =============================================================
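For reference, the binding pattern above is what you'd expect from the mapping arithmetic. Here's a small illustrative sketch (not Open MPI's actual mapper code) that computes the core set each rank should get from "-map-by ppr:2:socket:PE=4" on a node with 2 sockets of 8 cores, matching the job map shown:

```python
# Illustrative only: reproduce the expected binding for
# "-map-by ppr:2:socket:PE=4" on a 2-socket, 8-cores-per-socket node.

SOCKETS = 2
CORES_PER_SOCKET = 8
PROCS_PER_SOCKET = 2   # the "2" in ppr:2:socket
PES_PER_PROC = 4       # the "4" in PE=4

def expected_binding(rank):
    """Return (socket, [cores]) that the given rank should be bound to."""
    socket = rank // PROCS_PER_SOCKET          # fill socket 0 first, then 1
    slot = rank % PROCS_PER_SOCKET             # position within the socket
    first = socket * CORES_PER_SOCKET + slot * PES_PER_PROC
    return socket, list(range(first, first + PES_PER_PROC))

for rank in range(SOCKETS * PROCS_PER_SOCKET):
    sock, cores = expected_binding(rank)
    print(f"rank {rank}: socket {sock}, cores {cores}")
```

This prints socket 0/cores 0-3 for rank 0 through socket 1/cores 12-15 for rank 3, i.e. exactly the local job map above, which is what makes the remote-node failure below look like a mapper bug rather than a resource problem.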
But the same command on a remote node fails in a rather odd fashion:

$ mpirun -H r1:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
[raijin7:22291] SETTING BINDING TO CORE
[r1:10565] SETTING BINDING TO CORE
 Data for JOB [10879,1] offset 0 Total slots allocated 32

 ========================   JOB MAP   ========================

 Data for node: r1      Num slots: 16   Max slots: 0    Num procs: 4
        Process OMPI jobid: [10879,1] App: 0 Process rank: 0 Bound: N/A
        Process OMPI jobid: [10879,1] App: 0 Process rank: 1 Bound: N/A
        Process OMPI jobid: [10879,1] App: 0 Process rank: 2 Bound: N/A
        Process OMPI jobid: [10879,1] App: 0 Process rank: 3 Bound: N/A

 =============================================================
--------------------------------------------------------------------------
The request to bind processes could not be completed due to
an internal error - the locale of the following process was
not set by the mapper code:

  Process:  [[10879,1],2]

Please contact the OMPI developers for assistance. Meantime,
you will still be able to run your application without binding
by specifying "--bind-to none" on your command line.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[10879,0],0] on node raijin7
  Remote daemon: [[10879,0],1] on node r1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

On the other hand, mapping by node works fine...

$ mpirun -H r1:16 -map-by ppr:4:node:PE=4 -display-map /bin/true
[raijin7:22668] SETTING BINDING TO CORE
[r1:10777] SETTING BINDING TO CORE
 Data for JOB [9696,1] offset 0 Total slots allocated 32

 ========================   JOB MAP   ========================

 Data for node: r1      Num slots: 16   Max slots: 0    Num procs: 4
        Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: N/A
        Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: N/A
        Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: N/A
        Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: N/A

 =============================================================
 Data for JOB [9696,1] offset 0 Total slots allocated 32

 ========================   JOB MAP   ========================

 Data for node: r1      Num slots: 16   Max slots: 0    Num procs: 4
        Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], 
socket 0[core 3[hwt 0-1]]:[BB/BB/BB/BB/../../../..][../../../../../../../..]
        Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: socket 
0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], 
socket 0[core 7[hwt 0-1]]:[../../../../BB/BB/BB/BB][../../../../../../../..]
        Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: socket 
1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], 
socket 1[core 11[hwt 0-1]]:[../../../../../../../..][BB/BB/BB/BB/../../../..]
        Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: socket 
1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], 
socket 1[core 15[hwt 0-1]]:[../../../../../../../..][../../../../BB/BB/BB/BB]

 =============================================================

Cheers,
Ben

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
