Hi,
I'm having trouble using map-by socket on remote nodes.
Running on the same node as mpirun works fine (except for that spurious
debugging line):
$ mpirun -H localhost:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
[raijin7:22248] SETTING BINDING TO CORE
Data for JOB [11140,1] offset 0 Total slots allocated 16
======================== JOB MAP ========================
Data for node: raijin7 Num slots: 16 Max slots: 0 Num procs: 4
Process OMPI jobid: [11140,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:[B/B/B/B/./././.][./././././././.]
Process OMPI jobid: [11140,1] App: 0 Process rank: 1 Bound: socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:[././././B/B/B/B][./././././././.]
Process OMPI jobid: [11140,1] App: 0 Process rank: 2 Bound: socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:[./././././././.][B/B/B/B/./././.]
Process OMPI jobid: [11140,1] App: 0 Process rank: 3 Bound: socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:[./././././././.][././././B/B/B/B]
=============================================================
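For anyone reading the maps: in the bracketed mask at the end of each "Bound:" entry, each [...] group is a socket, each slash-separated field is a core, and a B marks a hardware thread the process is bound to. A minimal Python sketch for decoding that mask follows; the helper name bound_cores is hypothetical (it is not part of Open MPI), and the indices it returns are per-socket positions, not the global core numbers shown in the text.

```python
import re

def bound_cores(mask):
    """Hypothetical helper: for each socket in a binding mask, return the
    per-socket indices of cores that have at least one bound hwthread."""
    # Each socket is one [...] group; cores inside it are separated by "/".
    sockets = re.findall(r"\[([^\[\]]*)\]", mask)
    return [
        [i for i, core in enumerate(sock.split("/")) if "B" in core]
        for sock in sockets
    ]

if __name__ == "__main__":
    # Mask for rank 1 in the local run above.
    print(bound_cores("[././././B/B/B/B][./././././././.]"))
```

For rank 1 this gives cores 4-7 on the first socket and nothing on the second, which matches the textual binding list.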
But the same command on a remote node fails in a rather odd fashion:
$ mpirun -H r1:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
[raijin7:22291] SETTING BINDING TO CORE
[r1:10565] SETTING BINDING TO CORE
Data for JOB [10879,1] offset 0 Total slots allocated 32
======================== JOB MAP ========================
Data for node: r1 Num slots: 16 Max slots: 0 Num procs: 4
Process OMPI jobid: [10879,1] App: 0 Process rank: 0 Bound: N/A
Process OMPI jobid: [10879,1] App: 0 Process rank: 1 Bound: N/A
Process OMPI jobid: [10879,1] App: 0 Process rank: 2 Bound: N/A
Process OMPI jobid: [10879,1] App: 0 Process rank: 3 Bound: N/A
=============================================================
--------------------------------------------------------------------------
The request to bind processes could not be completed due to
an internal error - the locale of the following process was
not set by the mapper code:
Process: [[10879,1],2]
Please contact the OMPI developers for assistance. Meantime,
you will still be able to run your application without binding
by specifying "--bind-to none" on your command line.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[10879,0],0] on node raijin7
Remote daemon: [[10879,0],1] on node r1
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
On the other hand, mapping by node works fine...
$ mpirun -H r1:16 -map-by ppr:4:node:PE=4 -display-map /bin/true
[raijin7:22668] SETTING BINDING TO CORE
[r1:10777] SETTING BINDING TO CORE
Data for JOB [9696,1] offset 0 Total slots allocated 32
======================== JOB MAP ========================
Data for node: r1 Num slots: 16 Max slots: 0 Num procs: 4
Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: N/A
Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: N/A
Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: N/A
Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: N/A
=============================================================
Data for JOB [9696,1] offset 0 Total slots allocated 32
======================== JOB MAP ========================
Data for node: r1 Num slots: 16 Max slots: 0 Num procs: 4
Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]:[BB/BB/BB/BB/../../../..][../../../../../../../..]
Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]:[../../../../BB/BB/BB/BB][../../../../../../../..]
Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]:[../../../../../../../..][BB/BB/BB/BB/../../../..]
Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]:[../../../../../../../..][../../../../BB/BB/BB/BB]
=============================================================
Cheers,
Ben