Hi, I'm having trouble using map by socket on remote nodes.
Running on the same node as mpirun works fine (except for that spurious debugging line):

$ mpirun -H localhost:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
[raijin7:22248] SETTING BINDING TO CORE
Data for JOB [11140,1] offset 0 Total slots allocated 16
======================== JOB MAP ========================
Data for node: raijin7 Num slots: 16 Max slots: 0 Num procs: 4
    Process OMPI jobid: [11140,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:[B/B/B/B/./././.][./././././././.]
    Process OMPI jobid: [11140,1] App: 0 Process rank: 1 Bound: socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:[././././B/B/B/B][./././././././.]
    Process OMPI jobid: [11140,1] App: 0 Process rank: 2 Bound: socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:[./././././././.][B/B/B/B/./././.]
    Process OMPI jobid: [11140,1] App: 0 Process rank: 3 Bound: socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:[./././././././.][././././B/B/B/B]
=============================================================

But the same on a remote node fails in a rather odd fashion:

$ mpirun -H r1:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
[raijin7:22291] SETTING BINDING TO CORE
[r1:10565] SETTING BINDING TO CORE
Data for JOB [10879,1] offset 0 Total slots allocated 32
======================== JOB MAP ========================
Data for node: r1 Num slots: 16 Max slots: 0 Num procs: 4
    Process OMPI jobid: [10879,1] App: 0 Process rank: 0 Bound: N/A
    Process OMPI jobid: [10879,1] App: 0 Process rank: 1 Bound: N/A
    Process OMPI jobid: [10879,1] App: 0 Process rank: 2 Bound: N/A
    Process OMPI jobid: [10879,1] App: 0 Process rank: 3 Bound: N/A
=============================================================
--------------------------------------------------------------------------
The request to bind processes could not be completed due to
an internal error - the locale of the following process was
not set by the mapper code:

  Process: [[10879,1],2]

Please contact the OMPI developers for assistance. Meantime,
you will still be able to run your application without binding
by specifying "--bind-to none" on your command line.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[10879,0],0] on node raijin7
  Remote daemon: [[10879,0],1] on node r1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

On the other hand, mapping by node works fine...

> mpirun -H r1:16 -map-by ppr:4:node:PE=4 -display-map /bin/true
[raijin7:22668] SETTING BINDING TO CORE
[r1:10777] SETTING BINDING TO CORE
Data for JOB [9696,1] offset 0 Total slots allocated 32
======================== JOB MAP ========================
Data for node: r1 Num slots: 16 Max slots: 0 Num procs: 4
    Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: N/A
    Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: N/A
    Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: N/A
    Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: N/A
=============================================================
Data for JOB [9696,1] offset 0 Total slots allocated 32
======================== JOB MAP ========================
Data for node: r1 Num slots: 16 Max slots: 0 Num procs: 4
    Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]:[BB/BB/BB/BB/../../../..][../../../../../../../..]
    Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]:[../../../../BB/BB/BB/BB][../../../../../../../..]
    Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]:[../../../../../../../..][BB/BB/BB/BB/../../../..]
    Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]:[../../../../../../../..][../../../../BB/BB/BB/BB]
=============================================================

Cheers,
Ben
_______________________________________________ devel mailing list devel@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/devel