Found the bug - see https://github.com/open-mpi/ompi/pull/4291
Will PR for the next 3.0.x release.
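In the meantime, the two workarounds already visible in this thread should unblock things (untested on my end): running without binding as the error message suggests, or mapping by node, which per the second example below binds correctly. Note that PE=n forces core binding (hence the "SETTING BINDING TO CORE" line), so --bind-to none likely needs the PE qualifier dropped, e.g.:

$ mpirun -H r1:16 -map-by ppr:2:socket --bind-to none /bin/true
$ mpirun -H r1:16 -map-by ppr:4:node:PE=4 -display-map /bin/true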

> On Oct 2, 2017, at 9:55 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>
> Hi,
>
> I'm having trouble using map-by socket on remote nodes.
>
> Running on the same node as mpirun works fine (except for that spurious
> debugging line):
>
> $ mpirun -H localhost:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
> [raijin7:22248] SETTING BINDING TO CORE
> Data for JOB [11140,1] offset 0 Total slots allocated 16
>
> ======================== JOB MAP ========================
>
> Data for node: raijin7  Num slots: 16  Max slots: 0  Num procs: 4
> Process OMPI jobid: [11140,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:[B/B/B/B/./././.][./././././././.]
> Process OMPI jobid: [11140,1] App: 0 Process rank: 1 Bound: socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:[././././B/B/B/B][./././././././.]
> Process OMPI jobid: [11140,1] App: 0 Process rank: 2 Bound: socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]:[./././././././.][B/B/B/B/./././.]
> Process OMPI jobid: [11140,1] App: 0 Process rank: 3 Bound: socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]:[./././././././.][././././B/B/B/B]
>
> =============================================================
>
> But the same on a remote node fails in a rather odd fashion:
>
> $ mpirun -H r1:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
> [raijin7:22291] SETTING BINDING TO CORE
> [r1:10565] SETTING BINDING TO CORE
> Data for JOB [10879,1] offset 0 Total slots allocated 32
>
> ======================== JOB MAP ========================
>
> Data for node: r1  Num slots: 16  Max slots: 0  Num procs: 4
> Process OMPI jobid: [10879,1] App: 0 Process rank: 0 Bound: N/A
> Process OMPI jobid: [10879,1] App: 0 Process rank: 1 Bound: N/A
> Process OMPI jobid: [10879,1] App: 0 Process rank: 2 Bound: N/A
> Process OMPI jobid: [10879,1] App: 0 Process rank: 3 Bound: N/A
>
> =============================================================
> --------------------------------------------------------------------------
> The request to bind processes could not be completed due to
> an internal error - the locale of the following process was
> not set by the mapper code:
>
>   Process: [[10879,1],2]
>
> Please contact the OMPI developers for assistance. Meantime,
> you will still be able to run your application without binding
> by specifying "--bind-to none" on your command line.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> ORTE has lost communication with a remote daemon.
>
>   HNP daemon   : [[10879,0],0] on node raijin7
>   Remote daemon: [[10879,0],1] on node r1
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
>
> On the other hand, mapping by node works fine...
>
> mpirun -H r1:16 -map-by ppr:4:node:PE=4 -display-map /bin/true
> [raijin7:22668] SETTING BINDING TO CORE
> [r1:10777] SETTING BINDING TO CORE
> Data for JOB [9696,1] offset 0 Total slots allocated 32
>
> ======================== JOB MAP ========================
>
> Data for node: r1  Num slots: 16  Max slots: 0  Num procs: 4
> Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: N/A
> Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: N/A
> Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: N/A
> Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: N/A
>
> =============================================================
>
> Data for JOB [9696,1] offset 0 Total slots allocated 32
>
> ======================== JOB MAP ========================
>
> Data for node: r1  Num slots: 16  Max slots: 0  Num procs: 4
> Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]:[BB/BB/BB/BB/../../../..][../../../../../../../..]
> Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]:[../../../../BB/BB/BB/BB][../../../../../../../..]
> Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]:[../../../../../../../..][BB/BB/BB/BB/../../../..]
> Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]:[../../../../../../../..][../../../../BB/BB/BB/BB]
>
> =============================================================
>
> Cheers,
> Ben
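For anyone finding this thread later: ppr:2:socket:PE=4 requests 2 processes per socket, each bound to 4 processing elements (cores here), so on these 2-socket, 16-core nodes it should produce exactly the four 4-core bindings shown in the local run above. Once the fix lands, a quick sanity check on a remote node would be something like the following (-report-bindings makes each daemon print the binding it actually applied):

$ mpirun -H r1:16 -map-by ppr:2:socket:PE=4 -report-bindings /bin/true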
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel