Could you please send the output from “lstopo --of xml foo.xml” (the file foo.xml) so I can try to replicate here?
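For reference, something along these lines on one of the affected compute nodes should produce the file (the ba001 prompt and the foo.xml name are just taken from your report; any output file name is fine):

  [dshrader@ba001 ~]$ lstopo --of xml foo.xml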
> On Sep 4, 2018, at 12:35 PM, Shrader, David Lee <dshra...@lanl.gov> wrote:
>
> Hello,
>
> I have run this issue by Howard, and he asked me to forward it on to the Open MPI devel mailing list. I get an error when trying to use PE=n with '--map-by numa' without span when using more than one node:
>
> [dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4 --bind-to core --report-bindings true
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        ba001
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
>
> The absolute values of the numbers passed to -n and PE don't really matter; the error pops up as soon as those numbers are combined in such a way that an MPI rank ends up on the second node.
>
> If I add the "span" parameter, everything works as expected:
>
> [dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4,span --bind-to core --report-bindings true
> [ba002.localdomain:58502] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 9 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 10 bound to socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: [././././././././B/B/B/B/./././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 11 bound to socket 0[core 12[hwt 0]], socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: [././././././././././././B/B/B/B/./.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 12 bound to socket 1[core 18[hwt 0]], socket 1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: [./././././././././././././././././.][B/B/B/B/./././././././././././././.]
> [ba002.localdomain:58502] MCW rank 13 bound to socket 1[core 22[hwt 0]], socket 1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: [./././././././././././././././././.][././././B/B/B/B/./././././././././.]
> [ba002.localdomain:58502] MCW rank 14 bound to socket 1[core 26[hwt 0]], socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: [./././././././././././././././././.][././././././././B/B/B/B/./././././.]
> [ba002.localdomain:58502] MCW rank 15 bound to socket 1[core 30[hwt 0]], socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]: [./././././././././././././././././.][././././././././././././B/B/B/B/./.]
> [ba001.localdomain:11700] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 2 bound to socket 0[core 8[hwt 0]], socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: [././././././././B/B/B/B/./././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 3 bound to socket 0[core 12[hwt 0]], socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: [././././././././././././B/B/B/B/./.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 4 bound to socket 1[core 18[hwt 0]], socket 1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: [./././././././././././././././././.][B/B/B/B/./././././././././././././.]
> [ba001.localdomain:11700] MCW rank 5 bound to socket 1[core 22[hwt 0]], socket 1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: [./././././././././././././././././.][././././B/B/B/B/./././././././././.]
> [ba001.localdomain:11700] MCW rank 6 bound to socket 1[core 26[hwt 0]], socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: [./././././././././././././././././.][././././././././B/B/B/B/./././././.]
> [ba001.localdomain:11700] MCW rank 7 bound to socket 1[core 30[hwt 0]], socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]: [./././././././././././././././././.][././././././././././././B/B/B/B/./.]
>
> I would have expected the first command to work in the sense that processes are at least mapped and bound somewhere across the two nodes; is there a particular reason why that doesn't happen?
>
> I am using Open MPI 3.1.2 in the above examples, configured with only "--prefix". I am running on two nodes that each have two sockets with 18 processors per socket (36 processors per node, no hyper-threading). Hwloc reports that the NUMA domain is equivalent to a socket on these hosts (thus, replacing "numa" with "socket" in the above examples exhibits the same behavior for me). The interconnect is Omni-Path.
>
> Thank you very much for your time,
> David
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel