Could you please send the output of “lstopo --of xml foo.xml” (i.e., the file
foo.xml) so I can try to replicate this here?
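For reference, a quick way to produce and sanity-check that file (assuming hwloc's
lstopo is available on the compute node) would be something like:

    lstopo --of xml foo.xml    # dump this node's topology to foo.xml
    lstopo --input foo.xml     # optionally re-render the saved XML to verify it looks right

That captures the node's topology as hwloc sees it, which is what I need to try to
reproduce the placement here.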


> On Sep 4, 2018, at 12:35 PM, Shrader, David Lee <dshra...@lanl.gov> wrote:
> 
> Hello,
> 
> I have run this issue by Howard, and he asked me to forward it to the Open MPI 
> devel mailing list. I get an error when using PE=n with '--map-by numa' (without 
> span) on more than one node:
> 
> [dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4 --bind-to 
> core --report-bindings true
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
> 
>    Bind to:     CORE
>    Node:        ba001
>    #processes:  2
>    #cpus:       1
> 
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
> 
> The exact values passed to -n and PE don't really matter; the error appears as 
> soon as they are combined in such a way that an MPI rank lands on the second node.
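> 
> (For completeness: the override suggested in that message would presumably be 
> written as a binding qualifier, something like
> 
>     mpirun -n 16 --map-by numa:PE=4 --bind-to core:overload-allowed --report-bindings true
> 
> but that only allows cores to be oversubscribed rather than spreading ranks onto 
> the second node, so it is not what I am after.)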
> 
> If I add the "span" parameter, everything works as expected:
> 
> [dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4,span 
> --bind-to core --report-bindings true
> [ba002.localdomain:58502] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: 
> [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 9 bound to socket 0[core 4[hwt 0]], socket 
> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: 
> [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 10 bound to socket 0[core 8[hwt 0]], 
> socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: 
> [././././././././B/B/B/B/./././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 11 bound to socket 0[core 12[hwt 0]], 
> socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: 
> [././././././././././././B/B/B/B/./.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 12 bound to socket 1[core 18[hwt 0]], 
> socket 1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: 
> [./././././././././././././././././.][B/B/B/B/./././././././././././././.]
> [ba002.localdomain:58502] MCW rank 13 bound to socket 1[core 22[hwt 0]], 
> socket 1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: 
> [./././././././././././././././././.][././././B/B/B/B/./././././././././.]
> [ba002.localdomain:58502] MCW rank 14 bound to socket 1[core 26[hwt 0]], 
> socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: 
> [./././././././././././././././././.][././././././././B/B/B/B/./././././.]
> [ba002.localdomain:58502] MCW rank 15 bound to socket 1[core 30[hwt 0]], 
> socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]: 
> [./././././././././././././././././.][././././././././././././B/B/B/B/./.]
> [ba001.localdomain:11700] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: 
> [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 
> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: 
> [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 2 bound to socket 0[core 8[hwt 0]], socket 
> 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]: 
> [././././././././B/B/B/B/./././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 3 bound to socket 0[core 12[hwt 0]], 
> socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]: 
> [././././././././././././B/B/B/B/./.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 4 bound to socket 1[core 18[hwt 0]], 
> socket 1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]: 
> [./././././././././././././././././.][B/B/B/B/./././././././././././././.]
> [ba001.localdomain:11700] MCW rank 5 bound to socket 1[core 22[hwt 0]], 
> socket 1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]: 
> [./././././././././././././././././.][././././B/B/B/B/./././././././././.]
> [ba001.localdomain:11700] MCW rank 6 bound to socket 1[core 26[hwt 0]], 
> socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]: 
> [./././././././././././././././././.][././././././././B/B/B/B/./././././.]
> [ba001.localdomain:11700] MCW rank 7 bound to socket 1[core 30[hwt 0]], 
> socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]: 
> [./././././././././././././././././.][././././././././././././B/B/B/B/./.]
> 
> I would have expected the first command to work in the sense that processes 
> are at least mapped and bound somewhere across the two nodes; is there a 
> particular reason why that doesn't happen?
> 
> The above examples use Open MPI 3.1.2, configured with only "--prefix". I am 
> running on two nodes, each with two sockets and 18 cores per socket (36 cores per 
> node, no hyper-threading). hwloc reports that a NUMA domain is equivalent to a 
> socket on these hosts (so replacing "numa" with "socket" in the above examples 
> exhibits the same behavior for me). The interconnect is Omni-Path.
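> 
> (That layout can also be checked with hwloc directly; for example,
> 
>     lstopo-no-graphics --no-io
> 
> should show one NUMANode per socket with 18 cores each on these nodes.)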
> 
> Thank you very much for your time,
> David
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
