Could you please send the output from "lstopo --of xml foo.xml" (that is, the
file foo.xml) so I can try to replicate this here?

> On Sep 4, 2018, at 12:35 PM, Shrader, David Lee wrote:
>
> Hello,
>
> I have run this issue by Howard, and he asked me to forward it to the Open
> MPI devel mailing list. When running on more than one node, I get an error
> if I use PE=n with '--map-by numa' without the span modifier:
>
> [dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4 --bind-to core --report-bindings true
> --
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
>    Bind to:     CORE
>    Node:        ba001
>    #processes:  2
>    #cpus:       1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --
>
> The absolute values of the numbers passed to -n and PE don't really matter;
> the error pops up as soon as those numbers are combined in such a way that an
> MPI rank ends up on the second node.
>
> If I add the "span" parameter, everything works as expected:
>
> [dshrader@ba001 openmpi-3.1.2]$ mpirun -n 16 --map-by numa:PE=4,span --bind-to core --report-bindings true
> [ba002.localdomain:58502] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 9 bound to socket 0[core 4[hwt 0]], socket
> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
> [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 10 bound to socket 0[core 8[hwt 0]],
> socket 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]:
> [././././././././B/B/B/B/./././././.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 11 bound to socket 0[core 12[hwt 0]],
> socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]:
> [././././././././././././B/B/B/B/./.][./././././././././././././././././.]
> [ba002.localdomain:58502] MCW rank 12 bound to socket 1[core 18[hwt 0]],
> socket 1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]:
> [./././././././././././././././././.][B/B/B/B/./././././././././././././.]
> [ba002.localdomain:58502] MCW rank 13 bound to socket 1[core 22[hwt 0]],
> socket 1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]:
> [./././././././././././././././././.][././././B/B/B/B/./././././././././.]
> [ba002.localdomain:58502] MCW rank 14 bound to socket 1[core 26[hwt 0]],
> socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]:
> [./././././././././././././././././.][././././././././B/B/B/B/./././././.]
> [ba002.localdomain:58502] MCW rank 15 bound to socket 1[core 30[hwt 0]],
> socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]:
> [./././././././././././././././././.][././././././././././././B/B/B/B/./.]
> [ba001.localdomain:11700] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]:
> [B/B/B/B/./././././././././././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket
> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]:
> [././././B/B/B/B/./././././././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 2 bound to socket 0[core 8[hwt 0]], socket
> 0[core 9[hwt 0]], socket 0[core 10[hwt 0]], socket 0[core 11[hwt 0]]:
> [././././././././B/B/B/B/./././././.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 3 bound to socket 0[core 12[hwt 0]],
> socket 0[core 13[hwt 0]], socket 0[core 14[hwt 0]], socket 0[core 15[hwt 0]]:
> [././././././././././././B/B/B/B/./.][./././././././././././././././././.]
> [ba001.localdomain:11700] MCW rank 4 bound to socket 1[core 18[hwt 0]],
> socket 1[core 19[hwt 0]], socket 1[core 20[hwt 0]], socket 1[core 21[hwt 0]]:
> [./././././././././././././././././.][B/B/B/B/./././././././././././././.]
> [ba001.localdomain:11700] MCW rank 5 bound to socket 1[core 22[hwt 0]],
> socket 1[core 23[hwt 0]], socket 1[core 24[hwt 0]], socket 1[core 25[hwt 0]]:
> [./././././././././././././././././.][././././B/B/B/B/./././././././././.]
> [ba001.localdomain:11700] MCW rank 6 bound to socket 1[core 26[hwt 0]],
> socket 1[core 27[hwt 0]], socket 1[core 28[hwt 0]], socket 1[core 29[hwt 0]]:
> [./././././././././././././././././.][././././././././B/B/B/B/./././././.]
> [ba001.localdomain:11700] MCW rank 7 bound to socket 1[core 30[hwt 0]],
> socket 1[core 31[hwt 0]], socket 1[core 32[hwt 0]], socket 1[core 33[hwt 0]]:
> [.