While investigating the ongoing issue with the OMPI messaging layer, I ran into some trouble with process binding. I read the documentation, but I still find this puzzling.
Disclaimer: all experiments were done with current master (9c496f7) compiled in optimized mode. The hardware: a single 20-core node, dual-socket Xeon E5-2650 v3 (the hwloc-ls output is at the end of this email).

First and foremost, trying to bind to NUMA nodes was a sure way to get a segfault:

$ mpirun -np 2 --mca btl vader,self --bind-to numa --report-bindings true
--------------------------------------------------------------------------
No objects of the specified type were found on at least one node:
  Type: NUMANode
  Node: arc00
The map cannot be done as specified.
--------------------------------------------------------------------------
[dancer:32162] *** Process received signal ***
[dancer:32162] Signal: Segmentation fault (11)
[dancer:32162] Signal code: Address not mapped (1)
[dancer:32162] Failing at address: 0x3c
[dancer:32162] [ 0] /lib64/libpthread.so.0[0x3126a0f7e0]
[dancer:32162] [ 1] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(+0x560e0)[0x7f9075bc60e0]
[dancer:32162] [ 2] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_grpcomm_API_xcast+0x84)[0x7f9075bc6f54]
[dancer:32162] [ 3] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_plm_base_orted_exit+0x1a8)[0x7f9075bd9308]
[dancer:32162] [ 4] /home/bosilca/opt/trunk/fast/lib/openmpi/mca_plm_rsh.so(+0x384e)[0x7f907361284e]
[dancer:32162] [ 5] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_state_base_check_all_complete+0x324)[0x7f9075bedca4]
[dancer:32162] [ 6] /home/bosilca/opt/trunk/fast/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7f90758eafec]
[dancer:32162] [ 7] mpirun[0x401251]
[dancer:32162] [ 8] mpirun[0x400e24]
[dancer:32162] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x312621ed1d]
[dancer:32162] [10] mpirun[0x400d49]
[dancer:32162] *** End of error message ***
Segmentation fault

As you can see in the hwloc output below, there are 2 NUMA nodes on this node and hwloc correctly identifies them, which makes the OMPI error message confusing. In any case, we should not segfault but report a more meaningful error message.

Binding to slot (I got this from the man page for 2.0) is apparently no longer supported. Reminder: we should update the man page accordingly.

Trying to bind to core looks better; the application at least starts. Unfortunately, the reported bindings (or at least my understanding of them) are troubling. Assuming that the way we report the bindings is correct, why is each of my processes assigned to 2 cores far apart from each other?

$ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
[arc00:39350] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:39350] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]

Maybe because I only used the binding option. Adding the mapping to the mix (the --map-by option) seems hopeless; the binding remains unchanged for 2 processes.

$ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
[arc00:40401] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:40401] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
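To separate a reporting problem from an actual binding problem, I would have each rank print the affinity mask the kernel actually enforces and compare it with what --report-bindings claims. A minimal, Linux-specific sketch (I did not run this as part of the experiments above; the file name and build line are placeholders):

/* affinity_check.c: each rank prints the OS-level CPU affinity it ended up with.
   Build (assumed): mpicc affinity_check.c -o affinity_check */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, cpu, off = 0;
    cpu_set_t mask;
    char list[1024] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        /* Collect the OS (physical) PU indices this process is allowed to run on. */
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask) && off < (int)sizeof(list) - 8)
                off += snprintf(list + off, sizeof(list) - off, " %d", cpu);
        printf("rank %d (pid %d) allowed PUs:%s\n", rank, (int)getpid(), list);
    }

    MPI_Finalize();
    return 0;
}

Running it with something like "mpirun -np 2 --bind-to core ./affinity_check" prints physical PU numbers, which have to be read against the hwloc numbering shown below (the two hardware threads of core 0 are P#0 and P#20, for example). If the OS-level lists match the double-core masks reported above, the binding itself is off; if not, it is the reporting.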
At this point I really wondered what was going on. To clarify, I tried launching 3 processes on the node. Bummer! The reported bindings show that one of my processes got assigned to cores on different sockets.

$ mpirun -np 3 --mca btl vader,self --bind-to core --report-bindings true
[arc00:40311] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:40311] MCW rank 2 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
[arc00:40311] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 1[core 12[hwt 0]]: [../../../../B./../../../../..][../../B./../../../../../../..]

Why is rank 1 on core 4 and rank 2 on core 1? Maybe specifying the mapping will help. Will I get a more sensible binding (as suggested by our online documentation and the man pages)?

$ mpirun -np 3 --mca btl vader,self --bind-to core --map-by core --report-bindings true
[arc00:40254] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:40254] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
[arc00:40254] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]

There is a difference: the logical order of the ranks is now respected, but one of my processes is still bound to 2 cores on different sockets, and these cores differ from the case where no mapping was specified. Trying to bind to sockets gave me an even more confusing outcome.

So I went the hard way: what can go wrong if I manually define the binding via a rankfile? Fail! My processes continue to report unsettling bindings (there is some relationship with my rank file, but most of the issues reported above remain). A variant of the rankfile using the socket:core syntax is sketched after the PS below.

$ more rankfile
rank 0=arc00 slot=0
rank 1=arc00 slot=2
$ mpirun -np 2 --mca btl vader,self -rf rankfile --report-bindings true
[arc00:40718] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:40718] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]

At this point I am pretty much completely confused about how OMPI binding works. I'm counting on a good Samaritan to explain how this works.

Thanks,
George.

PS: The rankfile feature of using relative hostnames (+n?) seems to be busted, as the example from the man page leads to the following complaint:

--------------------------------------------------------------------------
A relative host was specified, but no prior allocation has been made. Thus, there is no way to determine the proper host to be used.

hostfile entry: +n0

Please see the orte_hosts man page for further information.
--------------------------------------------------------------------------
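For reference, here is the rankfile variant I would try next, using the <socket>:<core> form of the slot= field from the mpirun man page (untested here, so the exact syntax may need checking):

rank 0=arc00 slot=0:0
rank 1=arc00 slot=0:1

If the double-core binding still shows up with an explicit socket and core per rank, that would point at the binding code itself rather than at the rankfile parsing.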
$ hwloc-ls
Machine (63GB)
  NUMANode L#0 (P#0 31GB)
    Socket L#0 + L3 L#0 (25MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#20)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#21)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#22)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#23)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#24)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#25)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#26)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#27)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#28)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#29)
  NUMANode L#1 (P#1 31GB)
    Socket L#1 + L3 L#1 (25MB)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#30)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#31)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#32)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#33)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#34)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#35)
      L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#36)
      L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#37)
      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#38)
      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#39)
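In case it helps anyone reading the topology above, the core-to-PU pairing (Core L#0 carrying PU P#0 and P#20, i.e. hyperthread siblings numbered 20 apart) can also be listed programmatically. A small hwloc sketch, purely for illustration and assuming the hwloc development headers are installed:

/* topo_pus.c: list each core's logical index and the OS indices of its PUs.
   Build (assumed): cc topo_pus.c -o topo_pus -lhwloc */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    int i, ncores;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        hwloc_obj_t pu = NULL;
        printf("Core L#%u:", core->logical_index);
        /* Walk the PUs (hardware threads) contained in this core's cpuset. */
        while ((pu = hwloc_get_next_obj_inside_cpuset_by_type(topo, core->cpuset,
                                                              HWLOC_OBJ_PU, pu)) != NULL)
            printf(" P#%u", pu->os_index);
        printf("\n");
    }

    hwloc_topology_destroy(topo);
    return 0;
}

On this node it should print lines like "Core L#0: P#0 P#20", which is the numbering the binding maps above have to be matched against.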