While investigating the ongoing issue with the OMPI messaging layer, I ran into some trouble with process binding. I read the documentation, but I still find this puzzling.
Disclaimer: all experiments were done with current master (9c496f7) compiled in optimized mode. The hardware: a single 20-core node, dual-socket Xeon E5-2650 v3 (the hwloc-ls output is at the end of this email).

First and foremost, trying to bind to NUMA nodes was a sure way to get a segfault:

$ mpirun -np 2 --mca btl vader,self --bind-to numa --report-bindings true
--------------------------------------------------------------------------
No objects of the specified type were found on at least one node:
  Type: NUMANode
  Node: arc00
The map cannot be done as specified.
--------------------------------------------------------------------------
[dancer:32162] *** Process received signal ***
[dancer:32162] Signal: Segmentation fault (11)
[dancer:32162] Signal code: Address not mapped (1)
[dancer:32162] Failing at address: 0x3c
[dancer:32162] [ 0] /lib64/libpthread.so.0[0x3126a0f7e0]
[dancer:32162] [ 1] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(+0x560e0)[0x7f9075bc60e0]
[dancer:32162] [ 2] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_grpcomm_API_xcast+0x84)[0x7f9075bc6f54]
[dancer:32162] [ 3] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_plm_base_orted_exit+0x1a8)[0x7f9075bd9308]
[dancer:32162] [ 4] /home/bosilca/opt/trunk/fast/lib/openmpi/mca_plm_rsh.so(+0x384e)[0x7f907361284e]
[dancer:32162] [ 5] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_state_base_check_all_complete+0x324)[0x7f9075bedca4]
[dancer:32162] [ 6] /home/bosilca/opt/trunk/fast/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7f90758eafec]
[dancer:32162] [ 7] mpirun[0x401251]
[dancer:32162] [ 8] mpirun[0x400e24]
[dancer:32162] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x312621ed1d]
[dancer:32162] [10] mpirun[0x400d49]
[dancer:32162] *** End of error message ***
Segmentation fault

As you can see in the hwloc output below, there are 2 NUMA nodes on this node and hwloc correctly identifies them, which makes the OMPI error message confusing. In any case, we should not segfault but report a more meaningful error message.

Binding to slot (I got this from the man page for 2.0) is apparently no longer supported. Reminder: we should update the man page accordingly.

Trying to bind to core looks better; the application at least starts. Unfortunately, the reported bindings (or at least my understanding of them) are troubling. Assuming that the way we report the bindings is correct, why is each of my processes assigned to 2 cores far apart from each other?

$ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
[arc00:39350] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:39350] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]

Maybe because I only used the binding option. Adding the mapping to the mix (the --map-by option) seems hopeless; the binding remains unchanged for 2 processes.

$ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
[arc00:40401] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:40401] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
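To separate a reporting problem from an actual binding problem, I would have each rank print the affinity mask the kernel actually enforces and compare it with what --report-bindings claims. A minimal, Linux-specific sketch (I did not run this as part of the experiments above; the file name and build line are placeholders):

/* affinity_check.c: each rank prints the OS-level CPU affinity it ended up with.
   Build (assumed): mpicc affinity_check.c -o affinity_check */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, cpu, off = 0;
    cpu_set_t mask;
    char list[1024] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        /* Collect the OS (physical) PU indices this process is allowed to run on. */
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask) && off < (int)sizeof(list) - 8)
                off += snprintf(list + off, sizeof(list) - off, " %d", cpu);
        printf("rank %d (pid %d) allowed PUs:%s\n", rank, (int)getpid(), list);
    }

    MPI_Finalize();
    return 0;
}

Running it with something like "mpirun -np 2 --bind-to core ./affinity_check" prints physical PU numbers, which have to be read against the hwloc numbering shown below (the two hardware threads of core 0 are P#0 and P#20, for example). If the OS-level lists match the double-core masks reported above, the binding itself is off; if not, it is the reporting.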
At this point I really wondered what was going on. To clarify, I tried launching 3 processes on the node. Bummer! The reported bindings show that one of my processes got assigned to cores on different sockets.

$ mpirun -np 3 --mca btl vader,self --bind-to core --report-bindings true
[arc00:40311] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:40311] MCW rank 2 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
[arc00:40311] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 1[core 12[hwt 0]]: [../../../../B./../../../../..][../../B./../../../../../../..]

Why is rank 1 on core 4 and rank 2 on core 1? Maybe specifying the mapping will help. Will I get a more sensible binding (as suggested by our online documentation and the man pages)?

$ mpirun -np 3 --mca btl vader,self --bind-to core --map-by core --report-bindings true
[arc00:40254] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:40254] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
[arc00:40254] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]

There is a difference: the logical order of the ranks is now respected, but one of my processes is still bound to 2 cores on different sockets, and these cores differ from the case where no mapping was specified. Trying to bind to sockets gave me an even more confusing outcome.

So I went the hard way: what can go wrong if I manually define the binding via a rankfile? Fail! My processes continue to report unsettling bindings (there is some relationship with my rank file, but most of the issues reported above remain). A variant of the rankfile using the socket:core syntax is sketched after the PS below.

$ more rankfile
rank 0=arc00 slot=0
rank 1=arc00 slot=2
$ mpirun -np 2 --mca btl vader,self -rf rankfile --report-bindings true
[arc00:40718] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[arc00:40718] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]

At this point I am pretty much completely confused about how OMPI binding works. I'm counting on a good Samaritan to explain how this works.

Thanks,
George.

PS: The rankfile feature of using relative hostnames (+n?) seems to be busted, as the example from the man page leads to the following complaint:

--------------------------------------------------------------------------
A relative host was specified, but no prior allocation has been made. Thus, there is no way to determine the proper host to be used.

hostfile entry: +n0

Please see the orte_hosts man page for further information.
--------------------------------------------------------------------------
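For reference, here is the rankfile variant I would try next, using the <socket>:<core> form of the slot= field from the mpirun man page (untested here, so the exact syntax may need checking):

rank 0=arc00 slot=0:0
rank 1=arc00 slot=0:1

If the double-core binding still shows up with an explicit socket and core per rank, that would point at the binding code itself rather than at the rankfile parsing.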
$ hwloc-ls
Machine (63GB)
  NUMANode L#0 (P#0 31GB)
    Socket L#0 + L3 L#0 (25MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#20)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#21)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#22)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#23)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#24)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#25)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#26)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#27)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#28)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#29)
  NUMANode L#1 (P#1 31GB)
    Socket L#1 + L3 L#1 (25MB)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#30)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#31)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#32)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#33)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#34)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#35)
      L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#36)
      L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#37)
      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#38)
      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#39)
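In case it helps anyone reading the topology above, the core-to-PU pairing (Core L#0 carrying PU P#0 and P#20, i.e. hyperthread siblings numbered 20 apart) can also be listed programmatically. A small hwloc sketch, purely for illustration and assuming the hwloc development headers are installed:

/* topo_pus.c: list each core's logical index and the OS indices of its PUs.
   Build (assumed): cc topo_pus.c -o topo_pus -lhwloc */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    int i, ncores;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        hwloc_obj_t pu = NULL;
        printf("Core L#%u:", core->logical_index);
        /* Walk the PUs (hardware threads) contained in this core's cpuset. */
        while ((pu = hwloc_get_next_obj_inside_cpuset_by_type(topo, core->cpuset,
                                                              HWLOC_OBJ_PU, pu)) != NULL)
            printf(" P#%u", pu->os_index);
        printf("\n");
    }

    hwloc_topology_destroy(topo);
    return 0;
}

On this node it should print lines like "Core L#0: P#0 P#20", which is the numbering the binding maps above have to be matched against.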