George,

I cannot help much with this, I am afraid.

My best bet would be to rebuild Open MPI with --enable-debug and an external,
recent hwloc (IIRC hwloc v2 cannot be used in Open MPI yet).
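Something along these lines should do the trick (the prefix, paths and hwloc
version below are just placeholders, adjust them to your setup):

$ ./configure --prefix=$HOME/opt/ompi-debug --enable-debug \
      --with-hwloc=/path/to/hwloc-1.11.x
$ make -j && make install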

You might also want to try
mpirun --tag-output --bind-to xxx --report-bindings grep Cpus_allowed_list /proc/self/status

so you can confirm that both Open MPI and /proc/self/status report the same thing.
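For example, on your topology, with --bind-to core each rank should end up on
the two hardware threads of a single core, so the grep should print something
like (hypothetical values, ignoring the --tag-output prefix)

Cpus_allowed_list:   0,20
Cpus_allowed_list:   1,21

and that should match what --report-bindings prints for the same ranks.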

Hope this helps a bit ...

Gilles

George Bosilca <bosi...@icl.utk.edu> wrote:
>While investigating the ongoing issue with the OMPI messaging layer, I ran into
>some trouble with process binding. I read the documentation, but I still find
>this puzzling.
>
>
>Disclaimer: all experiments were done with current master (9c496f7) compiled 
>in optimized mode. The hardware: a single node 20 core Xeon E5-2650 v3 
>(hwloc-ls is at the end of this email).
>
>
>First and foremost, trying to bind to NUMA nodes was a sure way to get a 
>segfault:
>
>
>$ mpirun -np 2 --mca btl vader,self --bind-to numa --report-bindings true
>
>--------------------------------------------------------------------------
>
>No objects of the specified type were found on at least one node:
>
>
>  Type: NUMANode
>
>  Node: arc00
>
>
>The map cannot be done as specified.
>
>--------------------------------------------------------------------------
>
>[dancer:32162] *** Process received signal ***
>
>[dancer:32162] Signal: Segmentation fault (11)
>
>[dancer:32162] Signal code: Address not mapped (1)
>
>[dancer:32162] Failing at address: 0x3c
>
>[dancer:32162] [ 0] /lib64/libpthread.so.0[0x3126a0f7e0]
>
>[dancer:32162] [ 1] 
>/home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(+0x560e0)[0x7f9075bc60e0]
>
>[dancer:32162] [ 2] 
>/home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_grpcomm_API_xcast+0x84)[0x7f9075bc6f54]
>
>[dancer:32162] [ 3] 
>/home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_plm_base_orted_exit+0x1a8)[0x7f9075bd9308]
>
>[dancer:32162] [ 4] 
>/home/bosilca/opt/trunk/fast/lib/openmpi/mca_plm_rsh.so(+0x384e)[0x7f907361284e]
>
>[dancer:32162] [ 5] 
>/home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_state_base_check_all_complete+0x324)[0x7f9075bedca4]
>
>[dancer:32162] [ 6] 
>/home/bosilca/opt/trunk/fast/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7f90758eafec]
>
>[dancer:32162] [ 7] mpirun[0x401251]
>
>[dancer:32162] [ 8] mpirun[0x400e24]
>
>[dancer:32162] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x312621ed1d]
>
>[dancer:32162] [10] mpirun[0x400d49]
>
>[dancer:32162] *** End of error message ***
>
>Segmentation fault
>
>
>As you can see in the hwloc output below, there are 2 NUMA nodes on the node,
>and HWLOC correctly identifies them, which makes the OMPI error message confusing.
>In any case, we should not segfault, but report a more meaningful error message.
>
>
>Binding to slot (I got this from the man page for 2.0) is apparently not 
>supported anymore. Reminder: We should update the manpage accordingly.
>
>
>Trying to bind to core looks better: the application at least starts.
>Unfortunately, the reported bindings (or at least my understanding of these
>bindings) are troubling. Assuming that the way we report the bindings is
>correct, why is each of my processes assigned to 2 cores far apart from each other?
>
>
>$ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
>
>[arc00:39350] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 
>0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>
>[arc00:39350] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 
>0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>
>
>Maybe that is because I only used the binding option. Adding the mapping to the mix
>(the --map-by option) seems hopeless: the binding remains unchanged for 2 processes.
>
>
>$ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
>
>[arc00:40401] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 
>0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>
>[arc00:40401] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 
>0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>
>
>At this point I really wondered what was going on. To clarify things, I tried to
>launch 3 processes on the node. Bummer! The reported bindings show that one of my
>processes got assigned to cores on different sockets.
>
>
>$ mpirun -np 3 --mca btl vader,self --bind-to core --report-bindings true
>
>[arc00:40311] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 
>0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>
>[arc00:40311] MCW rank 2 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 
>0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>
>[arc00:40311] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 1[core 
>12[hwt 0]]: [../../../../B./../../../../..][../../B./../../../../../../..]
>
>
>Why is rank 1 on core 4 and rank 2 on core 1? Maybe specifying the mapping
>will help. Will I get a more sensible binding (as suggested by our online
>documentation and the man pages)?
>
>
>$ mpirun -np 3 --mca btl vader,self --bind-to core --map-by core 
>--report-bindings true
>
>[arc00:40254] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 
>0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>
>[arc00:40254] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 
>0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>
>[arc00:40254] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 1[core 
>10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]
>
>
>There is a difference: the logical ranks of the processes are now respected, but one
>of my processes is still bound to 2 cores on different sockets, and these
>cores are different from the case where the mapping was not specified.
>
>
>Trying to bind to sockets, I got an even more confusing outcome. So I went the
>hard way: what can go wrong if I manually define the binding via a rankfile?
>Fail! My processes continue to report unsettling bindings (there is some
>relationship with my rankfile, but most of the issues I reported above still
>remain).
>
>
>$ more rankfile
>
>rank 0=arc00 slot=0
>
>rank 1=arc00 slot=2
>
>$ mpirun -np 2 --mca btl vader,self -rf rankfile --report-bindings true
>
>[arc00:40718] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 
>0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>
>[arc00:40718] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 1[core 
>10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]
>
>
>At this point I am thoroughly confused about how OMPI binding
>works. I'm counting on a good Samaritan to explain it.
>
>
>Thanks,
>
>  George.
>
>
>PS: the rankfile feature of using relative hostnames (+n?) seems to be busted, as
>the example from the man page leads to the following complaint:
>
>
>--------------------------------------------------------------------------
>
>A relative host was specified, but no prior allocation has been made.
>
>Thus, there is no way to determine the proper host to be used.
>
>
>hostfile entry: +n0
>
>
>Please see the orte_hosts man page for further information.
>
>--------------------------------------------------------------------------
>
>
>
>$ hwloc-ls
>
>Machine (63GB)
>
>  NUMANode L#0 (P#0 31GB)
>
>    Socket L#0 + L3 L#0 (25MB)
>
>      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>
>        PU L#0 (P#0)
>
>        PU L#1 (P#20)
>
>      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>
>        PU L#2 (P#1)
>
>        PU L#3 (P#21)
>
>      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>
>        PU L#4 (P#2)
>
>        PU L#5 (P#22)
>
>      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>
>        PU L#6 (P#3)
>
>        PU L#7 (P#23)
>
>      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>
>        PU L#8 (P#4)
>
>        PU L#9 (P#24)
>
>      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>
>        PU L#10 (P#5)
>
>        PU L#11 (P#25)
>
>      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>
>        PU L#12 (P#6)
>
>        PU L#13 (P#26)
>
>      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>
>        PU L#14 (P#7)
>
>        PU L#15 (P#27)
>
>      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>
>        PU L#16 (P#8)
>
>        PU L#17 (P#28)
>
>      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>
>        PU L#18 (P#9)
>
>        PU L#19 (P#29)
>
>  NUMANode L#1 (P#1 31GB)
>
>    Socket L#1 + L3 L#1 (25MB)
>
>      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>
>        PU L#20 (P#10)
>
>        PU L#21 (P#30)
>
>      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>
>        PU L#22 (P#11)
>
>        PU L#23 (P#31)
>
>      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>
>        PU L#24 (P#12)
>
>        PU L#25 (P#32)
>
>      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>
>        PU L#26 (P#13)
>
>        PU L#27 (P#33)
>
>      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>
>        PU L#28 (P#14)
>
>        PU L#29 (P#34)
>
>      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>
>        PU L#30 (P#15)
>
>        PU L#31 (P#35)
>
>      L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
>
>        PU L#32 (P#16)
>
>        PU L#33 (P#36)
>
>      L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
>
>        PU L#34 (P#17)
>
>        PU L#35 (P#37)
>
>      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
>
>        PU L#36 (P#18)
>
>        PU L#37 (P#38)
>
>      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
>
>        PU L#38 (P#19)
>
>        PU L#39 (P#39)
>
>