$ mpirun -np 3 --tag-output --bind-to core --report-bindings --display-devel-map --mca rmaps_base_verbose 10 true
[dancer.icl.utk.edu:17451] [[41198,0],0]: Final mapper priorities
[dancer.icl.utk.edu:17451] Mapper: ppr Priority: 90
[dancer.icl.utk.edu:17451] Mapper: seq Priority: 60
[dancer.icl.utk.edu:17451] Mapper: resilient Priority: 40
[dancer.icl.utk.edu:17451] Mapper: mindist Priority: 20
[dancer.icl.utk.edu:17451] Mapper: round_robin Priority: 10
[dancer.icl.utk.edu:17451] Mapper: staged Priority: 5
[dancer.icl.utk.edu:17451] Mapper: rank_file Priority: 0
[dancer.icl.utk.edu:17451] mca:rmaps: mapping job [41198,1]
[dancer.icl.utk.edu:17451] mca:rmaps: setting mapping policies for job [41198,1]
[dancer.icl.utk.edu:17451] mca:rmaps[153] mapping not set by user - using bysocket
[dancer.icl.utk.edu:17451] mca:rmaps:ppr: job [41198,1] not using ppr mapper PPR NULL policy PPR NOTSET
[dancer.icl.utk.edu:17451] mca:rmaps:seq: job [41198,1] not using seq mapper
[dancer.icl.utk.edu:17451] mca:rmaps:resilient: cannot perform initial map of job [41198,1] - no fault groups
[dancer.icl.utk.edu:17451] mca:rmaps:mindist: job [41198,1] not using mindist mapper
[dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping job [41198,1]
[dancer.icl.utk.edu:17451] AVAILABLE NODES FOR MAPPING:
[dancer.icl.utk.edu:17451] node: arc00 daemon: NULL
[dancer.icl.utk.edu:17451] node: arc01 daemon: NULL
[dancer.icl.utk.edu:17451] node: arc02 daemon: NULL
[dancer.icl.utk.edu:17451] node: arc03 daemon: NULL
[dancer.icl.utk.edu:17451] node: arc04 daemon: NULL
[dancer.icl.utk.edu:17451] node: arc05 daemon: NULL
[dancer.icl.utk.edu:17451] node: arc06 daemon: NULL
[dancer.icl.utk.edu:17451] node: arc07 daemon: NULL
[dancer.icl.utk.edu:17451] node: arc08 daemon: NULL
[dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping no-span by Package for job [41198,1] slots 180 num_procs 3
[dancer.icl.utk.edu:17451] mca:rmaps:rr: found 2 Package objects on node arc00
[dancer.icl.utk.edu:17451] mca:rmaps:rr: calculated nprocs 20
[dancer.icl.utk.edu:17451] mca:rmaps:rr: assigning nprocs 20
[dancer.icl.utk.edu:17451] mca:rmaps:base: computing vpids by slot for job [41198,1]
[dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 0 to node arc00
[dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 1 to node arc00
[dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 2 to node arc00
[dancer.icl.utk.edu:17451] mca:rmaps: compute bindings for job [41198,1] with policy CORE[4008]
[dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: node arc00 has 3 procs on it
[dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc [[41198,1],0]
[dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc [[41198,1],1]
[dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc [[41198,1],2]
[dancer.icl.utk.edu:17451] [[41198,0],0] bind_depth: 5 map_depth 1
[dancer.icl.utk.edu:17451] mca:rmaps: bind downward for job [41198,1] with bindings CORE
[dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
[dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],0] BITMAP 0,8
[dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],0][arc00] TO socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]
[dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
[dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],1] BITMAP 4,12
[dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],1][arc00] TO socket 1[core 4[hwt 0-1]]: [../../../..][BB/../../..]
[dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
[dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],2] BITMAP 1,9
[dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],2][arc00] TO socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..]
[1,0]<stderr>:[arc00:07612] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
[1,1]<stderr>:[arc00:07612] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 1[core 12[hwt 0]]: [../../../../B./../../../../..][../../B./../../../../../../..]
[1,2]<stderr>:[arc00:07612] MCW rank 2 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]

On Sat, Sep 3, 2016 at 9:44 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> Okay, can you add --display-devel-map --mca rmaps_base_verbose 10 to your
> cmd line?
>
> It sounds like there is something about that topo that is bothering the
> mapper.
>
> On Sep 2, 2016, at 9:31 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Thanks Gilles, that's a very useful trick. The bindings reported by ORTE
> are in sync with the ones reported by the OS.
>
> $ mpirun -np 2 --tag-output --bind-to core --report-bindings grep Cpus_allowed_list /proc/self/status
> [1,0]<stderr>:[arc00:90813] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 4[hwt 0]]: [B./../../../B./../../../../..][../../../../../../../../../..]
> [1,1]<stderr>:[arc00:90813] MCW rank 1 bound to socket 1[core 10[hwt 0]], socket 1[core 14[hwt 0]]: [../../../../../../../../../..][B./../../../B./../../../../..]
> [1,0]<stdout>:Cpus_allowed_list: 0,8
> [1,1]<stdout>:Cpus_allowed_list: 1,9
>
> George.
>
> On Sat, Sep 3, 2016 at 12:27 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
>> George,
>>
>> I cannot help much with this, I am afraid.
>>
>> My best bet would be to rebuild Open MPI with --enable-debug and an
>> external recent hwloc (iirc hwloc v2 cannot be used in Open MPI yet).
>>
>> You might also want to try
>>   mpirun --tag-output --bind-to xxx --report-bindings grep Cpus_allowed_list /proc/self/status
>> so you can confirm both Open MPI and /proc/self/status report the same
>> thing.
>>
>> Hope this helps a bit ...
>>
>> Gilles
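The same cross-check can also be done from inside the application itself rather than with grep. Below is a minimal, Linux-specific sketch (illustrative only, not code from this thread) in which each MPI rank prints the CPUs the kernel will actually schedule it on, i.e. the same information as the Cpus_allowed_list line in /proc/self/status:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t mask;
    char host[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ask the kernel for this process's CPU affinity mask. */
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    gethostname(host, sizeof(host));

    /* Print the OS processor indexes this rank is allowed to run on. */
    printf("rank %d on %s bound to CPUs:", rank, host);
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    printf("\n");

    MPI_Finalize();
    return 0;
}

Launched with, e.g., mpirun -np 2 --bind-to core ./affinity_check (the program name is hypothetical), its output should agree with both --report-bindings and /proc/self/status.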
>>
>> George Bosilca <bosi...@icl.utk.edu> wrote:
>> While investigating the ongoing issue with the OMPI messaging layer, I ran
>> into some trouble with process binding. I read the documentation, but I
>> still find this puzzling.
>>
>> Disclaimer: all experiments were done with current master (9c496f7)
>> compiled in optimized mode. The hardware: a single node with a 20-core
>> Xeon E5-2650 v3 (hwloc-ls output is at the end of this email).
>>
>> First and foremost, trying to bind to NUMA nodes was a sure way to get a
>> segfault:
>>
>> $ mpirun -np 2 --mca btl vader,self --bind-to numa --report-bindings true
>> --------------------------------------------------------------------------
>> No objects of the specified type were found on at least one node:
>>
>>   Type: NUMANode
>>   Node: arc00
>>
>> The map cannot be done as specified.
>> --------------------------------------------------------------------------
>> [dancer:32162] *** Process received signal ***
>> [dancer:32162] Signal: Segmentation fault (11)
>> [dancer:32162] Signal code: Address not mapped (1)
>> [dancer:32162] Failing at address: 0x3c
>> [dancer:32162] [ 0] /lib64/libpthread.so.0[0x3126a0f7e0]
>> [dancer:32162] [ 1] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(+0x560e0)[0x7f9075bc60e0]
>> [dancer:32162] [ 2] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_grpcomm_API_xcast+0x84)[0x7f9075bc6f54]
>> [dancer:32162] [ 3] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_plm_base_orted_exit+0x1a8)[0x7f9075bd9308]
>> [dancer:32162] [ 4] /home/bosilca/opt/trunk/fast/lib/openmpi/mca_plm_rsh.so(+0x384e)[0x7f907361284e]
>> [dancer:32162] [ 5] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_state_base_check_all_complete+0x324)[0x7f9075bedca4]
>> [dancer:32162] [ 6] /home/bosilca/opt/trunk/fast/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7f90758eafec]
>> [dancer:32162] [ 7] mpirun[0x401251]
>> [dancer:32162] [ 8] mpirun[0x400e24]
>> [dancer:32162] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x312621ed1d]
>> [dancer:32162] [10] mpirun[0x400d49]
>> [dancer:32162] *** End of error message ***
>> Segmentation fault
>>
>> As you can see in the hwloc output below, there are 2 NUMA nodes on the
>> node and hwloc correctly identifies them, which makes the OMPI error message
>> confusing. In any case, we should not segfault but report a more meaningful
>> error message.
>>
>> Binding to slot (I got this from the man page for 2.0) is apparently not
>> supported anymore. Reminder: we should update the man page accordingly.
>>
>> Trying to bind to core looks better; the application at least starts.
>> Unfortunately, the reported bindings (or at least my understanding of them)
>> are troubling. Assuming that the way we report the bindings is correct,
>> why is each of my processes assigned to 2 cores far apart?
>>
>> $ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
>> [arc00:39350] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>> [arc00:39350] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>>
>> Maybe that is because I only used the binding option. Adding the mapping to
>> the mix (--map-by option) seems hopeless; the binding remains unchanged for
>> 2 processes.
>>
>> $ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
>> [arc00:40401] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>> [arc00:40401] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>>
>> At this point I really wondered what was going on. To clarify, I tried to
>> launch 3 processes on the node. Bummer! The reported binding shows that
>> one of my processes got assigned to cores on different sockets.
>>
>> $ mpirun -np 3 --mca btl vader,self --bind-to core --report-bindings true
>> [arc00:40311] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>> [arc00:40311] MCW rank 2 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>> [arc00:40311] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 1[core 12[hwt 0]]: [../../../../B./../../../../..][../../B./../../../../../../..]
>>
>> Why is rank 1 on core 4 and rank 2 on core 1? Maybe specifying the
>> mapping will help. Will I get a more sensible binding (as suggested by our
>> online documentation and the man pages)?
>>
>> $ mpirun -np 3 --mca btl vader,self --bind-to core --map-by core --report-bindings true
>> [arc00:40254] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>> [arc00:40254] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>> [arc00:40254] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]
>>
>> There is a difference: the logical rank of the processes is now respected,
>> but one of my processes is still bound to 2 cores on different sockets, and
>> these cores differ from the case where the mapping was not specified.
>>
>> Trying to bind to sockets, I got an even more confusing outcome. So I went
>> the hard way: what can go wrong if I manually define the binding via a
>> rankfile? Fail! My processes continue to report unsettling bindings
>> (there is some relationship with my rankfile, but most of the issues I
>> reported above remain).
>>
>> $ more rankfile
>> rank 0=arc00 slot=0
>> rank 1=arc00 slot=2
>> $ mpirun -np 2 --mca btl vader,self -rf rankfile --report-bindings true
>> [arc00:40718] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>> [arc00:40718] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]
>>
>> At this point I am pretty much completely confused about how OMPI binding
>> works. I'm counting on a good samaritan to explain it.
>>
>> Thanks,
>> George.
>>
>> PS: The rankfile feature of using relative hostnames (+n?) seems to be
>> busted, as the example from the man page leads to the following complaint:
>>
>> --------------------------------------------------------------------------
>> A relative host was specified, but no prior allocation has been made.
>> Thus, there is no way to determine the proper host to be used.
>>
>>   hostfile entry: +n0
>>
>> Please see the orte_hosts man page for further information.
>> --------------------------------------------------------------------------
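As an independent check on what any of the above policies (including the rankfile) actually produced, a process can also query its own binding through hwloc, the same library the runtime uses for mapping. A minimal sketch, assuming an external hwloc installation (the output format is illustrative, not from this thread):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    char buf[1024];

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* HWLOC_CPUBIND_PROCESS asks for the binding of the whole process,
     * as currently enforced by the operating system. */
    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
        hwloc_bitmap_list_snprintf(buf, sizeof(buf), set);
        printf("bound to OS cpus: %s\n", buf);
    } else {
        perror("hwloc_get_cpubind");
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}

Its output is directly comparable to the Cpus_allowed_list values shown earlier in the thread.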
>>
>> $ hwloc-ls
>> Machine (63GB)
>>   NUMANode L#0 (P#0 31GB)
>>     Socket L#0 + L3 L#0 (25MB)
>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>>         PU L#0 (P#0)
>>         PU L#1 (P#20)
>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>>         PU L#2 (P#1)
>>         PU L#3 (P#21)
>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>>         PU L#4 (P#2)
>>         PU L#5 (P#22)
>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>>         PU L#6 (P#3)
>>         PU L#7 (P#23)
>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>>         PU L#8 (P#4)
>>         PU L#9 (P#24)
>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>>         PU L#10 (P#5)
>>         PU L#11 (P#25)
>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>>         PU L#12 (P#6)
>>         PU L#13 (P#26)
>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>>         PU L#14 (P#7)
>>         PU L#15 (P#27)
>>       L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>>         PU L#16 (P#8)
>>         PU L#17 (P#28)
>>       L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>>         PU L#18 (P#9)
>>         PU L#19 (P#29)
>>   NUMANode L#1 (P#1 31GB)
>>     Socket L#1 + L3 L#1 (25MB)
>>       L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>>         PU L#20 (P#10)
>>         PU L#21 (P#30)
>>       L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>>         PU L#22 (P#11)
>>         PU L#23 (P#31)
>>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>>         PU L#24 (P#12)
>>         PU L#25 (P#32)
>>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>>         PU L#26 (P#13)
>>         PU L#27 (P#33)
>>       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>>         PU L#28 (P#14)
>>         PU L#29 (P#34)
>>       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>>         PU L#30 (P#15)
>>         PU L#31 (P#35)
>>       L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
>>         PU L#32 (P#16)
>>         PU L#33 (P#36)
>>       L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
>>         PU L#34 (P#17)
>>         PU L#35 (P#37)
>>       L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
>>         PU L#36 (P#18)
>>         PU L#37 (P#38)
>>       L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
>>         PU L#38 (P#19)
>>         PU L#39 (P#39)
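The topology listed above can also be walked programmatically, which makes it easier to translate the bracketed binding maps into physical processor numbers. A minimal sketch using the hwloc C API (assuming an external hwloc installation; illustrative only, not code from this thread) that prints, for each core, the OS indexes of its hardware threads:

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Enumerate all cores, then list the PUs (hardware threads) each one contains. */
    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        int npus = hwloc_get_nbobjs_inside_cpuset_by_type(topo, core->cpuset,
                                                          HWLOC_OBJ_PU);
        printf("Core L#%u:", core->logical_index);
        for (int j = 0; j < npus; j++) {
            hwloc_obj_t pu = hwloc_get_obj_inside_cpuset_by_type(topo, core->cpuset,
                                                                 HWLOC_OBJ_PU, j);
            printf(" PU P#%u", pu->os_index);
        }
        printf("\n");
    }

    hwloc_topology_destroy(topo);
    return 0;
}

On the machine above this should show, for example, Core L#0 containing PU P#0 and PU P#20, so an affinity mask of {0, 20} corresponds to a single core while a mask of {0, 8} spans two distinct cores.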
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel