On Sat, Sep 3, 2016 at 10:34 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> Interesting - well, it looks like ORTE is working correctly. The map is
> what you would expect, and so is the planned binding.
>
> What this tells us is that we are indeed binding (so far as ORTE is
> concerned) to the correct places. Rank 0 is being bound to 0,8, and that is
> what the OS reports. Rank 1 is bound to 4,12, and rank 2 is bound to 1,9.
> All of this matches what the OS reported.
>
> So it looks like it is report-bindings that is messed up for some reason.

Ralph, I have a hard time agreeing with you here. The binding you consider
correct is, from a performance point of view, terrible. Why would anybody
want a process bound to 2 cores on different sockets?

Please help me with the following exercise: how do I bind each process to a
single core, allocated in round-robin fashion (rank 0 on core 0, rank 1 on
core 1, and rank 2 on core 2)?

  George.
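[For reference, the layout George asks for is nominally what mapping and
binding by core are documented to produce; a sketch, assuming a correctly
behaving mapper, which is exactly what is in question on this build:

  $ mpirun -np 3 --map-by core --bind-to core --report-bindings true

On a healthy installation, --report-bindings should then show rank 0 on
core 0, rank 1 on core 1, and rank 2 on core 2, each bound to a single core.]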
> On Sep 3, 2016, at 7:14 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> $ mpirun -np 3 --tag-output --bind-to core --report-bindings --display-devel-map --mca rmaps_base_verbose 10 true
>
>> [dancer.icl.utk.edu:17451] [[41198,0],0]: Final mapper priorities
>> [dancer.icl.utk.edu:17451]   Mapper: ppr Priority: 90
>> [dancer.icl.utk.edu:17451]   Mapper: seq Priority: 60
>> [dancer.icl.utk.edu:17451]   Mapper: resilient Priority: 40
>> [dancer.icl.utk.edu:17451]   Mapper: mindist Priority: 20
>> [dancer.icl.utk.edu:17451]   Mapper: round_robin Priority: 10
>> [dancer.icl.utk.edu:17451]   Mapper: staged Priority: 5
>> [dancer.icl.utk.edu:17451]   Mapper: rank_file Priority: 0
>> [dancer.icl.utk.edu:17451] mca:rmaps: mapping job [41198,1]
>> [dancer.icl.utk.edu:17451] mca:rmaps: setting mapping policies for job [41198,1]
>> [dancer.icl.utk.edu:17451] mca:rmaps[153] mapping not set by user - using bysocket
>> [dancer.icl.utk.edu:17451] mca:rmaps:ppr: job [41198,1] not using ppr mapper PPR NULL policy PPR NOTSET
>> [dancer.icl.utk.edu:17451] mca:rmaps:seq: job [41198,1] not using seq mapper
>> [dancer.icl.utk.edu:17451] mca:rmaps:resilient: cannot perform initial map of job [41198,1] - no fault groups
>> [dancer.icl.utk.edu:17451] mca:rmaps:mindist: job [41198,1] not using mindist mapper
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping job [41198,1]
>> [dancer.icl.utk.edu:17451] AVAILABLE NODES FOR MAPPING:
>> [dancer.icl.utk.edu:17451]     node: arc00 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc01 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc02 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc03 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc04 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc05 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc06 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc07 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc08 daemon: NULL
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping no-span by Package for job [41198,1] slots 180 num_procs 3
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: found 2 Package objects on node arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: calculated nprocs 20
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: assigning nprocs 20
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: computing vpids by slot for job [41198,1]
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 0 to node arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 1 to node arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 2 to node arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps: compute bindings for job [41198,1] with policy CORE[4008]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: node arc00 has 3 procs on it
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc [[41198,1],0]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc [[41198,1],1]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc [[41198,1],2]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] bind_depth: 5 map_depth 1
>> [dancer.icl.utk.edu:17451] mca:rmaps: bind downward for job [41198,1] with bindings CORE
>> [dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
>> [dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],0] BITMAP 0,8
>> [dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],0][arc00] TO socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
>> [dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],1] BITMAP 4,12
>> [dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],1][arc00] TO socket 1[core 4[hwt 0-1]]: [../../../..][BB/../../..]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
>> [dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],2] BITMAP 1,9
>> [dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],2][arc00] TO socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..]
>> [1,0]<stderr>:[arc00:07612] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>> [1,1]<stderr>:[arc00:07612] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 1[core 12[hwt 0]]: [../../../../B./../../../../..][../../B./../../../../../../..]
>> [1,2]<stderr>:[arc00:07612] MCW rank 2 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>
> On Sat, Sep 3, 2016 at 9:44 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>
>> Okay, can you add --display-devel-map --mca rmaps_base_verbose 10 to your
>> cmd line?
>>
>> It sounds like there is something about that topo that is bothering the
>> mapper.
>>
>> On Sep 2, 2016, at 9:31 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> Thanks Gilles, that's a very useful trick. The bindings reported by ORTE
>> are in sync with the ones reported by the OS.
>>
>> $ mpirun -np 2 --tag-output --bind-to core --report-bindings grep Cpus_allowed_list /proc/self/status
>> [1,0]<stderr>:[arc00:90813] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 4[hwt 0]]: [B./../../../B./../../../../..][../../../../../../../../../..]
>> [1,1]<stderr>:[arc00:90813] MCW rank 1 bound to socket 1[core 10[hwt 0]], socket 1[core 14[hwt 0]]: [../../../../../../../../../..][B./../../../B./../../../../..]
>> [1,0]<stdout>:Cpus_allowed_list: 0,8
>> [1,1]<stdout>:Cpus_allowed_list: 1,9
>>
>>   George.
>>
>> On Sat, Sep 3, 2016 at 12:27 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>
>>> George,
>>>
>>> I cannot help much with this, I am afraid.
>>>
>>> My best bet would be to rebuild Open MPI with --enable-debug and an
>>> external, recent hwloc (IIRC, hwloc v2 cannot be used in Open MPI yet).
>>>
>>> You might also want to try
>>>
>>>   mpirun --tag-output --bind-to xxx --report-bindings grep Cpus_allowed_list /proc/self/status
>>>
>>> so you can confirm that Open MPI and /proc/self/status report the same
>>> thing.
>>>
>>> Hope this helps a bit ...
>>>
>>> Gilles
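[An equivalent cross-check can be made with hwloc's own tools. Assuming
hwloc-bind is installed on the compute nodes, asking each rank to print the
binding hwloc sees should agree with both --report-bindings and
Cpus_allowed_list; a sketch, not run on this cluster:

  $ mpirun -np 2 --tag-output --bind-to core --report-bindings hwloc-bind --get

hwloc-bind --get prints the CPU mask of the calling process, so any
disagreement between that output and the --report-bindings lines again
points at the reporting code rather than the actual binding.]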
>>> George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>> While investigating the ongoing issue with the OMPI messaging layer, I
>>> ran into some trouble with process binding. I read the documentation, but
>>> I still find this puzzling.
>>>
>>> Disclaimer: all experiments were done with current master (9c496f7)
>>> compiled in optimized mode. The hardware: a single 20-core node
>>> (dual-socket Xeon E5-2650 v3; the hwloc-ls output is at the end of this
>>> email).
>>>
>>> First and foremost, trying to bind to NUMA nodes was a sure way to get a
>>> segfault:
>>>
>>> $ mpirun -np 2 --mca btl vader,self --bind-to numa --report-bindings true
>>> --------------------------------------------------------------------------
>>> No objects of the specified type were found on at least one node:
>>>
>>>   Type: NUMANode
>>>   Node: arc00
>>>
>>> The map cannot be done as specified.
>>> --------------------------------------------------------------------------
>>> [dancer:32162] *** Process received signal ***
>>> [dancer:32162] Signal: Segmentation fault (11)
>>> [dancer:32162] Signal code: Address not mapped (1)
>>> [dancer:32162] Failing at address: 0x3c
>>> [dancer:32162] [ 0] /lib64/libpthread.so.0[0x3126a0f7e0]
>>> [dancer:32162] [ 1] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(+0x560e0)[0x7f9075bc60e0]
>>> [dancer:32162] [ 2] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_grpcomm_API_xcast+0x84)[0x7f9075bc6f54]
>>> [dancer:32162] [ 3] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_plm_base_orted_exit+0x1a8)[0x7f9075bd9308]
>>> [dancer:32162] [ 4] /home/bosilca/opt/trunk/fast/lib/openmpi/mca_plm_rsh.so(+0x384e)[0x7f907361284e]
>>> [dancer:32162] [ 5] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_state_base_check_all_complete+0x324)[0x7f9075bedca4]
>>> [dancer:32162] [ 6] /home/bosilca/opt/trunk/fast/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7f90758eafec]
>>> [dancer:32162] [ 7] mpirun[0x401251]
>>> [dancer:32162] [ 8] mpirun[0x400e24]
>>> [dancer:32162] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x312621ed1d]
>>> [dancer:32162] [10] mpirun[0x400d49]
>>> [dancer:32162] *** End of error message ***
>>> Segmentation fault
>>>
>>> As you can see in the hwloc output below, there are 2 NUMA nodes on the
>>> node, and hwloc correctly identifies them, which makes OMPI's error
>>> message confusing. In any case, we should not segfault, but rather report
>>> a more meaningful error message.
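[A possibly useful detail: on this machine each NUMA node coincides with a
socket (see the hwloc-ls output at the end of the thread), so binding to
sockets is, topologically, the same request. A hedged workaround sketch,
keeping in mind that the socket-binding results reported later in this
thread were also suspect:

  $ mpirun -np 2 --mca btl vader,self --bind-to socket --report-bindings true

Whatever it reports should still be cross-checked against /proc/self/status
as Gilles suggested.]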
>>> Binding to slot (I got this from the man page for 2.0) is apparently no
>>> longer supported. Reminder: we should update the man page accordingly.
>>>
>>> Trying to bind to core looks better; the application at least starts.
>>> Unfortunately, the reported bindings (or at least my understanding of
>>> them) are troubling. Assuming that the way we report the bindings is
>>> correct, why is each of my processes assigned to 2 cores that are far
>>> apart?
>>>
>>> $ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
>>> [arc00:39350] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>>> [arc00:39350] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>>>
>>> Maybe that is because I only used the binding option. But adding the
>>> mapping to the mix (the --map-by option) seems hopeless; the binding
>>> remains unchanged for 2 processes.
>>>
>>> $ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
>>> [arc00:40401] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>>> [arc00:40401] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>>>
>>> At this point I really wondered what was going on. To clarify, I tried to
>>> launch 3 processes on the node. Bummer! The reported binding shows that
>>> one of my processes got assigned to cores on different sockets.
>>>
>>> $ mpirun -np 3 --mca btl vader,self --bind-to core --report-bindings true
>>> [arc00:40311] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>>> [arc00:40311] MCW rank 2 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>>> [arc00:40311] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 1[core 12[hwt 0]]: [../../../../B./../../../../..][../../B./../../../../../../..]
>>>
>>> Why is rank 1 on core 4 and rank 2 on core 1? Maybe specifying the
>>> mapping will help. Will I get a more sensible binding (as suggested by
>>> our online documentation and the man pages)?
>>>
>>> $ mpirun -np 3 --mca btl vader,self --bind-to core --map-by core --report-bindings true
>>> [arc00:40254] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>>> [arc00:40254] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>>> [arc00:40254] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]
>>>
>>> There is a difference: the logical rank order of the processes is now
>>> respected, but one of my processes is still bound to 2 cores on different
>>> sockets, and these cores differ from the case where the mapping was not
>>> specified.
>>>
>>> Trying to bind to sockets, I got an even more confusing outcome. So I
>>> went the hard way: what can go wrong if I manually define the binding via
>>> a rankfile? Fail! My processes continue to report unsettling bindings
>>> (there is some relationship with my rankfile, but most of the issues I
>>> reported above remain).
>>>
>>> $ more rankfile
>>> rank 0=arc00 slot=0
>>> rank 1=arc00 slot=2
>>> $ mpirun -np 2 --mca btl vader,self -rf rankfile --report-bindings true
>>> [arc00:40718] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>>> [arc00:40718] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]
>>>
>>> At this point I am pretty much completely confused about how OMPI binding
>>> works. I'm counting on a good samaritan to explain it.
>>>
>>> Thanks,
>>>   George.
>>>
>>> PS: The rankfile feature of using relative hostnames (+n?) seems to be
>>> busted, as the example from the man page leads to the following complaint:
>>>
>>> --------------------------------------------------------------------------
>>> A relative host was specified, but no prior allocation has been made.
>>> Thus, there is no way to determine the proper host to be used.
>>>
>>> hostfile entry: +n0
>>>
>>> Please see the orte_hosts man page for further information.
>>> --------------------------------------------------------------------------
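[On the rankfile experiments above: the mpirun man page also documents a
socket:core form for slots, which states the intent of one specific core per
rank more explicitly. A sketch assuming that documented syntax, not verified
against this build:

  $ cat rankfile
  rank 0=arc00 slot=0:0
  rank 1=arc00 slot=0:2

Here slot=0:2 requests core 2 on socket 0, whereas how a bare slot=2 is
interpreted is part of what is in question in this thread.]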
>>>
>>> $ hwloc-ls
>>> Machine (63GB)
>>>   NUMANode L#0 (P#0 31GB)
>>>     Socket L#0 + L3 L#0 (25MB)
>>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>>>         PU L#0 (P#0)
>>>         PU L#1 (P#20)
>>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>>>         PU L#2 (P#1)
>>>         PU L#3 (P#21)
>>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>>>         PU L#4 (P#2)
>>>         PU L#5 (P#22)
>>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>>>         PU L#6 (P#3)
>>>         PU L#7 (P#23)
>>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>>>         PU L#8 (P#4)
>>>         PU L#9 (P#24)
>>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>>>         PU L#10 (P#5)
>>>         PU L#11 (P#25)
>>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>>>         PU L#12 (P#6)
>>>         PU L#13 (P#26)
>>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>>>         PU L#14 (P#7)
>>>         PU L#15 (P#27)
>>>       L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>>>         PU L#16 (P#8)
>>>         PU L#17 (P#28)
>>>       L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>>>         PU L#18 (P#9)
>>>         PU L#19 (P#29)
>>>   NUMANode L#1 (P#1 31GB)
>>>     Socket L#1 + L3 L#1 (25MB)
>>>       L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>>>         PU L#20 (P#10)
>>>         PU L#21 (P#30)
>>>       L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>>>         PU L#22 (P#11)
>>>         PU L#23 (P#31)
>>>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>>>         PU L#24 (P#12)
>>>         PU L#25 (P#32)
>>>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>>>         PU L#26 (P#13)
>>>         PU L#27 (P#33)
>>>       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>>>         PU L#28 (P#14)
>>>         PU L#29 (P#34)
>>>       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>>>         PU L#30 (P#15)
>>>         PU L#31 (P#35)
>>>       L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
>>>         PU L#32 (P#16)
>>>         PU L#33 (P#36)
>>>       L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
>>>         PU L#34 (P#17)
>>>         PU L#35 (P#37)
>>>       L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
>>>         PU L#36 (P#18)
>>>         PU L#37 (P#38)
>>>       L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
>>>         PU L#38 (P#19)
>>>         PU L#39 (P#39)
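[For decoding the masks above against this topology: assuming the hwloc-calc
utility that ships with hwloc, logical objects can be translated into
physical PU numbers, e.g.:

  $ hwloc-calc --intersect pu --physical-output core:0
  0,20

Logical core 0 holds PUs P#0 and P#20, so an OS mask such as 0,8 (rank 0
above) covers one hardware thread on each of cores 0 and 8 rather than the
two threads of a single core.]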
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel