Thanks for all these suggestions. I was able to get the expected bindings by
1) removing the --novm option and 2) adding --hetero-nodes. This is far from
an ideal setup, as I now have to build my own machinefile for every single
run, or spawn daemons on all the machines in the cluster.
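
For reference, my current workaround looks roughly like this (the hostnames
and slot counts below just illustrate my setup, they are not a verified
recipe):

$ cat machinefile
arc00 slots=20
arc01 slots=20
$ mpirun --hetero-nodes --machinefile machinefile --bind-to core -np 2 \
    --report-bindings grep Cpus_allowed_list /proc/self/status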

Wouldn't it be useful to have the daemon check the number of slots provided
in the machinefile against the number of local cores, and force the
hetero-nodes behavior automatically if they do not match?

  George.

PS: Is there an MCA parameter for "hetero-nodes" ?
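(My blind guess would be something along the lines of "--mca orte_hetero_nodes 1",
but I have not checked whether a parameter with that name actually exists.)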


On Sat, Sep 3, 2016 at 8:07 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> Ah, indeed - if the node where mpirun is executing doesn’t match the
> compute nodes, then you must remove that --novm option. Otherwise, we have
> no way of knowing what the compute node topology looks like.
>
>
> On Sep 3, 2016, at 4:13 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> George,
>
> If I understand correctly, you are running mpirun on dancer, which has
> 2 sockets, 4 cores per socket and 2 hwthreads per core,
> while the orted daemons are running on arc[00-08], though the tasks only run
> on arc00, which has 2 sockets, 10 cores per socket and 2 hwthreads per core.
>
> To me, it looks like Open MPI assumes all nodes are similar to dancer,
> which is incorrect.
>
> Can you try again with the --hetero-nodes option ?
> (iirc, that should not be needed because nodes should have different
> "hwloc signatures", and OpenMPI is supposed to handle that automatically
> and correctly)
>
> That could be a side effect of your MCA params; can you try removing them
> and running
> mpirun --host arc00 --bind-to core -np 2 --report-bindings grep Cpus_allowed_list /proc/self/status
> and then one more test with the --hetero-nodes option added?
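> (i.e., presumably something like: mpirun --hetero-nodes --host arc00 --bind-to core
> -np 2 --report-bindings grep Cpus_allowed_list /proc/self/status)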
>
> Bottom line, you might have to set yet another MCA param equivalent to
> the --hetero-nodes option.
>
> Cheers,
>
> Gilles
>
> r...@open-mpi.org wrote:
> Interesting - well, it looks like ORTE is working correctly. The map is
> what you would expect, and so is the planned binding.
>
> What this tells us is that we are indeed binding (so far as ORTE is
> concerned) to the correct places. Rank 0 is being bound to 0,8, and that is
> what the OS reports. Rank 1 is bound to 4,12, and rank 2 is bound to 1,9.
> All of this matches what the OS reported.
>
> So it looks like it is report-bindings that is messed up for some reason.
>
>
> On Sep 3, 2016, at 7:14 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> $ mpirun -np 3 --tag-output --bind-to core --report-bindings --display-devel-map --mca rmaps_base_verbose 10 true
>
> [dancer.icl.utk.edu:17451] [[41198,0],0]: Final mapper priorities
>> [dancer.icl.utk.edu:17451]      Mapper: ppr Priority: 90
>> [dancer.icl.utk.edu:17451]      Mapper: seq Priority: 60
>> [dancer.icl.utk.edu:17451]      Mapper: resilient Priority: 40
>> [dancer.icl.utk.edu:17451]      Mapper: mindist Priority: 20
>> [dancer.icl.utk.edu:17451]      Mapper: round_robin Priority: 10
>> [dancer.icl.utk.edu:17451]      Mapper: staged Priority: 5
>> [dancer.icl.utk.edu:17451]      Mapper: rank_file Priority: 0
>> [dancer.icl.utk.edu:17451] mca:rmaps: mapping job [41198,1]
>> [dancer.icl.utk.edu:17451] mca:rmaps: setting mapping policies for job
>> [41198,1]
>> [dancer.icl.utk.edu:17451] mca:rmaps[153] mapping not set by user -
>> using bysocket
>> [dancer.icl.utk.edu:17451] mca:rmaps:ppr: job [41198,1] not using ppr
>> mapper PPR NULL policy PPR NOTSET
>> [dancer.icl.utk.edu:17451] mca:rmaps:seq: job [41198,1] not using seq
>> mapper
>> [dancer.icl.utk.edu:17451] mca:rmaps:resilient: cannot perform initial
>> map of job [41198,1] - no fault groups
>> [dancer.icl.utk.edu:17451] mca:rmaps:mindist: job [41198,1] not using
>> mindist mapper
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping job [41198,1]
>> [dancer.icl.utk.edu:17451] AVAILABLE NODES FOR MAPPING:
>> [dancer.icl.utk.edu:17451]     node: arc00 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc01 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc02 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc03 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc04 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc05 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc06 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc07 daemon: NULL
>> [dancer.icl.utk.edu:17451]     node: arc08 daemon: NULL
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: mapping no-span by Package for
>> job [41198,1] slots 180 num_procs 3
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: found 2 Package objects on node
>> arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: calculated nprocs 20
>> [dancer.icl.utk.edu:17451] mca:rmaps:rr: assigning nprocs 20
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: computing vpids by slot for
>> job [41198,1]
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 0 to node arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 1 to node arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps:base: assigning rank 2 to node arc00
>> [dancer.icl.utk.edu:17451] mca:rmaps: compute bindings for job [41198,1]
>> with policy CORE[4008]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: node arc00 has 3
>> procs on it
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc
>> [[41198,1],0]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc
>> [[41198,1],1]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] reset_usage: ignoring proc
>> [[41198,1],2]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] bind_depth: 5 map_depth 1
>> [dancer.icl.utk.edu:17451] mca:rmaps: bind downward for job [41198,1]
>> with bindings CORE
>> [dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
>> [dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],0] BITMAP 0,8
>> [dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],0][arc00] TO socket 0[core 0[hwt 0-1]]: [BB/../../..][../../../..]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
>> [dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],1] BITMAP 4,12
>> [dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],1][arc00] TO socket 1[core 4[hwt 0-1]]: [../../../..][BB/../../..]
>> [dancer.icl.utk.edu:17451] [[41198,0],0] GOT 1 CPUS
>> [dancer.icl.utk.edu:17451] [[41198,0],0] PROC [[41198,1],2] BITMAP 1,9
>> [dancer.icl.utk.edu:17451] [[41198,0],0] BOUND PROC [[41198,1],2][arc00] TO socket 0[core 1[hwt 0-1]]: [../BB/../..][../../../..]
>> [1,0]<stderr>:[arc00:07612] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>> [1,1]<stderr>:[arc00:07612] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 1[core 12[hwt 0]]: [../../../../B./../../../../..][../../B./../../../../../../..]
>> [1,2]<stderr>:[arc00:07612] MCW rank 2 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>
>
> On Sat, Sep 3, 2016 at 9:44 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>
>> Okay, can you add --display-devel-map --mca rmaps_base_verbose 10 to your
>> cmd line?
>>
>> It sounds like there is something about that topo that is bothering the
>> mapper
>>
>> On Sep 2, 2016, at 9:31 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> Thanks Gilles, that's a very useful trick. The bindings reported by ORTE
>> are in sync with the ones reported by the OS.
>>
>> $ mpirun -np 2 --tag-output --bind-to core --report-bindings grep
>> Cpus_allowed_list /proc/self/status
>> [1,0]<stderr>:[arc00:90813] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 4[hwt 0]]: [B./../../../B./../../../../..][../../../../../../../../../..]
>> [1,1]<stderr>:[arc00:90813] MCW rank 1 bound to socket 1[core 10[hwt 0]], socket 1[core 14[hwt 0]]: [../../../../../../../../../..][B./../../../B./../../../../..]
>> [1,0]<stdout>:Cpus_allowed_list:        0,8
>> [1,1]<stdout>:Cpus_allowed_list:        1,9
>>
>> George.
>>
>>
>>
>> On Sat, Sep 3, 2016 at 12:27 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> George,
>>>
>>> I cannot help much with this, I am afraid.
>>>
>>> My best bet would be to rebuild Open MPI with --enable-debug and a recent
>>> external hwloc (iirc, hwloc v2 cannot be used in Open MPI yet).
>>>
>>> You might also want to try
>>> mpirun --tag-output --bind-to xxx --report-bindings grep
>>> Cpus_allowed_list /proc/self/status
>>>
>>> so that you can confirm both Open MPI and /proc/self/status report the
>>> same thing.
>>>
>>> Hope this helps a bit ...
>>>
>>> Gilles
>>>
>>>
>>> George Bosilca <bosi...@icl.utk.edu> wrote:
>>> While investigating the ongoing issue with the OMPI messaging layer, I ran
>>> into some trouble with process binding. I read the documentation, but I
>>> still find this puzzling.
>>>
>>> Disclaimer: all experiments were done with current master (9c496f7)
>>> compiled in optimized mode. The hardware is a single node with 20
>>> Xeon E5-2650 v3 cores (the hwloc-ls output is at the end of this email).
>>>
>>> First and foremost, trying to bind to NUMA nodes was a sure way to get a
>>> segfault:
>>>
>>> $ mpirun -np 2 --mca btl vader,self --bind-to numa --report-bindings true
>>> --------------------------------------------------------------------------
>>> No objects of the specified type were found on at least one node:
>>>
>>>   Type: NUMANode
>>>   Node: arc00
>>>
>>> The map cannot be done as specified.
>>> --------------------------------------------------------------------------
>>> [dancer:32162] *** Process received signal ***
>>> [dancer:32162] Signal: Segmentation fault (11)
>>> [dancer:32162] Signal code: Address not mapped (1)
>>> [dancer:32162] Failing at address: 0x3c
>>> [dancer:32162] [ 0] /lib64/libpthread.so.0[0x3126a0f7e0]
>>> [dancer:32162] [ 1] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(+0x560e0)[0x7f9075bc60e0]
>>> [dancer:32162] [ 2] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_grpcomm_API_xcast+0x84)[0x7f9075bc6f54]
>>> [dancer:32162] [ 3] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_plm_base_orted_exit+0x1a8)[0x7f9075bd9308]
>>> [dancer:32162] [ 4] /home/bosilca/opt/trunk/fast/lib/openmpi/mca_plm_rsh.so(+0x384e)[0x7f907361284e]
>>> [dancer:32162] [ 5] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_state_base_check_all_complete+0x324)[0x7f9075bedca4]
>>> [dancer:32162] [ 6] /home/bosilca/opt/trunk/fast/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7f90758eafec]
>>> [dancer:32162] [ 7] mpirun[0x401251]
>>> [dancer:32162] [ 8] mpirun[0x400e24]
>>> [dancer:32162] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x312621ed1d]
>>> [dancer:32162] [10] mpirun[0x400d49]
>>> [dancer:32162] *** End of error message ***
>>> Segmentation fault
>>>
>>> As you can see in the hwloc output below, there are 2 NUMA nodes on the
>>> node and hwloc correctly identifies them, which makes the OMPI error
>>> message confusing. In any case, we should not segfault, but report a more
>>> meaningful error message.
>>>
>>> Binding to slot (I got this from the man page for 2.0) is apparently not
>>> supported anymore. Reminder: we should update the man page accordingly.
>>>
>>> Trying to bind to core looks better; the application at least starts.
>>> Unfortunately, the reported bindings (or at least my understanding of
>>> them) are troubling. Assuming that the way we report the bindings is
>>> correct, why is each of my processes assigned to 2 cores that are far apart?
>>>
>>> $ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
>>> [arc00:39350] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>>> [arc00:39350] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>>>
>>> Maybe that is because I only used the binding option. However, adding the
>>> mapping to the mix (the --map-by option) seems hopeless; the bindings
>>> remain unchanged for the 2 processes.
>>>
>>> $ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
>>> [arc00:40401] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>>> [arc00:40401] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>>>
>>> At this point I really wondered what was going on. To clarify things, I
>>> tried to launch 3 processes on the node. Bummer! The reported bindings
>>> show that one of my processes got assigned to cores on different sockets.
>>>
>>> $ mpirun -np 3 --mca btl vader,self --bind-to core --report-bindings true
>>> [arc00:40311] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>>> [arc00:40311] MCW rank 2 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>>> [arc00:40311] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 1[core 12[hwt 0]]: [../../../../B./../../../../..][../../B./../../../../../../..]
>>>
>>> Why is rank 1 on core 4 and rank 2 on core 1 ? Maybe specifying the
>>> mapping will help. Will I get a more sensible binding (as suggested by our
>>> online documentation and the man pages) ?
>>>
>>> $ mpirun -np 3 --mca btl vader,self --bind-to core --map-by core
>>> --report-bindings true
>>> [arc00:40254] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>>> [arc00:40254] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>>> [arc00:40254] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]
>>>
>>> There is a difference: the logical rank order is now respected, but one of
>>> my processes is still bound to 2 cores on different sockets, and these
>>> cores are different from the case where the mapping was not specified.
>>>
>>> Trying to bind to sockets I got an even more confusing outcome. So I went
>>> the hard way: what can go wrong if I manually define the binding via a
>>> rankfile? Fail! My processes continue to report unsettling bindings
>>> (there is some relationship with my rankfile, but most of the issues I
>>> reported above still remain).
>>>
>>> $ more rankfile
>>> rank 0=arc00 slot=0
>>> rank 1=arc00 slot=2
>>> $ mpirun -np 2 --mca btl vader,self -rf rankfile --report-bindings true
>>> [arc00:40718] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
>>> [arc00:40718] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]
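>>>
>>> (For reference, the mpirun man page also documents an explicit socket:core
>>> rankfile syntax, e.g. "rank 0=arc00 slot=0:0" and "rank 1=arc00 slot=0:2";
>>> I have not checked whether that form behaves any differently here.)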
>>>
>>> At this point I am pretty much completely confused about how OMPI binding
>>> works. I'm counting on a good Samaritan to explain how this works.
>>>
>>> Thanks,
>>>   George.
>>>
>>> PS: The rankfile feature of using relative hostnames (+n?) seems to be
>>> busted, as the example from the man page leads to the following complaint:
>>>
>>> --------------------------------------------------------------------------
>>> A relative host was specified, but no prior allocation has been made.
>>> Thus, there is no way to determine the proper host to be used.
>>>
>>> hostfile entry: +n0
>>>
>>> Please see the orte_hosts man page for further information.
>>> --------------------------------------------------------------------------
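>>>
>>> (My guess, based on that message, is that the +n form is only resolved
>>> against a prior allocation, e.g. when running under a resource manager or
>>> with an explicit --hostfile, but I have not verified that.)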
>>>
>>>
>>> $ hwloc-ls
>>> Machine (63GB)
>>>   NUMANode L#0 (P#0 31GB)
>>>     Socket L#0 + L3 L#0 (25MB)
>>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>>>         PU L#0 (P#0)
>>>         PU L#1 (P#20)
>>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>>>         PU L#2 (P#1)
>>>         PU L#3 (P#21)
>>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>>>         PU L#4 (P#2)
>>>         PU L#5 (P#22)
>>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>>>         PU L#6 (P#3)
>>>         PU L#7 (P#23)
>>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>>>         PU L#8 (P#4)
>>>         PU L#9 (P#24)
>>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>>>         PU L#10 (P#5)
>>>         PU L#11 (P#25)
>>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>>>         PU L#12 (P#6)
>>>         PU L#13 (P#26)
>>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>>>         PU L#14 (P#7)
>>>         PU L#15 (P#27)
>>>       L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>>>         PU L#16 (P#8)
>>>         PU L#17 (P#28)
>>>       L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>>>         PU L#18 (P#9)
>>>         PU L#19 (P#29)
>>>   NUMANode L#1 (P#1 31GB)
>>>     Socket L#1 + L3 L#1 (25MB)
>>>       L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>>>         PU L#20 (P#10)
>>>         PU L#21 (P#30)
>>>       L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>>>         PU L#22 (P#11)
>>>         PU L#23 (P#31)
>>>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>>>         PU L#24 (P#12)
>>>         PU L#25 (P#32)
>>>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>>>         PU L#26 (P#13)
>>>         PU L#27 (P#33)
>>>       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>>>         PU L#28 (P#14)
>>>         PU L#29 (P#34)
>>>       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>>>         PU L#30 (P#15)
>>>         PU L#31 (P#35)
>>>       L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
>>>         PU L#32 (P#16)
>>>         PU L#33 (P#36)
>>>       L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
>>>         PU L#34 (P#17)
>>>         PU L#35 (P#37)
>>>       L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
>>>         PU L#36 (P#18)
>>>         PU L#37 (P#38)
>>>       L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
>>>         PU L#38 (P#19)
>>>         PU L#39 (P#39)
>>>
>>>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
