Here is what I see on my machine:

07:59:55 (v1.8) /home/common/openmpi/ompi-release$ mpirun -np 8 --display-devel-map --report-bindings --map-by core -host bend001 --bind-to core hostname

 Data for JOB [45531,1] offset 0

 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYCORE  Ranking policy: CORE
 Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
 Num new daemons: 0  New daemon starting vpid INVALID
 Num nodes: 1

 Data for node: bend001  Launch id: -1  State: 2
        Daemon: [[45531,0],0]  Daemon launched: True
        Num slots: 12  Slots in use: 8  Oversubscribed: FALSE
        Num slots allocated: 12  Max slots: 0
        Username on node: NULL
        Num procs: 8  Next node_rank: 8
        Data for proc: [[45531,1],0]
                Pid: 0  Local rank: 0  Node rank: 0  App rank: 0
                State: INITIALIZED  Restarts: 0  App_context: 0
                Locale: 0,12  Bind location: 0,12  Binding: 0,12
        Data for proc: [[45531,1],1]
                Pid: 0  Local rank: 1  Node rank: 1  App rank: 1
                State: INITIALIZED  Restarts: 0  App_context: 0
                Locale: 2,14  Bind location: 2,14  Binding: 2,14
        Data for proc: [[45531,1],2]
                Pid: 0  Local rank: 2  Node rank: 2  App rank: 2
                State: INITIALIZED  Restarts: 0  App_context: 0
                Locale: 4,16  Bind location: 4,16  Binding: 4,16
        Data for proc: [[45531,1],3]
                Pid: 0  Local rank: 3  Node rank: 3  App rank: 3
                State: INITIALIZED  Restarts: 0  App_context: 0
                Locale: 6,18  Bind location: 6,18  Binding: 6,18
        Data for proc: [[45531,1],4]
                Pid: 0  Local rank: 4  Node rank: 4  App rank: 4
                State: INITIALIZED  Restarts: 0  App_context: 0
                Locale: 8,20  Bind location: 8,20  Binding: 8,20
        Data for proc: [[45531,1],5]
                Pid: 0  Local rank: 5  Node rank: 5  App rank: 5
                State: INITIALIZED  Restarts: 0  App_context: 0
                Locale: 10,22  Bind location: 10,22  Binding: 10,22
        Data for proc: [[45531,1],6]
                Pid: 0  Local rank: 6  Node rank: 6  App rank: 6
                State: INITIALIZED  Restarts: 0  App_context: 0
                Locale: 1,13  Bind location: 1,13  Binding: 1,13
        Data for proc: [[45531,1],7]
                Pid: 0  Local rank: 7  Node rank: 7  App rank: 7
                State: INITIALIZED  Restarts: 0  App_context: 0
                Locale: 3,15  Bind location: 3,15  Binding: 3,15

[bend001:15493] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
[bend001:15493] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
[bend001:15493] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
[bend001:15493] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
[bend001:15493] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/..][../../../../../..]
[bend001:15493] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB][../../../../../..]
[bend001:15493] MCW rank 6 bound to socket 1[core 6[hwt 0-1]]: [../../../../../..][BB/../../../../..]
[bend001:15493] MCW rank 7 bound to socket 1[core 7[hwt 0-1]]: [../../../../../..][../BB/../../../..]

I have HT enabled on my box, so the devel-map shows Locale, Bind location, and Binding as the logical HT numbers (i.e., the PUs) for each proc. As you can see from the report-bindings output, things are indeed going where they should.

The numbering in the devel-map always looks a little odd because it depends on how the BIOS numbered the cpus, and, contrary to what you might expect, those numbers do tend to bounce around. In my case, for example, the BIOS gave the HTs/cores on the first socket all the even-numbered PUs, and the second socket got all the odd numbers. In other words, it assigned PUs round-robin by socket instead of sequentially across each socket. <shrug> Every BIOS does it differently, so there is no way to provide a standardized output. This is why we have --report-bindings to tell the user where the procs actually wound up.
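If you want to see exactly how your own BIOS laid out the PU numbers, hwloc will show both numberings side by side. Something along these lines should do it (a rough sketch; it assumes the hwloc command-line tools are installed, and option spellings can differ slightly between hwloc versions):

  # Console view of the topology: each PU is printed with its logical
  # index (L#) and the OS/BIOS "physical" index (P#) in parentheses
  hwloc-ls

  # Same tree, but numbered purely by the OS/physical indexes, which
  # makes an even/odd split across sockets easy to spot
  lstopo --physical --of console

On my box the second command should show socket 0 holding all the even-numbered PUs and socket 1 all the odd ones, which is exactly the pattern you see in the Locale/Binding fields of the devel-map above.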
HTH
Ralph

> On Apr 21, 2015, at 7:54 AM, Devendar Bureddy <deven...@mellanox.com> wrote:
>
> I agree.
>
> -----Original Message-----
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres)
> Sent: Tuesday, April 21, 2015 7:17 AM
> To: Open MPI Developers List
> Subject: Re: [OMPI devel] binding output error
>
> +1
>
> Devendar, you seem to be reporting a different issue than Elena...? FWIW: Open MPI has always used logical CPU numbering. As far as I can tell from your output, it looks like Open MPI did the Right Thing with your examples.
>
> Elena's example seemed to show conflicting cpu numbering -- where OMPI said it would bind a process versus where it actually bound it. Ralph mentioned to me that he would look at this as soon as he could; he thinks it might just be an error in the printf output (and that the binding is actually occurring in the right location).
>
>> On Apr 20, 2015, at 9:48 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>> Hi Devendar,
>>
>> As far as I know, the report-bindings option shows the logical cpu order. You, on the other hand, are talking about the physical one, I guess.
>>
>> Regards,
>> Tetsuya Mishima
>>
>> On 2015/04/21 9:04:37, "devel" wrote in "Re: [OMPI devel] binding output error":
>>>
>>> HT is not enabled. All nodes are the same topo. This is reproducible even on a single node.
>>>
>>> I ran osu_latency to see whether it really is mapped to the other socket with --map-by socket. It looks like the mapping is correct, judging by the latency test.
>>>
>>> $mpirun -np 2 -report-bindings -map-by socket /hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency
>>>
>>> [clx-orion-001:10084] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
>>> [clx-orion-001:10084] MCW rank 1 bound to socket 1[core 14[hwt 0]]: [./././././././././././././.][B/././././././././././././.]
>>>
>>> # OSU MPI Latency Test v4.4.1
>>> # Size        Latency (us)
>>> 0                     0.50
>>> 1                     0.50
>>> 2                     0.50
>>> 4                     0.49
>>>
>>> $mpirun -np 2 -report-bindings -cpu-set 1,7 /hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency
>>>
>>> [clx-orion-001:10155] MCW rank 0 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././.][./././././././././././././.]
>>> [clx-orion-001:10155] MCW rank 1 bound to socket 0[core 7[hwt 0]]: [./././././././B/./././././.][./././././././././././././.]
>>>
>>> # OSU MPI Latency Test v4.4.1
>>> # Size        Latency (us)
>>> 0                     0.23
>>> 1                     0.24
>>> 2                     0.23
>>> 4                     0.22
>>> 8                     0.23
>>>
>>> Both hwloc and /proc/cpuinfo indicate the following cpu numbering:
>>>
>>> socket 0 cpus: 0 1 2 3 4 5 6 14 15 16 17 18 19 20
>>> socket 1 cpus: 7 8 9 10 11 12 13 21 22 23 24 25 26 27
>>>
>>> $hwloc-info -f
>>> Machine (256GB)
>>>   NUMANode L#0 (P#0 128GB) + Socket L#0 + L3 L#0 (35MB)
>>>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>>>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>>>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>>>     L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>>>     L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>>>     L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>>>     L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#14)
>>>     L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#15)
>>>     L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#16)
>>>     L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#17)
>>>     L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#18)
>>>     L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12 + PU L#12 (P#19)
>>>     L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13 + PU L#13 (P#20)
>>>   NUMANode L#1 (P#1 128GB) + Socket L#1 + L3 L#1 (35MB)
>>>     L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14 + PU L#14 (P#7)
>>>     L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15 + PU L#15 (P#8)
>>>     L2 L#16 (256KB) + L1 L#16 (32KB) + Core L#16 + PU L#16 (P#9)
>>>     L2 L#17 (256KB) + L1 L#17 (32KB) + Core L#17 + PU L#17 (P#10)
>>>     L2 L#18 (256KB) + L1 L#18 (32KB) + Core L#18 + PU L#18 (P#11)
>>>     L2 L#19 (256KB) + L1 L#19 (32KB) + Core L#19 + PU L#19 (P#12)
>>>     L2 L#20 (256KB) + L1 L#20 (32KB) + Core L#20 + PU L#20 (P#13)
>>>     L2 L#21 (256KB) + L1 L#21 (32KB) + Core L#21 + PU L#21 (P#21)
>>>     L2 L#22 (256KB) + L1 L#22 (32KB) + Core L#22 + PU L#22 (P#22)
>>>     L2 L#23 (256KB) + L1 L#23 (32KB) + Core L#23 + PU L#23 (P#23)
>>>     L2 L#24 (256KB) + L1 L#24 (32KB) + Core L#24 + PU L#24 (P#24)
>>>     L2 L#25 (256KB) + L1 L#25 (32KB) + Core L#25 + PU L#25 (P#25)
>>>     L2 L#26 (256KB) + L1 L#26 (32KB) + Core L#26 + PU L#26 (P#26)
>>>     L2 L#27 (256KB) + L1 L#27 (32KB) + Core L#27 + PU L#27 (P#27)
>>>
>>> So, does --report-bindings show yet another level of logical CPU numbering?
>>>
>>> -Devendar
>>>
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>>> Sent: Monday, April 20, 2015 3:52 PM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] binding output error
>>>
>>> Also, was this with HTs enabled? I'm wondering if the print code is incorrectly computing the core because it isn't correctly accounting for HT cpus.
>>>
>>> On Mon, Apr 20, 2015 at 3:49 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>
>>> Ralph's the authority on this one, but just to be sure: are all nodes the same topology? E.g., does adding "--hetero-nodes" to the mpirun command line fix the problem?
>>>
>>>> On Apr 20, 2015, at 9:29 AM, Elena Elkina <elena.elk...@itseez.com> wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> I ran into an issue on our cluster related to the mapping & binding policies on 1.8.5.
>>>>
>>>> The matter is that the --report-bindings output doesn't correspond to the locale. It looks like there is a mistake in the output itself, because it just prints a sequential core number even though that core can be on another socket. For example:
>>>>
>>>> mpirun -np 2 --display-devel-map --report-bindings --map-by socket hostname
>>>> Data for JOB [43064,1] offset 0
>>>>
>>>> Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYSOCKET  Ranking policy: SOCKET
>>>> Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>>>> Num new daemons: 0  New daemon starting vpid INVALID
>>>> Num nodes: 1
>>>>
>>>> Data for node: clx-orion-001  Launch id: -1  State: 2
>>>>         Daemon: [[43064,0],0]  Daemon launched: True
>>>>         Num slots: 28  Slots in use: 2  Oversubscribed: FALSE
>>>>         Num slots allocated: 28  Max slots: 0
>>>>         Username on node: NULL
>>>>         Num procs: 2  Next node_rank: 2
>>>>         Data for proc: [[43064,1],0]
>>>>                 Pid: 0  Local rank: 0  Node rank: 0  App rank: 0
>>>>                 State: INITIALIZED  Restarts: 0  App_context: 0
>>>>                 Locale: 0-6,14-20  Bind location: 0  Binding: 0
>>>>         Data for proc: [[43064,1],1]
>>>>                 Pid: 0  Local rank: 1  Node rank: 1  App rank: 1
>>>>                 State: INITIALIZED  Restarts: 0  App_context: 0
>>>>                 Locale: 7-13,21-27  Bind location: 7  Binding: 7
>>>> [clx-orion-001:26951] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
>>>> [clx-orion-001:26951] MCW rank 1 bound to socket 1[core 14[hwt 0]]: [./././././././././././././.][B/././././././././././././.]
>>>>
>>>> The second process should be bound to core 7 (not core 14).
>>>>
>>>> Another example:
>>>>
>>>> mpirun -np 8 --display-devel-map --report-bindings --map-by core hostname
>>>> Data for JOB [43202,1] offset 0
>>>>
>>>> Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYCORE  Ranking policy: CORE
>>>> Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>>>> Num new daemons: 0  New daemon starting vpid INVALID
>>>> Num nodes: 1
>>>>
>>>> Data for node: clx-orion-001  Launch id: -1  State: 2
>>>>         Daemon: [[43202,0],0]  Daemon launched: True
>>>>         Num slots: 28  Slots in use: 8  Oversubscribed: FALSE
>>>>         Num slots allocated: 28  Max slots: 0
>>>>         Username on node: NULL
>>>>         Num procs: 8  Next node_rank: 8
>>>>         Data for proc: [[43202,1],0]
>>>>                 Pid: 0  Local rank: 0  Node rank: 0  App rank: 0
>>>>                 State: INITIALIZED  Restarts: 0  App_context: 0
>>>>                 Locale: 0  Bind location: 0  Binding: 0
>>>>         Data for proc: [[43202,1],1]
>>>>                 Pid: 0  Local rank: 1  Node rank: 1  App rank: 1
>>>>                 State: INITIALIZED  Restarts: 0  App_context: 0
>>>>                 Locale: 1  Bind location: 1  Binding: 1
>>>>         Data for proc: [[43202,1],2]
>>>>                 Pid: 0  Local rank: 2  Node rank: 2  App rank: 2
>>>>                 State: INITIALIZED  Restarts: 0  App_context: 0
>>>>                 Locale: 2  Bind location: 2  Binding: 2
>>>>         Data for proc: [[43202,1],3]
>>>>                 Pid: 0  Local rank: 3  Node rank: 3  App rank: 3
>>>>                 State: INITIALIZED  Restarts: 0  App_context: 0
>>>>                 Locale: 3  Bind location: 3  Binding: 3
>>>>         Data for proc: [[43202,1],4]
>>>>                 Pid: 0  Local rank: 4  Node rank: 4  App rank: 4
>>>>                 State: INITIALIZED  Restarts: 0  App_context: 0
>>>>                 Locale: 4  Bind location: 4  Binding: 4
>>>>         Data for proc: [[43202,1],5]
>>>>                 Pid: 0  Local rank: 5  Node rank: 5  App rank: 5
>>>>                 State: INITIALIZED  Restarts: 0  App_context: 0
>>>>                 Locale: 5  Bind location: 5  Binding: 5
>>>>         Data for proc: [[43202,1],6]
>>>>                 Pid: 0  Local rank: 6  Node rank: 6  App rank: 6
>>>>                 State: INITIALIZED  Restarts: 0  App_context: 0
>>>>                 Locale: 6  Bind location: 6  Binding: 6
>>>>         Data for proc: [[43202,1],7]
>>>>                 Pid: 0  Local rank: 7  Node rank: 7  App rank: 7
>>>>                 State: INITIALIZED  Restarts: 0  App_context: 0
>>>>                 Locale: 14  Bind location: 14  Binding: 14
>>>> [clx-orion-001:27069] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
>>>> [clx-orion-001:27069] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././.][./././././././././././././.]
>>>> [clx-orion-001:27069] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././.][./././././././././././././.]
>>>> [clx-orion-001:27069] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././.][./././././././././././././.]
>>>> [clx-orion-001:27069] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/././././././././.][./././././././././././././.]
>>>> [clx-orion-001:27069] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B/./././././././.][./././././././././././././.]
>>>> [clx-orion-001:27069] MCW rank 6 bound to socket 0[core 6[hwt 0]]: [././././././B/././././././.][./././././././././././././.]
>>>> [clx-orion-001:27069] MCW rank 7 bound to socket 0[core 7[hwt 0]]: [./././././././B/./././././.][./././././././././././././.]
>>>>
>>>> Rank 7 should be bound to core 14 instead of core 7, since core 7 is on another socket.
>>>>
>>>> Best regards,
>>>> Elena
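P.S. If anyone wants to double-check whether this really is just a printf problem, you can ask the OS directly where each rank landed, independent of Open MPI's own map/binding printouts. A rough sketch (it assumes hwloc's hwloc-bind is in the default PATH on the compute node; OMPI_COMM_WORLD_RANK is set by mpirun for each launched process):

  mpirun -np 2 --map-by socket --bind-to core \
      sh -c 'echo "rank $OMPI_COMM_WORLD_RANK: $(hwloc-bind --get) $(grep Cpus_allowed_list /proc/self/status)"'

hwloc-bind --get prints the cpuset mask each shell is actually bound to, and Cpus_allowed_list reports the same thing as OS/physical cpu numbers, so both can be compared directly against what --report-bindings and --display-devel-map claimed.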