HT is not enabled. All nodes have the same topology. This is reproducible even on a single node.
I ran osu_latency with --map-by socket to check whether the second rank really is mapped to the other socket. Judging by the latency numbers, the mapping is correct:

$ mpirun -np 2 -report-bindings -map-by socket /hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency
[clx-orion-001:10084] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
[clx-orion-001:10084] MCW rank 1 bound to socket 1[core 14[hwt 0]]: [./././././././././././././.][B/././././././././././././.]
# OSU MPI Latency Test v4.4.1
# Size       Latency (us)
0            0.50
1            0.50
2            0.50
4            0.49

$ mpirun -np 2 -report-bindings -cpu-set 1,7 /hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency
[clx-orion-001:10155] MCW rank 0 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././.][./././././././././././././.]
[clx-orion-001:10155] MCW rank 1 bound to socket 0[core 7[hwt 0]]: [./././././././B/./././././.][./././././././././././././.]
# OSU MPI Latency Test v4.4.1
# Size       Latency (us)
0            0.23
1            0.24
2            0.23
4            0.22
8            0.23

Both hwloc and /proc/cpuinfo indicate the following CPU numbering:

socket 0 cpus: 0 1 2 3 4 5 6 14 15 16 17 18 19 20
socket 1 cpus: 7 8 9 10 11 12 13 21 22 23 24 25 26 27

$ hwloc-info -f
Machine (256GB)
  NUMANode L#0 (P#0 128GB) + Socket L#0 + L3 L#0 (35MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#14)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#15)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#16)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#17)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#18)
    L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12 + PU L#12 (P#19)
    L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13 + PU L#13 (P#20)
  NUMANode L#1 (P#1 128GB) + Socket L#1 + L3 L#1 (35MB)
    L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14 + PU L#14 (P#7)
    L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15 + PU L#15 (P#8)
    L2 L#16 (256KB) + L1 L#16 (32KB) + Core L#16 + PU L#16 (P#9)
    L2 L#17 (256KB) + L1 L#17 (32KB) + Core L#17 + PU L#17 (P#10)
    L2 L#18 (256KB) + L1 L#18 (32KB) + Core L#18 + PU L#18 (P#11)
    L2 L#19 (256KB) + L1 L#19 (32KB) + Core L#19 + PU L#19 (P#12)
    L2 L#20 (256KB) + L1 L#20 (32KB) + Core L#20 + PU L#20 (P#13)
    L2 L#21 (256KB) + L1 L#21 (32KB) + Core L#21 + PU L#21 (P#21)
    L2 L#22 (256KB) + L1 L#22 (32KB) + Core L#22 + PU L#22 (P#22)
    L2 L#23 (256KB) + L1 L#23 (32KB) + Core L#23 + PU L#23 (P#23)
    L2 L#24 (256KB) + L1 L#24 (32KB) + Core L#24 + PU L#24 (P#24)
    L2 L#25 (256KB) + L1 L#25 (32KB) + Core L#25 + PU L#25 (P#25)
    L2 L#26 (256KB) + L1 L#26 (32KB) + Core L#26 + PU L#26 (P#26)
    L2 L#27 (256KB) + L1 L#27 (32KB) + Core L#27 + PU L#27 (P#27)

So, is --report-bindings showing one more level of logical CPU numbering?

-Devendar

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Monday, April 20, 2015 3:52 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] binding output error

Also, was this with HTs enabled? I'm wondering if the print code is incorrectly computing the core because it isn't correctly accounting for HT cpus.
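For reference on the two numbering schemes in the hwloc output above: hwloc assigns every object both a logical index (the L# printed by hwloc tools) and an OS/physical index (the P#). Below is a minimal sketch against the hwloc 1.x C API (file name and output format are only illustrative, not from this thread) that dumps both indexes for every PU together with its socket:

/* pu_numbering.c -- print hwloc logical (L#) vs OS (P#) numbering per PU.
 * Sketch only.  Build (assuming hwloc is installed):  cc pu_numbering.c -lhwloc
 */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int npus = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    for (int i = 0; i < npus; i++) {
        hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);

        /* Walk up the tree to find the enclosing socket. */
        hwloc_obj_t sock = pu;
        while (sock && sock->type != HWLOC_OBJ_SOCKET)
            sock = sock->parent;

        printf("PU L#%u -> P#%u (socket L#%u)\n",
               pu->logical_index, pu->os_index,
               sock ? sock->logical_index : 0);
    }

    hwloc_topology_destroy(topo);
    return 0;
}

On the node above this should report, for example, PU L#7 -> P#14 on socket 0 and PU L#14 -> P#7 on socket 1, i.e. the logical and OS numbers diverge once the second half of socket 0 starts.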
On Mon, Apr 20, 2015 at 3:49 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

Ralph's the authority on this one, but just to be sure: do all nodes have the same topology? E.g., does adding "--hetero-nodes" to the mpirun command line fix the problem?

> On Apr 20, 2015, at 9:29 AM, Elena Elkina <elena.elk...@itseez.com> wrote:
>
> Hi guys,
>
> I ran into an issue on our cluster related to the mapping & binding policies in 1.8.5.
>
> The problem is that the --report-bindings output doesn't correspond to the locale. It looks like the mistake is in the output itself, because it just prints a sequential core number while that core can actually be on another socket. For example:
>
> mpirun -np 2 --display-devel-map --report-bindings --map-by socket hostname
>  Data for JOB [43064,1] offset 0
>
>  Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYSOCKET  Ranking policy: SOCKET
>  Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>  Num new daemons: 0  New daemon starting vpid INVALID
>  Num nodes: 1
>
>  Data for node: clx-orion-001  Launch id: -1  State: 2
>   Daemon: [[43064,0],0]  Daemon launched: True
>   Num slots: 28  Slots in use: 2  Oversubscribed: FALSE
>   Num slots allocated: 28  Max slots: 0
>   Username on node: NULL
>   Num procs: 2  Next node_rank: 2
>   Data for proc: [[43064,1],0]
>    Pid: 0  Local rank: 0  Node rank: 0  App rank: 0
>    State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 0-6,14-20  Bind location: 0  Binding: 0
>   Data for proc: [[43064,1],1]
>    Pid: 0  Local rank: 1  Node rank: 1  App rank: 1
>    State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 7-13,21-27  Bind location: 7  Binding: 7
> [clx-orion-001:26951] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
> [clx-orion-001:26951] MCW rank 1 bound to socket 1[core 14[hwt 0]]: [./././././././././././././.][B/././././././././././././.]
>
> The second process should be bound to core 7 (not core 14).
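If --report-bindings is indeed printing hwloc's logical core numbering (Devendar's question above), then "socket 1[core 14]" and OS CPU 7 refer to the same physical core, and only the label differs from the locale. A small sketch against the hwloc 1.x C API that resolves a logical core index to the OS index of its first PU (core number 14 is just the example from this thread):

/* core_lookup.c -- map a logical core index (L#) to the OS index (P#) of
 * its first PU.  Sketch only.  Build:  cc core_lookup.c -lhwloc
 */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    unsigned logical_core = 14;   /* logical core number printed by --report-bindings */
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, logical_core);
    if (core) {
        /* First PU contained in this core's cpuset. */
        hwloc_obj_t pu = hwloc_get_obj_inside_cpuset_by_type(topo, core->cpuset,
                                                             HWLOC_OBJ_PU, 0);
        printf("Core L#%u -> PU P#%u\n", core->logical_index,
               pu ? pu->os_index : 0);
    }

    hwloc_topology_destroy(topo);
    return 0;
}

With the topology shown earlier this would print "Core L#14 -> PU P#7", i.e. the core that /proc/cpuinfo and the locale call CPU 7.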
> Another example:
>
> mpirun -np 8 --display-devel-map --report-bindings --map-by core hostname
>  Data for JOB [43202,1] offset 0
>
>  Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYCORE  Ranking policy: CORE
>  Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>  Num new daemons: 0  New daemon starting vpid INVALID
>  Num nodes: 1
>
>  Data for node: clx-orion-001  Launch id: -1  State: 2
>   Daemon: [[43202,0],0]  Daemon launched: True
>   Num slots: 28  Slots in use: 8  Oversubscribed: FALSE
>   Num slots allocated: 28  Max slots: 0
>   Username on node: NULL
>   Num procs: 8  Next node_rank: 8
>   Data for proc: [[43202,1],0]
>    Pid: 0  Local rank: 0  Node rank: 0  App rank: 0
>    State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 0  Bind location: 0  Binding: 0
>   Data for proc: [[43202,1],1]
>    Pid: 0  Local rank: 1  Node rank: 1  App rank: 1
>    State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 1  Bind location: 1  Binding: 1
>   Data for proc: [[43202,1],2]
>    Pid: 0  Local rank: 2  Node rank: 2  App rank: 2
>    State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 2  Bind location: 2  Binding: 2
>   Data for proc: [[43202,1],3]
>    Pid: 0  Local rank: 3  Node rank: 3  App rank: 3
>    State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 3  Bind location: 3  Binding: 3
>   Data for proc: [[43202,1],4]
>    Pid: 0  Local rank: 4  Node rank: 4  App rank: 4
>    State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 4  Bind location: 4  Binding: 4
>   Data for proc: [[43202,1],5]
>    Pid: 0  Local rank: 5  Node rank: 5  App rank: 5
>    State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 5  Bind location: 5  Binding: 5
>   Data for proc: [[43202,1],6]
>    Pid: 0  Local rank: 6  Node rank: 6  App rank: 6
>    State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 6  Bind location: 6  Binding: 6
>   Data for proc: [[43202,1],7]
>    Pid: 0  Local rank: 7  Node rank: 7  App rank: 7
>    State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 14  Bind location: 14  Binding: 14
> [clx-orion-001:27069] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B/./././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 6 bound to socket 0[core 6[hwt 0]]: [././././././B/././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 7 bound to socket 0[core 7[hwt 0]]: [./././././././B/./././././.][./././././././././././././.]
>
> Rank 7 should be bound to core 14 instead of core 7, since core 7 is on another socket.
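One way to confirm that only the printed labels disagree (and that the processes really are bound where the mapper intended) is to have each rank report its own affinity mask straight from the OS, independent of --report-bindings. A minimal Linux-only sketch (not part of the original report; uses sched_getaffinity):

/* show_affinity.c -- each MPI rank prints the OS CPU numbers it may run on.
 * Sketch only.  Build:  mpicc show_affinity.c -o show_affinity
 * Run:           mpirun -np 8 --map-by core ./show_affinity
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = calling process */

    /* Collect the allowed OS CPU numbers into a string. */
    char buf[1024] = "";
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask)) {
            char tmp[16];
            snprintf(tmp, sizeof(tmp), " %d", cpu);
            strncat(buf, tmp, sizeof(buf) - strlen(buf) - 1);
        }
    }
    printf("rank %d: allowed OS cpus:%s\n", rank, buf);

    MPI_Finalize();
    return 0;
}

If the locale/bind-location fields are expressed in OS processor numbers, rank 7 in the example above should report OS CPU 14, which corresponds to "socket 0[core 7]" in hwloc's logical numbering.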
>
> Best regards,
> Elena

--
Jeff Squyres
jsquy...@cisco.com