HT is not enabled. All nodes have the same topology. This is reproducible
even on a single node.

I ran osu_latency to check whether the process really is mapped to the other
socket with --map-by socket. Judging by the latency numbers, the mapping is
correct.

$mpirun -np 2 -report-bindings -map-by socket  
/hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency
[clx-orion-001:10084] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
[B/././././././././././././.][./././././././././././././.]
[clx-orion-001:10084] MCW rank 1 bound to socket 1[core 14[hwt 0]]: 
[./././././././././././././.][B/././././././././././././.]
# OSU MPI Latency Test v4.4.1
# Size          Latency (us)
0                       0.50
1                       0.50
2                       0.50
4                       0.49


$mpirun -np 2 -report-bindings -cpu-set 1,7 
/hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency
[clx-orion-001:10155] MCW rank 0 bound to socket 0[core 1[hwt 0]]: 
[./B/./././././././././././.][./././././././././././././.]
[clx-orion-001:10155] MCW rank 1 bound to socket 0[core 7[hwt 0]]: 
[./././././././B/./././././.][./././././././././././././.]
# OSU MPI Latency Test v4.4.1
# Size          Latency (us)
0                       0.23
1                       0.24
2                       0.23
4                       0.22
8                       0.23
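
As an independent cross-check, each rank can also report the OS CPU it is
actually running on. A minimal sketch (assuming Linux/glibc, where
sched_getcpu() returns the kernel's CPU number, i.e. hwloc's P# index):

#define _GNU_SOURCE
#include <sched.h>      /* sched_getcpu() (glibc, Linux) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Prints the OS (P#) CPU number, independent of how
       --report-bindings chooses to label the core. */
    printf("rank %d running on physical cpu %d\n", rank, sched_getcpu());
    MPI_Finalize();
    return 0;
}

With --map-by socket, rank 1 landing on physical cpu 7 here would confirm
that the "core 14" in the binding report is a logical index.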

Both hwloc and /proc/cpuinfo indicate the following CPU numbering:
socket 0 cpus: 0 1 2 3 4 5 6 14 15 16 17 18 19 20
socket 1 cpus: 7 8 9 10 11 12 13 21 22 23 24 25 26 27
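
That numbering can also be dumped programmatically; a short sketch against
the hwloc 1.x C API (socket cpusets are expressed in OS/physical P# indexes;
in hwloc 2.x the object type is HWLOC_OBJ_PACKAGE):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t sock = NULL;
    char cpus[128];

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    /* Each socket's cpuset lists the OS (P#) CPU numbers it contains. */
    while ((sock = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_SOCKET,
                                              sock)) != NULL) {
        hwloc_bitmap_list_snprintf(cpus, sizeof(cpus), sock->cpuset);
        printf("socket %u cpus: %s\n", sock->os_index, cpus);
    }
    hwloc_topology_destroy(topo);
    return 0;
}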

$hwloc-info -f
Machine (256GB)
  NUMANode L#0 (P#0 128GB) + Socket L#0 + L3 L#0 (35MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#14)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#15)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#16)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#17)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#18)
    L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12 + PU L#12 (P#19)
    L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13 + PU L#13 (P#20)
  NUMANode L#1 (P#1 128GB) + Socket L#1 + L3 L#1 (35MB)
    L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14 + PU L#14 (P#7)
    L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15 + PU L#15 (P#8)
    L2 L#16 (256KB) + L1 L#16 (32KB) + Core L#16 + PU L#16 (P#9)
    L2 L#17 (256KB) + L1 L#17 (32KB) + Core L#17 + PU L#17 (P#10)
    L2 L#18 (256KB) + L1 L#18 (32KB) + Core L#18 + PU L#18 (P#11)
    L2 L#19 (256KB) + L1 L#19 (32KB) + Core L#19 + PU L#19 (P#12)
    L2 L#20 (256KB) + L1 L#20 (32KB) + Core L#20 + PU L#20 (P#13)
    L2 L#21 (256KB) + L1 L#21 (32KB) + Core L#21 + PU L#21 (P#21)
    L2 L#22 (256KB) + L1 L#22 (32KB) + Core L#22 + PU L#22 (P#22)
    L2 L#23 (256KB) + L1 L#23 (32KB) + Core L#23 + PU L#23 (P#23)
    L2 L#24 (256KB) + L1 L#24 (32KB) + Core L#24 + PU L#24 (P#24)
    L2 L#25 (256KB) + L1 L#25 (32KB) + Core L#25 + PU L#25 (P#25)
    L2 L#26 (256KB) + L1 L#26 (32KB) + Core L#26 + PU L#26 (P#26)
    L2 L#27 (256KB) + L1 L#27 (32KB) + Core L#27 + PU L#27 (P#27)


So, does --report-bindings show one more level of logical CPU numbering?
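
If it does, the translation is easy to see by dumping hwloc's
logical-to-physical PU mapping; a sketch (again the hwloc 1.x API):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    int i, n;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    /* logical_index is hwloc's L# numbering; os_index is the
       kernel's P# (physical) number. */
    for (i = 0; i < n; i++) {
        hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
        printf("PU L#%d -> P#%u\n", i, pu->os_index);
    }
    hwloc_topology_destroy(topo);
    return 0;
}

On this box that would show PU L#14 -> P#7, which matches "core 14" being
reported for a process bound to physical CPU 7.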


-Devendar


From: devel <devel-boun...@open-mpi.org> On Behalf Of Ralph Castain
Sent: Monday, April 20, 2015 3:52 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] binding output error

Also, was this with HT enabled? I'm wondering if the print code is
incorrectly computing the core because it isn't correctly accounting for HT
cpus.


On Mon, Apr 20, 2015 at 3:49 PM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
Ralph's the authority on this one, but just to be sure: are all nodes the same 
topology? E.g., does adding "--hetero-nodes" to the mpirun command line fix the 
problem?


> On Apr 20, 2015, at 9:29 AM, Elena Elkina
> <elena.elk...@itseez.com> wrote:
>
> Hi guys,
>
> I ran into an issue on our cluster related to the mapping & binding policies
> in 1.8.5.
>
> The problem is that the --report-bindings output doesn't correspond to the
> locale. It looks like there is a mistake in the output itself: it prints a
> sequential (logical) core number even though that core can be on another
> socket. For example,
>
> mpirun -np 2 --display-devel-map --report-bindings --map-by socket hostname
>  Data for JOB [43064,1] offset 0
>
>  Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYSOCKET  
> Ranking policy: SOCKET
>  Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>       Num new daemons: 0      New daemon starting vpid INVALID
>       Num nodes: 1
>
>  Data for node: clx-orion-001         Launch id: -1   State: 2
>       Daemon: [[43064,0],0]   Daemon launched: True
>       Num slots: 28   Slots in use: 2 Oversubscribed: FALSE
>       Num slots allocated: 28 Max slots: 0
>       Username on node: NULL
>       Num procs: 2    Next node_rank: 2
>       Data for proc: [[43064,1],0]
>               Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> 0-6,14-20       Bind location: 0        Binding: 0
>       Data for proc: [[43064,1],1]
>               Pid: 0  Local rank: 1   Node rank: 1    App rank: 1
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> 7-13,21-27      Bind location: 7        Binding: 7
> [clx-orion-001:26951] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
> [B/././././././././././././.][./././././././././././././.]
> [clx-orion-001:26951] MCW rank 1 bound to socket 1[core 14[hwt 0]]: 
> [./././././././././././././.][B/././././././././././././.]
>
> The second process should be bound to core 7 (not core 14).
>
>
> Another example:
> mpirun -np 8 --display-devel-map --report-bindings --map-by core hostname
>  Data for JOB [43202,1] offset 0
>
>  Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYCORE  
> Ranking policy: CORE
>  Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>       Num new daemons: 0      New daemon starting vpid INVALID
>       Num nodes: 1
>
>  Data for node: clx-orion-001         Launch id: -1   State: 2
>       Daemon: [[43202,0],0]   Daemon launched: True
>       Num slots: 28   Slots in use: 8 Oversubscribed: FALSE
>       Num slots allocated: 28 Max slots: 0
>       Username on node: NULL
>       Num procs: 8    Next node_rank: 8
>       Data for proc: [[43202,1],0]
>               Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> 0       Bind location: 0        Binding: 0
>       Data for proc: [[43202,1],1]
>               Pid: 0  Local rank: 1   Node rank: 1    App rank: 1
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> 1       Bind location: 1        Binding: 1
>       Data for proc: [[43202,1],2]
>               Pid: 0  Local rank: 2   Node rank: 2    App rank: 2
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> 2       Bind location: 2        Binding: 2
>       Data for proc: [[43202,1],3]
>               Pid: 0  Local rank: 3   Node rank: 3    App rank: 3
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> 3       Bind location: 3        Binding: 3
>       Data for proc: [[43202,1],4]
>               Pid: 0  Local rank: 4   Node rank: 4    App rank: 4
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> 4       Bind location: 4        Binding: 4
>       Data for proc: [[43202,1],5]
>               Pid: 0  Local rank: 5   Node rank: 5    App rank: 5
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> 5       Bind location: 5        Binding: 5
>       Data for proc: [[43202,1],6]
>               Pid: 0  Local rank: 6   Node rank: 6    App rank: 6
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> 6       Bind location: 6        Binding: 6
>       Data for proc: [[43202,1],7]
>               Pid: 0  Local rank: 7   Node rank: 7    App rank: 7
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> 14      Bind location: 14       Binding: 14
> [clx-orion-001:27069] MCW rank 0 bound to socket 0[core 0[hwt 0]]: 
> [B/././././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 1 bound to socket 0[core 1[hwt 0]]: 
> [./B/./././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 2 bound to socket 0[core 2[hwt 0]]: 
> [././B/././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 3 bound to socket 0[core 3[hwt 0]]: 
> [./././B/./././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 4 bound to socket 0[core 4[hwt 0]]: 
> [././././B/././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 5 bound to socket 0[core 5[hwt 0]]: 
> [./././././B/./././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 6 bound to socket 0[core 6[hwt 0]]: 
> [././././././B/././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 7 bound to socket 0[core 7[hwt 0]]: 
> [./././././././B/./././././.][./././././././././././././.]
>
> Rank 7 should be bound to core 14 instead of core 7, since core 7 is on
> another socket.
>
> Best regards,
> Elena
>
>


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

