Hi guys,

I ran into an issue on our cluster related to the mapping and binding
policies in Open MPI 1.8.5.

The problem is that the --report-bindings output doesn't match the locale
shown by --display-devel-map. The mistake appears to be in the output
itself: it just prints a sequential core number, even though that core may
be on another socket. For example:

mpirun -np 2 --display-devel-map --report-bindings --map-by socket hostname
 Data for JOB [43064,1] offset 0

 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYSOCKET  Ranking policy: SOCKET
 Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
  Num new daemons: 0 New daemon starting vpid INVALID
  Num nodes: 1

 Data for node: clx-orion-001  Launch id: -1 State: 2
  Daemon: [[43064,0],0] Daemon launched: True
  Num slots: 28 Slots in use: 2 Oversubscribed: FALSE
  Num slots allocated: 28 Max slots: 0
  Username on node: NULL
  Num procs: 2 Next node_rank: 2
  Data for proc: [[43064,1],0]
  Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
  State: INITIALIZED Restarts: 0 App_context: 0 *Locale: 0-6,14-20* Bind location: 0 Binding: 0
  Data for proc: [[43064,1],1]
  Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
  State: INITIALIZED Restarts: 0 App_context: 0 *Locale: 7-13,21-27* Bind location: 7 Binding: 7
[clx-orion-001:26951] MCW rank 0 bound to socket 0[*core 0*[hwt 0]]:
[B/././././././././././././.][./././././././././././././.]
[clx-orion-001:26951] MCW rank 1 bound to socket 1[*core 14*[hwt 0]]:
[./././././././././././././.][B/././././././././././././.]

The second process should be bound to core 7 (not core 14).


Another example:
mpirun -np 8 --display-devel-map --report-bindings --map-by core hostname
 Data for JOB [43202,1] offset 0

 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYCORE  Ranking policy: CORE
 Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
  Num new daemons: 0 New daemon starting vpid INVALID
  Num nodes: 1

 Data for node: clx-orion-001  Launch id: -1 State: 2
  Daemon: [[43202,0],0] Daemon launched: True
  Num slots: 28 Slots in use: 8 Oversubscribed: FALSE
  Num slots allocated: 28 Max slots: 0
  Username on node: NULL
  Num procs: 8 Next node_rank: 8
  Data for proc: [[43202,1],0]
  Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0 Bind location: 0 Binding: 0
  Data for proc: [[43202,1],1]
  Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: 1 Bind location: 1 Binding: 1
  Data for proc: [[43202,1],2]
  Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: 2 Bind location: 2 Binding: 2
  Data for proc: [[43202,1],3]
  Pid: 0 Local rank: 3 Node rank: 3 App rank: 3
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: 3 Bind location: 3 Binding: 3
  Data for proc: [[43202,1],4]
  Pid: 0 Local rank: 4 Node rank: 4 App rank: 4
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: 4 Bind location: 4 Binding: 4
  Data for proc: [[43202,1],5]
  Pid: 0 Local rank: 5 Node rank: 5 App rank: 5
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: 5 Bind location: 5 Binding: 5
  Data for proc: [[43202,1],6]
  Pid: 0 Local rank: 6 Node rank: 6 App rank: 6
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: 6 Bind location: 6 Binding: 6
  Data for proc: [[43202,1],7]
  Pid: 0 Local rank: 7 Node rank: 7 App rank: 7
  State: INITIALIZED Restarts: 0 App_context: 0 *Locale: 14* Bind location: 14 Binding: 14
[clx-orion-001:27069] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././././././././././.][./././././././././././././.]
[clx-orion-001:27069] MCW rank 1 bound to socket 0[core 1[hwt 0]]:
[./B/./././././././././././.][./././././././././././././.]
[clx-orion-001:27069] MCW rank 2 bound to socket 0[core 2[hwt 0]]:
[././B/././././././././././.][./././././././././././././.]
[clx-orion-001:27069] MCW rank 3 bound to socket 0[core 3[hwt 0]]:
[./././B/./././././././././.][./././././././././././././.]
[clx-orion-001:27069] MCW rank 4 bound to socket 0[core 4[hwt 0]]:
[././././B/././././././././.][./././././././././././././.]
[clx-orion-001:27069] MCW rank 5 bound to socket 0[core 5[hwt 0]]:
[./././././B/./././././././.][./././././././././././././.]
[clx-orion-001:27069] MCW rank 6 bound to socket 0[core 6[hwt 0]]:
[././././././B/././././././.][./././././././././././././.]
[clx-orion-001:27069] MCW rank 7 bound to socket 0[*core 7*[hwt 0]]:
[./././././././B/./././././.][./././././././././././././.]

Rank 7 should be bound to core 14 instead of core 7, since core 7 is on
the other socket.
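
The suspected mismatch can be sketched in a few lines of Python. This is only an illustration, not Open MPI code: it assumes that --report-bindings numbers cores sequentially, socket by socket, while the Locale and Bind location fields use the core numbers from the devel map. The helper names parse_cpu_list and sequential_to_listed are hypothetical; the two locale strings are copied from the first example.

```python
def parse_cpu_list(spec):
    """Expand a range list such as "0-6,14-20" into a list of ints."""
    cores = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cores.extend(range(int(lo), int(hi or lo) + 1))
    return cores


# Locale values copied from the --display-devel-map output (socket 0, socket 1).
SOCKET_LOCALES = ["0-6,14-20", "7-13,21-27"]


def sequential_to_listed(locales):
    """Map a sequential core index to (socket, core number in the locale).

    Assumption: the sequential index counts cores socket by socket,
    in the order each socket's locale lists them.
    """
    table = {}
    seq = 0
    for socket, spec in enumerate(locales):
        for core in parse_cpu_list(spec):
            table[seq] = (socket, core)
            seq += 1
    return table


table = sequential_to_listed(SOCKET_LOCALES)
print(table[14])  # (1, 7): "socket 1 core 14" in the report vs. bind location 7
print(table[7])   # (0, 14): "socket 0 core 7" in the report vs. locale 14
```

Under that assumption, both reported bindings line up with the devel-map values: the two outputs would differ only in how the cores are numbered.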

Best regards,
Elena
