On 03/10/2013 02:56, Panos Labropoulos wrote:
> Hello,
>
>
> I initially posted this at us...@open-mpi.org <mailto:us...@open-mpi.org>.
>
> We seem to be unable to set the CPU binding on a cluster consisting
> of Dell M420/M610 systems:
>
> [jallan@hpc21 ~]$ cat report-bindings.sh
> #!/bin/sh
>
> bitmap=`hwloc-bind --get -p`
> friendly=`hwloc-calc -p -H socket.core.pu $bitmap`
>
> echo "MCW rank $OMPI_COMM_WORLD_RANK (`hostname`): $friendly"
> exit 0
>
>
> [jallan@hpc27 ~]$ hwloc-bind -v  socket:0.core:0 -l ./report-bindings.sh
> using object #0 depth 2 below cpuset 0x000000ff
> using object #0 depth 6 below cpuset 0x00000080
> adding 0x00000080 to 0x0
> adding 0x00000080 to 0x0
> assuming the command starts at ./report-bindings.sh
> binding on cpu set 0x00000080
> MCW rank  (hpc27): Socket:0.Core:10.PU:7
> [jallan@hpc27 ~]$ hwloc-bind -v  socket:1.core:0 -l ./report-bindings.sh
> object #1 depth 2 (type socket) below cpuset 0x000000ff does not exist
> adding 0x0 to 0x0
> assuming the command starts at ./report-bindings.sh
> MCW rank  (hpc27): Socket:0.Core:10.PU:7
>
>
> The topology of this system looks a bit strange:
>
> [jallan@hpc21 ~]$ lstopo --no-io
> Machine (24GB)
>  NUMANode L#0 (P#0 24GB)
>  NUMANode L#1 (P#1) + Socket L#0 + L3 L#0 (15MB) + L2 L#0 (256KB) + L1
> L#0 (32KB) + Core L#0 + PU L#0 (P#11)
> [jallan@hpc21 ~]$


You likely have a Linux cpuset that restricts the available CPUs.
That's why the first socket object doesn't appear in lstopo above. And
that's why "socket:1" fails in other commands: there's no socket with
logical index 1.
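
One way to check for such a restriction without hwloc is to compare the
CPUs the current shell may run on with the CPUs that are online. The
following is only a sketch, assuming a Linux node that exposes
/proc/self/status and /sys/devices/system/cpu/online; count_cpus is a
hypothetical helper written for this illustration, not part of hwloc.

```shell
#!/bin/sh
# Hypothetical helper: count the CPUs in a Linux list-format string like "0-3,8".
count_cpus() {
    total=0
    old_ifs=$IFS
    IFS=,
    for range in $1; do
        first=${range%-*}   # "0-3" -> 0, "8" -> 8
        last=${range#*-}    # "0-3" -> 3, "8" -> 8
        total=$((total + last - first + 1))
    done
    IFS=$old_ifs
    echo "$total"
}

# Assumption: these two files exist (Linux only); otherwise nothing is printed.
status=/proc/self/status
online_file=/sys/devices/system/cpu/online
if [ -r "$status" ] && [ -r "$online_file" ]; then
    # Cpus_allowed_list is the affinity of this process; a cpuset narrows it,
    # but so would an earlier taskset or hwloc-bind in the same shell.
    allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' "$status")
    online=$(cat "$online_file")
    echo "allowed CPUs: $allowed"
    echo "online CPUs:  $online"
    if [ "$(count_cpus "$allowed")" -lt "$(count_cpus "$online")" ]; then
        echo "this shell is restricted to a subset of the machine"
    fi
fi
```

If the cpuset guess is right, the allowed list inside your job on hpc21
would presumably contain only CPU 11, the single PU that lstopo shows.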

If you're allocating jobs with a batch scheduler, the problem will go
away if you reserve all cores of the node instead of a single one.

If you really want to do manual binding on that restricted platform,
you have to account for the unavailable resources yourself: logical
indexes only count the objects that the cpuset leaves visible.

Otherwise you can generate the entire topology with "lstopo
--whole-system foo.xml" and then use it with "normal" socket numbers:
"hwloc-bind -i foo.xml socket:1.core:0 etc". You won't get errors about
objects anymore, but you may get new errors about failures to bind if
you try to bind to objects outside the restricted topology.

Brice
