This isn't a hwloc problem exactly, but maybe you can shed some light on it.

We have some 4-socket, 10-core (40 cores total) nodes, HT off:

depth 0:        1 Machine (type #1)
 depth 1:       4 NUMANodes (type #2)
  depth 2:      4 Sockets (type #3)
   depth 3:     4 Caches (type #4)
    depth 4:    40 Caches (type #4)
     depth 5:   40 Caches (type #4)
      depth 6:  40 Cores (type #5)
       depth 7: 40 PUs (type #6)


We run RHEL 6.3 and use Torque to create cgroups for jobs.  I get the following
cpuset for this job; all 12 cores for the job are on one node:
cat /dev/cpuset/torque/8845236.nyx.engin.umich.edu/cpus 
0-1,4-5,8,12,16,20,24,28,32,36

Not all nicely spaced, but 12 cores.
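
The tasks file in the same cpuset directory is another place to cross-check which
PIDs the kernel has actually attached to this cpuset; a quick sketch:

# the tasks file lists the PIDs currently attached to this cpuset
cat /dev/cpuset/torque/8845236.nyx.engin.umich.edu/tasks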

I then start a code, even a simple serial code, with Open MPI 1.6.0 across all 12
cores:
mpirun ./stream

45521 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.72 stream
45522 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   1:46.08 stream
45525 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.72 stream
45526 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   1:46.07 stream
45527 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.71 stream
45528 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.71 stream
45532 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   1:46.05 stream
45529 brockp    20   0 1885m 1.8g  456 R 99.2  0.2   4:02.70 stream
45530 brockp    20   0 1885m 1.8g  456 R 99.2  0.2   4:02.70 stream
45531 brockp    20   0 1885m 1.8g  456 R 33.6  0.2   1:20.89 stream
45523 brockp    20   0 1885m 1.8g  456 R 32.8  0.2   1:20.90 stream
45524 brockp    20   0 1885m 1.8g  456 R 32.8  0.2   1:20.89 stream

Note the processes that are not running at 100% CPU.
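
One quick way to see whether those slow ranks are time-slicing the same core is
ps's psr column (the CPU a task last ran on); a sketch along these lines:

# psr = CPU each task last ran on; slow ranks that keep showing the same
# psr value would be stacked on a single core
ps -C stream -o pid,psr,pcpu,comm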

hwloc-bind  --get --pid 45523
0x00000011,0x11111133
<the same mask is reported for all 12 processes>

hwloc-calc 0x00000011,0x11111133 --intersect PU
0,1,2,3,4,5,6,7,8,9,10,11

So all ranks in the job should see all 12 cores.  The same cgroup is reported
by /proc/<pid>/cgroup.
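
For what it's worth, the 0-11 from hwloc-calc are hwloc's logical PU numbers; the
kernel's own view of the affinity, in OS numbering, can be cross-checked without
hwloc, for example:

# kernel's view of one rank's affinity and cgroup, no hwloc involved
taskset -cp 45523
grep Cpus_allowed_list /proc/45523/status
cat /proc/45523/cgroup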

Not only that, I can make things work by forcing binding in the MPI launcher:
mpirun -bind-to-core ./stream

46886 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.49 stream
46887 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.49 stream
46888 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.48 stream
46889 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.49 stream
46890 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.48 stream
46891 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.48 stream
46892 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.47 stream
46893 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.47 stream
46894 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.47 stream
46895 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.47 stream
46896 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.46 stream
46897 brockp    20   0 1885m 1.8g  456 R 99.8  0.2   0:15.46 stream

Things are now working as expected, and I should stress that this is inside the same
Torque job and cgroup that I started with.
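
In case it is useful for comparison, Open MPI's --report-bindings option should make
mpirun print the binding it applies to each rank:

# ask mpirun to report the per-rank bindings it applies
mpirun --report-bindings -bind-to-core ./stream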

A multi-threaded version of the code does use close to 12 cores, as expected.

If I circumvent our batch system and the cgroups, a plain mpirun ./stream does
start 12 processes that each consume a full core at 100%.

Thoughts?  This is really odd Linux scheduler behavior.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985
