Re: [hwloc-users] Strange binding issue on 40 core nodes and cgroups

2012-11-05 Thread Samuel Thibault
Brice Goglin wrote on Mon 05 Nov 2012 23:23:42 +0100:
> top can also sort by the last used CPU. Type f to enter the config menu,
> highlight the "last cpu" line, and hit 's' to make it the sort column.

With older versions of top, type F, then j, then space.
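If you just want a one-shot, non-interactive view, ps can report the same
field directly (psr is the CPU a process last ran on), e.g. with <pid>
standing in for the process of interest:

ps -o pid,psr,pcpu,comm -p <pid>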

Samuel


Re: [hwloc-users] Strange binding issue on 40 core nodes and cgroups

2012-11-02 Thread Brice Goglin
On 02/11/2012 21:22, Brice Goglin wrote:
> hwloc-bind --get-last-cpu-location --pid <pid> should give the same
> info but it seems broken on my machine right now, going to debug.

Actually, that works fine once you try it on a non-multithreaded program
that uses all cores :)

So you can use top or hwloc-bind --get-last-cpu-location --pid <pid> to
find out where each process runs.
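For example, to poll all the stream ranks at once (a sketch; pgrep and the
process name are assumptions about your setup):

for pid in $(pgrep stream); do
    printf '%s: ' "$pid"
    hwloc-bind --get-last-cpu-location --pid "$pid"
done

Each line is the cpuset mask of the PU where that rank last ran.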

Brice



Re: [hwloc-users] Strange binding issue on 40 core nodes and cgroups

2012-11-02 Thread Brice Goglin
On 02/11/2012 21:03, Brock Palen wrote:
> This isn't exactly an hwloc problem, but maybe you can shed some light on it.
>
> We have some 4-socket, 10-core (40 cores total) nodes, HT off:
>
> depth 0:        1 Machine (type #1)
>  depth 1:       4 NUMANodes (type #2)
>   depth 2:      4 Sockets (type #3)
>    depth 3:     4 Caches (type #4)
>     depth 4:    40 Caches (type #4)
>      depth 5:   40 Caches (type #4)
>       depth 6:  40 Cores (type #5)
>        depth 7: 40 PUs (type #6)
>
>
> We run RHEL 6.3 and use Torque to create cgroups for jobs.  I get the
> following cgroup for this job; all 12 cores for the job are on one node:
> cat /dev/cpuset/torque/8845236.nyx.engin.umich.edu/cpus
> 0-1,4-5,8,12,16,20,24,28,32,36
>
> Not all nicely spaced, but 12 cores.
>
> I then start a code, even a simple serial one, with Open MPI 1.6.0 on all 12
> cores:
> mpirun ./stream
>
>   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
> 45521 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.72 stream
> 45522 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   1:46.08 stream
> 45525 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.72 stream
> 45526 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   1:46.07 stream
> 45527 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.71 stream
> 45528 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.71 stream
> 45532 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   1:46.05 stream
> 45529 brockp    20   0 1885m 1.8g  456 R  99.2  0.2   4:02.70 stream
> 45530 brockp    20   0 1885m 1.8g  456 R  99.2  0.2   4:02.70 stream
> 45531 brockp    20   0 1885m 1.8g  456 R  33.6  0.2   1:20.89 stream
> 45523 brockp    20   0 1885m 1.8g  456 R  32.8  0.2   1:20.90 stream
> 45524 brockp    20   0 1885m 1.8g  456 R  32.8  0.2   1:20.89 stream
>
> Note the processes that are not running at 100% CPU.
>
> hwloc-bind --get --pid 45523
> 0x00000011,0x11111133
> 

Hello Brock,

I don't see anything helpful to answer here :/

Do you know which core is overloaded and which (two?) cores are idle?
Does that change during one run or from one run to another?
Pressing 1 in top should give that information in the very first lines.
Then, you can try binding another process to one of the idle cores, to
see if the kernel accepts that; a sketch follows below.
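A minimal sketch of that test, assuming PU 36 turns out to be one of the
idle ones (substitute whatever top reports; --physical makes hwloc-bind
take the OS index that top shows):

hwloc-bind --physical pu:36 -- sh -c 'while :; do :; done'

If the kernel accepts the binding, top should show this busy loop pinned
to that CPU.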

You can also press "f" and "j" (or "f" and use arrows and space to
select "last used cpu") to add a "P" column which shows the last CPU
used by each process.
hwloc-bind --get-last-cpu-location --pid <pid> should give the same info
but it seems broken on my machine right now, going to debug.
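Its output is a cpuset mask; feeding it back through hwloc-calc turns it
into a PU index, e.g. (with <pid> standing in for a real process ID):

hwloc-calc $(hwloc-bind --get-last-cpu-location --pid <pid>) --intersect PU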

One thing to check would be to run more than 12 processes and see where
the kernel puts them (see the sketch below). If it keeps ignoring two
cores, that would be funny :)
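For example (a sketch; the -np value is arbitrary, and hwloc-ps ships
with hwloc):

mpirun -np 16 ./stream &
sleep 5
hwloc-ps -a | grep stream

hwloc-ps lists each process with its binding; combined with top's per-CPU
view (press 1), it should show whether two of the twelve allowed CPUs
stay idle no matter how many processes run.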

Brice



[hwloc-users] Strange binding issue on 40 core nodes and cgroups

2012-11-02 Thread Brock Palen
This isn't exactly an hwloc problem, but maybe you can shed some light on it.

We have some 4-socket, 10-core (40 cores total) nodes, HT off:

depth 0:        1 Machine (type #1)
 depth 1:       4 NUMANodes (type #2)
  depth 2:      4 Sockets (type #3)
   depth 3:     4 Caches (type #4)
    depth 4:    40 Caches (type #4)
     depth 5:   40 Caches (type #4)
      depth 6:  40 Cores (type #5)
       depth 7: 40 PUs (type #6)


We run RHEL 6.3 and use Torque to create cgroups for jobs.  I get the following
cgroup for this job; all 12 cores for the job are on one node:
cat /dev/cpuset/torque/8845236.nyx.engin.umich.edu/cpus
0-1,4-5,8,12,16,20,24,28,32,36

Not all nicely spaced, but 12 cores.
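The kernel's per-process view of that restriction can be double-checked
too, e.g. (with <pid> one of the rank PIDs):

cat /proc/<pid>/cpuset
grep Cpus_allowed_list /proc/<pid>/status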

I then start a code, even a simple serial one, with Open MPI 1.6.0 on all 12
cores:
mpirun ./stream

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
45521 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.72 stream
45522 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   1:46.08 stream
45525 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.72 stream
45526 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   1:46.07 stream
45527 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.71 stream
45528 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   4:02.71 stream
45532 brockp    20   0 1885m 1.8g  456 R 100.0  0.2   1:46.05 stream
45529 brockp    20   0 1885m 1.8g  456 R  99.2  0.2   4:02.70 stream
45530 brockp    20   0 1885m 1.8g  456 R  99.2  0.2   4:02.70 stream
45531 brockp    20   0 1885m 1.8g  456 R  33.6  0.2   1:20.89 stream
45523 brockp    20   0 1885m 1.8g  456 R  32.8  0.2   1:20.90 stream
45524 brockp    20   0 1885m 1.8g  456 R  32.8  0.2   1:20.89 stream

Note the processes that are not running at 100% CPU.

hwloc-bind --get --pid 45523
0x00000011,0x11111133


hwloc-calc 0x00000011,0x11111133 --intersect PU
0,1,2,3,4,5,6,7,8,9,10,11
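The reverse direction works as well, if you want to rebuild the mask from
logical PU indices and compare:

hwloc-calc pu:0-11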

So all ranks in the job should see all 12 cores.  The same cgroup is reported
by /proc/<pid>/cgroup.

Not only that, I can make things work by forcing binding in the MPI launcher:
mpirun -bind-to-core ./stream

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
46886 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.49 stream
46887 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.49 stream
46888 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.48 stream
46889 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.49 stream
46890 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.48 stream
46891 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.48 stream
46892 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.47 stream
46893 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.47 stream
46894 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.47 stream
46895 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.47 stream
46896 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.46 stream
46897 brockp    20   0 1885m 1.8g  456 R  99.8  0.2   0:15.46 stream

Things are now working as expected, and I should stress this is inside the same
Torque job and cgroup that I started with.
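The per-rank binding can be confirmed the same way as before, e.g. for the
first PID in the listing above:

hwloc-bind --get --pid 46886

which should now report a single-PU mask instead of the full 12-PU one.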

A multi-threaded version of the code does use close to 12 cores, as expected.

If I circumvent our batch system and the cgroups, a plain mpirun ./stream does
start 12 processes that each consume a full core at 100%.

Thoughts?  This is really odd Linux scheduler behavior.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985