Perhaps it's preferable to have R and Python built against OpenBLAS. I think we just aren't following best practice here, since every user is on an R (3.3.3) built against OpenBLAS, so this behavior shows up even when the user does not need or intend to use a threaded BLAS (their jobs are multithreaded without them knowing it, essentially).
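As a sketch of the per-job workaround under discussion (a hypothetical batch script; the payload script name is made up), the idea is to cap the BLAS/OpenMP thread count at the Slurm allocation instead of the node-wide core count:

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# Cap library threading at the allocation; default to 1 when the
# variable is unset. OMP_NUM_THREADS is read by OpenMP-linked code,
# OPENBLAS_NUM_THREADS by OpenBLAS specifically.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "threads per task: $OMP_NUM_THREADS"
# srun python3.5 my_analysis.py   # hypothetical payload
```

This only papers over the real problem, of course: every user has to remember to do it, which is exactly the objection raised below.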
One suggestion was to try MKL instead, in the hope that it won't automatically gobble up all threads. I suppose we could make it mandatory for all users to set --cpus-per-task=1 and set the env variable OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK for all jobs, letting users adjust --cpus-per-task as needed, but something about that doesn't seem right, and it's messy. I would prefer that if a user does not specify --cpus-per-task or even --ntasks, the default behavior is one process on one core, with perhaps 2 threads since there are 2 "CPUs" on each core, or maybe even just one thread on one core, unless otherwise specified. How do others handle this?

Thanks for everyone's suggestions and help. It is clearer to me now that this is a software configuration/linking issue rather than a Slurm/cgroup issue, I think!

--mike

-----Original Message-----
From: Mike Cammilleri [mailto:mi...@stat.wisc.edu]
Sent: Tuesday, April 25, 2017 2:03 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: CPU config question

Ok, perhaps I'm just misinterpreting the output of top and the nTH column. Checking the cgroup files, CPUs are definitely being assigned properly. For example, I requested 5 tasks and 1 cpu per task in my batch submit script, and the cgroup file looks good:

# cat /sys/fs/cgroup/cpuset/slurm/uid_3691/job_23078/cpuset.cpus
7-9,31-33

And perhaps top is not a great tool for this, but checking on what's running, we see my app and "Last used cpu" showing only the CPUs listed in the cgroup:

top - 13:51:37 up 1 day,  3:09,  1 user,  load average: 129.28, 118.10, 94.61
Tasks: 577 total,   9 running, 568 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.1 us, 19.8 sy,  0.0 ni, 68.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  13191924+total, 14846664 used, 11707258+free,   217772 buffers
KiB Swap:  7998460 total,        0 used,  7998460 free.  3717504 cached Mem

  PID USER  PR NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND      nTH  P
28082 mikec 20  0 7272048 2.921g 20652 R 574.9  2.3 61:10.32 python3.5     48 32
27963 mikec 20  0  167048   2580  1160 S   0.0  0.0  0:00.14 sshd           1 10
27964 mikec 20  0   17680   4660  1628 S   0.0  0.0  0:00.04 bash           1 11
28077 mikec 20  0   12668   1192   984 S   0.0  0.0  0:00.00 slurm_script   1 33

Now, invoking top with -H to see total threads:

top - 13:51:52 up 1 day,  3:09,  1 user,  load average: 133.97, 119.62, 95.48
Threads: 1100 total, 141 running, 959 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.3 us, 23.2 sy,  0.0 ni, 66.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  13191924+total, 15488544 used, 11643070+free,   217780 buffers
KiB Swap:  7998460 total,        0 used,  7998460 free.  3717504 cached Mem

  PID USER  PR NI    VIRT    RES   SHR S %CPU %MEM   TIME+ COMMAND   nTH  P
28264 mikec 20  0 7835724 3.503g 20652 R 24.5  2.8 3:02.28 python3.5  48  7
28263 mikec 20  0 7835724 3.503g 20652 R 22.2  2.8 3:33.63 python3.5  48  8
28267 mikec 20  0 7835724 3.503g 20652 R 17.9  2.8 1:38.36 python3.5  48 33
28261 mikec 20  0 7835724 3.503g 20652 R 17.5  2.8 4:29.45 python3.5  48 33
28268 mikec 20  0 7835724 3.503g 20652 R 14.9  2.8 1:11.00 python3.5  48  9
28266 mikec 20  0 7835724 3.503g 20652 R 13.6  2.8 2:02.47 python3.5  48 31
28262 mikec 20  0 7835724 3.503g 20652 R 13.2  2.8 3:52.93 python3.5  48 31
28272 mikec 20  0 7835724 3.503g 20652 R 13.2  2.8 1:00.27 python3.5  48 31
28271 mikec 20  0 7835724 3.503g 20652 R 12.9  2.8 0:57.80 python3.5  48  9
28265 mikec 20  0 7835724 3.503g 20652 R 12.6  2.8 2:39.34 python3.5  48  9
28269 mikec 20  0 7835724 3.503g 20652 R 12.6  2.8 1:03.31 python3.5  48  9
28270 mikec 20  0 7835724 3.503g 20652 R 12.6  2.8 1:01.53 python3.5  48  9
28273 mikec 20  0 7835724 3.503g 20652 R 12.6  2.8 0:55.99 python3.5  48  9
28274 mikec 20  0 7835724 3.503g 20652 R 12.6  2.8 0:56.46 python3.5  48  9
28279 mikec 20  0 7835724 3.503g 20652 R 12.2  2.8 0:55.16 python3.5  48 31
28284 mikec 20  0 7835724 3.503g 20652 R 12.2  2.8 0:57.23 python3.5  48  7
28285 mikec 20  0 7835724 3.503g 20652 R 12.2  2.8 0:50.40 python3.5  48 31
28293 mikec 20  0 7835724 3.503g 20652 R 12.2  2.8 0:50.15 python3.5  48  9
28286 mikec 20  0 7835724 3.503g 20652 R 11.9  2.8 0:55.09 python3.5  48 31
28287 mikec 20  0 7835724 3.503g 20652 R 11.9  2.8 0:51.74 python3.5  48 31
28288 mikec 20  0 7835724 3.503g 20652 R 11.9  2.8 0:53.44 python3.5  48 31
28297 mikec 20  0 7835724 3.503g 20652 R 11.6  2.8 0:49.79 python3.5  48  7
28082 mikec 20  0 7835724 3.503g 20652 R 11.2  2.8 3:44.64 python3.5  48 32
28295 mikec 20  0 7835724 3.503g 20652 R 11.2  2.8 0:50.21 python3.5  48  7
28296 mikec 20  0 7835724 3.503g 20652 R 11.2  2.8 0:51.25 python3.5  48  7
28306 mikec 20  0 7835724 3.503g 20652 R 11.2  2.8 1:00.18 python3.5  48  7
28307 mikec 20  0 7835724 3.503g 20652 R 11.2  2.8 0:56.47 python3.5  48 33
28298 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 0:48.07 python3.5  48 32
28300 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 1:03.04 python3.5  48 33
28302 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 0:49.76 python3.5  48  7
28303 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 0:55.57 python3.5  48 33
28304 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 0:56.79 python3.5  48 33
28305 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 0:54.71 python3.5  48 33
28275 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 0:58.99 python3.5  48 32
28277 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 1:03.28 python3.5  48 32
28278 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 1:02.33 python3.5  48 32
28280 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 1:02.54 python3.5  48 32
28281 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 0:55.99 python3.5  48 32
28299 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 0:51.67 python3.5  48 32
28276 mikec 20  0 7835724 3.503g 20652 R 10.3  2.8 0:57.60 python3.5  48 32
28283 mikec 20  0 7835724 3.503g 20652 R  9.9  2.8 0:50.44 python3.5  48  8

I suppose I'm still confused as to why the nTH column lists 48 threads when I asked for 1 cpu per task and 5 tasks total. Instead I still see 48.
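(As a quick sanity check of the distinction at play here, the number of threads a process owns is independent of the number of CPUs it is allowed to run on; a sketch, using this shell's own PID as a stand-in for the job's python3.5 PID:)

```shell
# Count a process's kernel threads vs. the CPUs it may run on.
pid=$$   # this shell; substitute the python3.5 PID for a real job
nthreads=$(ls /proc/$pid/task | wc -l)
allowed=$(nproc)   # nproc honors the affinity mask, unlike top's nTH column
echo "threads=$nthreads allowed_cpus=$allowed"
```

For the job above this would report 48 threads but only 6 allowed CPUs, which is exactly the nTH-vs-cpuset gap being discussed.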
Is what Paul is explaining to me that although a process is confined to a CPU, it can have X threads anyway? But I thought that Slurm considered each thread a CPU, so I should only have 48 total available on the entire node. Doing a quick look at the PIDs running for this job:

# cd /proc/28082/task
# ls
28082  28264  28268  28272  28276  28280  28284  28288  28292  28296  28300  28304
28261  28265  28269  28273  28277  28281  28285  28289  28293  28297  28301  28305
28262  28266  28270  28274  28278  28282  28286  28290  28294  28298  28302  28306
28263  28267  28271  28275  28279  28283  28287  28291  28295  28299  28303  28307
# ls | wc -l
48

Further adding to my threads/CPUs confusion, it doesn't look like every CPU is utilized, even though I thought Slurm considered each thread a CPU, so in that regard the node looks okay:

%Cpu0  : 93.2 us,  6.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 15.9 us, 25.2 sy,  0.0 ni, 58.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  1.6 us, 27.9 sy,  0.0 ni, 70.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  2.4 us, 51.6 sy,  0.0 ni, 46.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 32.9 us, 51.8 sy,  0.0 ni, 15.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.8 us, 14.1 sy,  0.0 ni, 85.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  4.8 us, 94.8 sy,  0.0 ni,  0.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 16.1 us, 83.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  : 30.9 us, 69.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  : 14.1 us, 85.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu20 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 :  1.6 us, 14.4 sy,  0.0 ni, 84.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu25 : 61.4 us, 24.1 sy,  0.0 ni, 14.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 : 71.1 us, 28.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu27 : 50.6 us, 49.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 : 19.3 us, 51.8 sy,  0.0 ni, 28.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu29 : 95.2 us,  4.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu30 :  8.0 us, 92.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu31 : 16.9 us, 83.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu32 : 13.3 us, 86.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu33 : 14.9 us, 85.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu34 :  0.0 us,  0.4 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu35 :  0.4 us,  0.0 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu36 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu37 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu38 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu39 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu40 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu41 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu42 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu43 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu44 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu45 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu46 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu47 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

Should I just let this go and trust that cgroups is handling this well enough? I'm about ready to. I realize a machine can have many more threads sleeping (up to around 2 million, according to max-threads in Linux), but I would assume actively running threads should max out at 48, and we're clearly over that right now at around 141.

-----Original Message-----
From: Van Der Mark, Paul [mailto:pvanderm...@fsu.edu]
Sent: Tuesday, April 25, 2017 9:57 AM
To: slurm-dev <slurm-dev@schedmd.com>
Cc: Mike Cammilleri <mi...@stat.wisc.edu>
Subject: [slurm-dev] Re: CPU config question

Hello Mike,

Markus is absolutely right. If you request 1 core, then Slurm will give you a cgroup with 1 core. That does not stop the user from running X threads; however, they will stay confined to that 1 core. Load is not a good indicator, since it is an indication of the (Linux) run-queue utilization, and it doesn't care whether some cores are "overloaded" while others are idle. In top you can press f (fields management) and select "Last Used Cpu" to see on which core a process is running.

The issue you see is that OpenMP ignores any cgroup setting and counts all cores when computing the default for OMP_NUM_THREADS. You probably have to set this variable by hand in your Slurm script.

Best,
Paul

On Tue, 2017-04-25 at 02:32 -0700, Markus Koeberl wrote:
> On Monday 24 April 2017 22:04:49 Mike Cammilleri wrote:
> >
> > Thanks for your help on this.
> > I've enabled the cgroups plugin with these same settings:
> >
> > CgroupAutomount=yes
> > CgroupReleaseAgentDir="/etc/cgroup"
> > CgroupMountpoint=/sys/fs/cgroup
> > ConstrainCores=yes
> > ConstrainDevices=yes
> > ConstrainRAMSpace=yes
> > ConstrainSwapSpace=yes
> >
> > and put cgroup.conf in /etc for our installs.
> >
> > I can see in the Slurm logging that it's reading in cgroup.conf. I've
> > loaded the new slurm.conf, restarted all slurmd processes, and ran
> > scontrol reconfigure on the submit node.
> >
> > Memory seems to not be swapping anymore; however, I'm still having way
> > too many threads get scheduled. I've tried many combinations of
> > --cpus-per-task, --ntasks, cpu_bind=threads, whatever, and nothing
> > seems to prevent each process from having 48 threads according to 'top'.
> >
> > The most interesting thing I've found is that even a single R job
> > reports 48 threads in 'top' (by pressing F in interactive mode and
> > selecting the nTH column to display). The only thing that seems to
> > limit thread usage is setting the OMP_NUM_THREADS env variable; this it
> > will obey. But what we really need is a hard limit, so that no user who
> > thinks they're running a simple R job and requesting --ntasks 6 is
> > actually getting 6*48 threads going at once and overloading the node.
> > 48 threads is the total number of "CPUs" as the machine sees it
> > logically: it's a 24-core machine with 2 threads on each core.
> >
> > Any ideas? Could this be a non-Slurm issue and something specific to
> > our servers (running Ubuntu 14.04 LTS)? I don't want to resort to
> > turning off hyperthreading.
>
> If it is working, all processes and threads should only be allowed to
> run on the CPUs asked for and not on the others.
>
> For example:
>
> # AMD FX-8370, 8 CPUs, 8 threads (no hyperthreading)
> # all cpus slurm is allowed to use
> cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
> 0-7
> # job 666554 of user with uidnumber 1044 (asked for 1 cpu)
> cat /sys/fs/cgroup/cpuset/slurm/uid_1044/job_666554/cpuset.cpus
> 0
> # all processes and threads of job 666554 can only run on cpu 0
>
> # Intel E5-1620 v3, 4 CPUs, 8 threads (with hyperthreading)
> # all cpus slurm is allowed to use
> cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
> 0-7
> # job 758732 of user with uidnumber 1311 (asked for 1 cpu)
> cat /sys/fs/cgroup/cpuset/slurm/uid_1311/job_758732/cpuset.cpus
> 1,5
> # all processes and threads of job 758732 can only run on cpus 1 and 5
> # (core 1 with 2 threads)
>
> You may think of it like this: for the process hierarchy in a cgroup,
> the Linux kernel runs a separate scheduler. Therefore, in theory,
> processes in one cgroup will not affect processes in another cgroup.
> Slurm creates a new cgroup for each process and, with
> ConstrainCores=yes, also pins it to CPU cores.
>
> Therefore, the wrong number of processes and threads should not cause
> any problem. In your case (asking for 6 CPUs with hyperthreading), only
> 12 of the 48 threads can run at the same time.
>
> Concerning the program: the program could use the information in the
> cgroup's cpuset.cpus, or Slurm environment variables, to determine how
> many threads may run, instead of taking the total number.
>
> regards
> Markus Köberl
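Markus's last point, sizing the thread count from the job's cpuset rather than the node-wide core count, could be sketched like this in a job script. This is only a sketch: the cgroup path follows the cgroup-v1 layout shown earlier in the thread, and the nproc fallback is an assumption for when the file isn't present.

```shell
# Derive a sane thread count from the job's cpuset instead of the
# node-wide core count (path assumes the cgroup-v1 layout above).
job_cpuset=/sys/fs/cgroup/cpuset/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/cpuset.cpus
if [ -r "$job_cpuset" ]; then
    # Expand ranges like "7-9,31-33" into a CPU count (3 + 3 = 6 here).
    nthreads=$(tr ',' '\n' < "$job_cpuset" |
               awk -F- '{n += ($2 == "" ? 1 : $2 - $1 + 1)} END {print n}')
else
    nthreads=$(nproc)   # fallback: nproc respects the affinity mask
fi
export OMP_NUM_THREADS=$nthreads
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

For the job shown above ("7-9,31-33") this would yield OMP_NUM_THREADS=6 rather than the 48 that OpenMP picks by default.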