Perhaps it's preferable to have R and Python built against OpenBLAS. I think we just aren't following best practice here, since every user is on an R (3.3.3) built against OpenBLAS, so this behavior shows up even when the user does not need or intend to use a threaded BLAS (their jobs are multithreaded without them knowing it, essentially).
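As a sketch of the per-job workaround under discussion (a hypothetical batch script; the payload script name is made up), the idea is to cap the BLAS/OpenMP thread count at the Slurm allocation instead of the node-wide core count:

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# Cap library threading at the allocation; default to 1 when the
# variable is unset. OMP_NUM_THREADS is read by OpenMP-linked code,
# OPENBLAS_NUM_THREADS by OpenBLAS specifically.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "threads per task: $OMP_NUM_THREADS"
# srun python3.5 my_analysis.py   # hypothetical payload
```

This only papers over the real problem, of course: every user has to remember to do it, which is exactly the objection raised below.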
One suggestion was to try MKL instead, in the hope that it won't automatically gobble up all threads. I suppose we could make it mandatory for all users to set --cpus-per-task=1 and set the env variable OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK for all jobs, letting users adjust --cpus-per-task as needed, but something about that doesn't seem right, and it's messy. I would prefer that if a user does not specify --cpus-per-task or even --ntasks, the default behavior is one process on one core, with perhaps 2 threads since there are 2 "CPUs" on each core, or maybe even just one thread on one core, unless otherwise specified. How do others handle this?

Thanks for everyone's suggestions and help. It is clearer to me now that this is a software configuration/linking issue rather than a Slurm/cgroup issue, I think!

--mike

-----Original Message-----
From: Mike Cammilleri [mailto:mi...@stat.wisc.edu]
Sent: Tuesday, April 25, 2017 2:03 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: CPU config question

Ok, perhaps I'm just misinterpreting the output of top and the nTH column. Checking the cgroup files, CPUs are definitely being assigned properly. For example, I requested 5 tasks and 1 cpu per task in my batch submit script, and the cgroup file looks good:

# cat /sys/fs/cgroup/cpuset/slurm/uid_3691/job_23078/cpuset.cpus
7-9,31-33

And perhaps top is not a great tool for this, but checking on what's running, we see my app and "Last used cpu" showing only the CPUs listed in the cgroup:

top - 13:51:37 up 1 day,  3:09,  1 user,  load average: 129.28, 118.10, 94.61
Tasks: 577 total,   9 running, 568 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.1 us, 19.8 sy,  0.0 ni, 68.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  13191924+total, 14846664 used, 11707258+free,   217772 buffers
KiB Swap:  7998460 total,        0 used,  7998460 free.  3717504 cached Mem

  PID USER  PR NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND      nTH  P
28082 mikec 20  0 7272048 2.921g 20652 R 574.9  2.3 61:10.32 python3.5     48 32
27963 mikec 20  0  167048   2580  1160 S   0.0  0.0  0:00.14 sshd           1 10
27964 mikec 20  0   17680   4660  1628 S   0.0  0.0  0:00.04 bash           1 11
28077 mikec 20  0   12668   1192   984 S   0.0  0.0  0:00.00 slurm_script   1 33

Now, invoking top with -H to see total threads:

top - 13:51:52 up 1 day,  3:09,  1 user,  load average: 133.97, 119.62, 95.48
Threads: 1100 total, 141 running, 959 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.3 us, 23.2 sy,  0.0 ni, 66.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  13191924+total, 15488544 used, 11643070+free,   217780 buffers
KiB Swap:  7998460 total,        0 used,  7998460 free.  3717504 cached Mem

  PID USER  PR NI    VIRT    RES   SHR S %CPU %MEM   TIME+ COMMAND   nTH  P
28264 mikec 20  0 7835724 3.503g 20652 R 24.5  2.8 3:02.28 python3.5  48  7
28263 mikec 20  0 7835724 3.503g 20652 R 22.2  2.8 3:33.63 python3.5  48  8
28267 mikec 20  0 7835724 3.503g 20652 R 17.9  2.8 1:38.36 python3.5  48 33
28261 mikec 20  0 7835724 3.503g 20652 R 17.5  2.8 4:29.45 python3.5  48 33
28268 mikec 20  0 7835724 3.503g 20652 R 14.9  2.8 1:11.00 python3.5  48  9
28266 mikec 20  0 7835724 3.503g 20652 R 13.6  2.8 2:02.47 python3.5  48 31
28262 mikec 20  0 7835724 3.503g 20652 R 13.2  2.8 3:52.93 python3.5  48 31
28272 mikec 20  0 7835724 3.503g 20652 R 13.2  2.8 1:00.27 python3.5  48 31
28271 mikec 20  0 7835724 3.503g 20652 R 12.9  2.8 0:57.80 python3.5  48  9
28265 mikec 20  0 7835724 3.503g 20652 R 12.6  2.8 2:39.34 python3.5  48  9
28269 mikec 20  0 7835724 3.503g 20652 R 12.6  2.8 1:03.31 python3.5  48  9
28270 mikec 20  0 7835724 3.503g 20652 R 12.6  2.8 1:01.53 python3.5  48  9
28273 mikec 20  0 7835724 3.503g 20652 R 12.6  2.8 0:55.99 python3.5  48  9
28274 mikec 20  0 7835724 3.503g 20652 R 12.6  2.8 0:56.46 python3.5  48  9
28279 mikec 20  0 7835724 3.503g 20652 R 12.2  2.8 0:55.16 python3.5  48 31
28284 mikec 20  0 7835724 3.503g 20652 R 12.2  2.8 0:57.23 python3.5  48  7
28285 mikec 20  0 7835724 3.503g 20652 R 12.2  2.8 0:50.40 python3.5  48 31
28293 mikec 20  0 7835724 3.503g 20652 R 12.2  2.8 0:50.15 python3.5  48  9
28286 mikec 20  0 7835724 3.503g 20652 R 11.9  2.8 0:55.09 python3.5  48 31
28287 mikec 20  0 7835724 3.503g 20652 R 11.9  2.8 0:51.74 python3.5  48 31
28288 mikec 20  0 7835724 3.503g 20652 R 11.9  2.8 0:53.44 python3.5  48 31
28297 mikec 20  0 7835724 3.503g 20652 R 11.6  2.8 0:49.79 python3.5  48  7
28082 mikec 20  0 7835724 3.503g 20652 R 11.2  2.8 3:44.64 python3.5  48 32
28295 mikec 20  0 7835724 3.503g 20652 R 11.2  2.8 0:50.21 python3.5  48  7
28296 mikec 20  0 7835724 3.503g 20652 R 11.2  2.8 0:51.25 python3.5  48  7
28306 mikec 20  0 7835724 3.503g 20652 R 11.2  2.8 1:00.18 python3.5  48  7
28307 mikec 20  0 7835724 3.503g 20652 R 11.2  2.8 0:56.47 python3.5  48 33
28298 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 0:48.07 python3.5  48 32
28300 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 1:03.04 python3.5  48 33
28302 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 0:49.76 python3.5  48  7
28303 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 0:55.57 python3.5  48 33
28304 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 0:56.79 python3.5  48 33
28305 mikec 20  0 7835724 3.503g 20652 R 10.9  2.8 0:54.71 python3.5  48 33
28275 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 0:58.99 python3.5  48 32
28277 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 1:03.28 python3.5  48 32
28278 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 1:02.33 python3.5  48 32
28280 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 1:02.54 python3.5  48 32
28281 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 0:55.99 python3.5  48 32
28299 mikec 20  0 7835724 3.503g 20652 R 10.6  2.8 0:51.67 python3.5  48 32
28276 mikec 20  0 7835724 3.503g 20652 R 10.3  2.8 0:57.60 python3.5  48 32
28283 mikec 20  0 7835724 3.503g 20652 R  9.9  2.8 0:50.44 python3.5  48  8

I suppose I'm still confused as to why the nTH column lists 48 threads when I asked for 1 cpu per task and 5 tasks total. Instead I still see 48.
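(As a quick sanity check of the distinction at play here, the number of threads a process owns is independent of the number of CPUs it is allowed to run on; a sketch, using this shell's own PID as a stand-in for the job's python3.5 PID:)

```shell
# Count a process's kernel threads vs. the CPUs it may run on.
pid=$$   # this shell; substitute the python3.5 PID for a real job
nthreads=$(ls /proc/$pid/task | wc -l)
allowed=$(nproc)   # nproc honors the affinity mask, unlike top's nTH column
echo "threads=$nthreads allowed_cpus=$allowed"
```

For the job above this would report 48 threads but only 6 allowed CPUs, which is exactly the nTH-vs-cpuset gap being discussed.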
Is what Paul is explaining to me that although a process is confined to a CPU, it can have X threads anyway? But I thought that Slurm considered each thread a CPU, so I should only have 48 total available on the entire node. Doing a quick look at the PIDs running for this job:

# cd /proc/28082/task
# ls
28082  28264  28268  28272  28276  28280  28284  28288  28292  28296  28300  28304
28261  28265  28269  28273  28277  28281  28285  28289  28293  28297  28301  28305
28262  28266  28270  28274  28278  28282  28286  28290  28294  28298  28302  28306
28263  28267  28271  28275  28279  28283  28287  28291  28295  28299  28303  28307
# ls | wc -l
48

Further adding to my threads/CPUs confusion, it doesn't look like every CPU is utilized, even though I thought Slurm considered each thread a CPU, so in that regard the node looks okay:

%Cpu0  : 93.2 us,  6.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 15.9 us, 25.2 sy,  0.0 ni, 58.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  1.6 us, 27.9 sy,  0.0 ni, 70.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  2.4 us, 51.6 sy,  0.0 ni, 46.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 32.9 us, 51.8 sy,  0.0 ni, 15.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.8 us, 14.1 sy,  0.0 ni, 85.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  4.8 us, 94.8 sy,  0.0 ni,  0.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 16.1 us, 83.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  : 30.9 us, 69.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  : 14.1 us, 85.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu20 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu24 :  1.6 us, 14.4 sy,  0.0 ni, 84.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu25 : 61.4 us, 24.1 sy,  0.0 ni, 14.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu26 : 71.1 us, 28.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu27 : 50.6 us, 49.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu28 : 19.3 us, 51.8 sy,  0.0 ni, 28.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu29 : 95.2 us,  4.8 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu30 :  8.0 us, 92.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu31 : 16.9 us, 83.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu32 : 13.3 us, 86.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu33 : 14.9 us, 85.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu34 :  0.0 us,  0.4 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu35 :  0.4 us,  0.0 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu36 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu37 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu38 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu39 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu40 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu41 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu42 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu43 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu44 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu45 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu46 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu47 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

Should I just let this go and trust that cgroups is handling this well enough? I'm about ready to. I realize a machine can have many more threads sleeping (up to around 2 million, according to max-threads in Linux), but I would assume actively running threads should max out at 48, and we're clearly over that right now at around 141.

-----Original Message-----
From: Van Der Mark, Paul [mailto:pvanderm...@fsu.edu]
Sent: Tuesday, April 25, 2017 9:57 AM
To: slurm-dev <slurm-dev@schedmd.com>
Cc: Mike Cammilleri <mi...@stat.wisc.edu>
Subject: [slurm-dev] Re: CPU config question

Hello Mike,

Markus is absolutely right. If you request 1 core, then Slurm will give you a cgroup with 1 core. That does not stop the user from running X threads; however, they will stay confined to that 1 core. Load is not a good indicator, since it is an indication of the (Linux) run-queue utilization, and it doesn't care whether some cores are "overloaded" while others are idle. In top you can press f (fields management) and select "Last Used Cpu" to see on which core a process is running.

The issue you see is that OpenMP ignores any cgroup setting and counts all cores when computing the default for OMP_NUM_THREADS. You probably have to set this variable by hand in your Slurm script.

Best,
Paul

On Tue, 2017-04-25 at 02:32 -0700, Markus Koeberl wrote:
> On Monday 24 April 2017 22:04:49 Mike Cammilleri wrote:
> >
> > Thanks for your help on this.
> > I've enabled the cgroups plugin with these same settings:
> >
> > CgroupAutomount=yes
> > CgroupReleaseAgentDir="/etc/cgroup"
> > CgroupMountpoint=/sys/fs/cgroup
> > ConstrainCores=yes
> > ConstrainDevices=yes
> > ConstrainRAMSpace=yes
> > ConstrainSwapSpace=yes
> >
> > and put cgroup.conf in /etc for our installs.
> >
> > I can see in the Slurm logging that it's reading in cgroup.conf. I've
> > loaded the new slurm.conf, restarted all slurmd processes, and ran
> > scontrol reconfigure on the submit node.
> >
> > Memory seems to not be swapping anymore; however, I'm still having way
> > too many threads get scheduled. I've tried many combinations of
> > --cpus-per-task, --ntasks, cpu_bind=threads, whatever, and nothing
> > seems to prevent each process from having 48 threads according to 'top'.
> >
> > The most interesting thing I've found is that even a single R job
> > reports 48 threads in 'top' (by pressing F in interactive mode and
> > selecting the nTH column to display). The only thing that seems to
> > limit thread usage is setting the OMP_NUM_THREADS env variable; this it
> > will obey. But what we really need is a hard limit, so that no user who
> > thinks they're running a simple R job and requesting --ntasks 6 is
> > actually getting 6*48 threads going at once and overloading the node.
> > 48 threads is the total number of "CPUs" as the machine sees it
> > logically: it's a 24-core machine with 2 threads on each core.
> >
> > Any ideas? Could this be a non-Slurm issue and something specific to
> > our servers (running Ubuntu 14.04 LTS)? I don't want to resort to
> > turning off hyperthreading.
>
> If it is working, all processes and threads should only be allowed to
> run on the CPUs asked for and not on the others.
>
> For example:
>
> # AMD FX-8370, 8 CPUs, 8 threads (no hyperthreading)
> # all cpus slurm is allowed to use
> cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
> 0-7
> # job 666554 of user with uidnumber 1044 (asked for 1 cpu)
> cat /sys/fs/cgroup/cpuset/slurm/uid_1044/job_666554/cpuset.cpus
> 0
> # all processes and threads of job 666554 can only run on cpu 0
>
> # Intel E5-1620 v3, 4 CPUs, 8 threads (with hyperthreading)
> # all cpus slurm is allowed to use
> cat /sys/fs/cgroup/cpuset/slurm/cpuset.cpus
> 0-7
> # job 758732 of user with uidnumber 1311 (asked for 1 cpu)
> cat /sys/fs/cgroup/cpuset/slurm/uid_1311/job_758732/cpuset.cpus
> 1,5
> # all processes and threads of job 758732 can only run on cpus 1 and 5
> # (core 1 with 2 threads)
>
> You may think of it like this: for the process hierarchy in a cgroup,
> the Linux kernel runs a separate scheduler. Therefore, in theory,
> processes in one cgroup will not affect processes in another cgroup.
> Slurm creates a new cgroup for each process and, with
> ConstrainCores=yes, also pins it to CPU cores.
>
> Therefore, the wrong number of processes and threads should not cause
> any problem. In your case (asking for 6 CPUs with hyperthreading), only
> 12 of the 48 threads can run at the same time.
>
> Concerning the program: the program could use the information in the
> cgroup's cpuset.cpus, or Slurm environment variables, to determine how
> many threads may run, instead of taking the total number.
>
> regards
> Markus Köberl
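Markus's last point, sizing the thread count from the job's cpuset rather than the node-wide core count, could be sketched like this in a job script. This is only a sketch: the cgroup path follows the cgroup-v1 layout shown earlier in the thread, and the nproc fallback is an assumption for when the file isn't present.

```shell
# Derive a sane thread count from the job's cpuset instead of the
# node-wide core count (path assumes the cgroup-v1 layout above).
job_cpuset=/sys/fs/cgroup/cpuset/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/cpuset.cpus
if [ -r "$job_cpuset" ]; then
    # Expand ranges like "7-9,31-33" into a CPU count (3 + 3 = 6 here).
    nthreads=$(tr ',' '\n' < "$job_cpuset" |
               awk -F- '{n += ($2 == "" ? 1 : $2 - $1 + 1)} END {print n}')
else
    nthreads=$(nproc)   # fallback: nproc respects the affinity mask
fi
export OMP_NUM_THREADS=$nthreads
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

For the job shown above ("7-9,31-33") this would yield OMP_NUM_THREADS=6 rather than the 48 that OpenMP picks by default.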