I started a thread on understanding QOS, but quickly realised I had made a fundamental error in my configuration. I fixed that problem last week. (ref: https://groups.google.com/forum/#!msg/slurm-devel/dqL30WwmrmU/SoOMHmRVDAAJ )
Despite these changes, the issue remains, so I would like to ask again, with more background information and more analysis.

Desired scenario: any one user can only ever have jobs adding up to 90 CPUs running at a time. They can submit requests for more than this, but their running jobs will max out at 90 CPUs and the rest will wait in the queue. A "CPU" here means a thread, on systems with 2 sockets, each with 10 cores, each core with 2 threads (i.e. cat /proc/cpuinfo on any node reports 40 CPUs, so we configured Slurm to use 40 CPUs per node).

Current scenario: users are getting every CPU they have requested, blocking other users from the partitions. Our users are able to use 40 CPUs per node, so we know that every thread is available as a consumable resource, as we wanted, and sinfo -o %C confirms that utilization is being counted with the thread as the CPU measure. Yet, as noted above, squeue shows users with running jobs totalling well over 90 CPUs.

Here is squeue showing allocated CPUs. Note that both running users have more than 90 CPUs (threads) each:

$ squeue -o "%.4C %8q %.8i %.9P %.8j %.8u %.8T %.10M %.9l"
CPUS QOS        JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI
   8 normal    193424      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
   8 normal    193423      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
   8 normal    193422      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
  20 normal    189360      prod MuVd_WGS lij@pete  RUNNING   23:49:15 6-00:00:00
  20 normal    189353      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
  20 normal    189354      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
  20 normal    189356      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
  20 normal    189358      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
   8 normal    193417      prod    Halo3 kamarasi  RUNNING       0:01 1-00:00:00
   8 normal    193416      prod    Halo3 kamarasi  RUNNING       0:18 1-00:00:00
   8 normal    193415      prod    Halo3 kamarasi  RUNNING       0:19 1-00:00:00
   8 normal    193414      prod    Halo3 kamarasi  RUNNING       0:47 1-00:00:00
   8 normal    193413      prod    Halo3 kamarasi  RUNNING       2:08 1-00:00:00
   8 normal    193412      prod    Halo3 kamarasi  RUNNING       2:09 1-00:00:00
   8 normal    193411      prod    Halo3 kamarasi  RUNNING       3:24 1-00:00:00
   8 normal    193410      prod    Halo3 kamarasi  RUNNING       5:04 1-00:00:00
   8 normal    193409      prod    Halo3 kamarasi  RUNNING       5:06 1-00:00:00
   8 normal    193408      prod    Halo3 kamarasi  RUNNING       7:40 1-00:00:00
   8 normal    193407      prod    Halo3 kamarasi  RUNNING      10:48 1-00:00:00
   8 normal    193406      prod    Halo3 kamarasi  RUNNING      10:50 1-00:00:00
   8 normal    193405      prod    Halo3 kamarasi  RUNNING      11:34 1-00:00:00
   8 normal    193404      prod    Halo3 kamarasi  RUNNING      12:00 1-00:00:00
   8 normal    193403      prod    Halo3 kamarasi  RUNNING      12:10 1-00:00:00
   8 normal    193402      prod    Halo3 kamarasi  RUNNING      12:21 1-00:00:00
   8 normal    193401      prod    Halo3 kamarasi  RUNNING      12:40 1-00:00:00
   8 normal    193400      prod    Halo3 kamarasi  RUNNING      17:02 1-00:00:00
   8 normal    193399      prod    Halo3 kamarasi  RUNNING      21:03 1-00:00:00
   8 normal    193396      prod    Halo3 kamarasi  RUNNING      22:01 1-00:00:00
   8 normal    193394      prod    Halo3 kamarasi  RUNNING      23:40 1-00:00:00
   8 normal    193393      prod    Halo3 kamarasi  RUNNING      25:21 1-00:00:00
   8 normal    193390      prod    Halo3 kamarasi  RUNNING      25:58 1-00:00:00
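To make the totals explicit, this is a quick shell sketch I use to sum allocated CPUs per user (the %u and %C format fields are standard squeue ones; the awk aggregation is my own):

$ squeue -h -t RUNNING -o "%u %C" | awk '{cpus[$1] += $2} END {for (u in cpus) print u, cpus[u]}'
kamarasi 184
lij@pete 100

Assuming the listing above is complete, that is 23 x 8 = 184 CPUs for kamarasi and 5 x 20 = 100 CPUs for lij@pete, both over the intended 90-CPU cap.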
Yet when I run squeue showing the allocated Sockets:Cores:Threads (S:C:T):

$ squeue -o "%z %q %.8i %.9P %.8j %.8u %.8T %.10M %.9l"
S:C:T QOS      JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI
*:*:* normal  193441      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
*:*:* normal  193440      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
*:*:* normal  193439      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
....

i.e. no CPUs ("threads") appear to have been requested at all. How can this be?

The sbatch files in question look like:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
srun -n 1 <command>

and

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=20
srun -n 1 <command>

Ah. Is this the problem? Neither user has requested any CPUs, only tasks. The docs for sbatch and srun don't mention a way to explicitly ask for threads-as-CPUs, but there is --cpus-per-task, which we have never used because its default is 1, which is what we wanted. So has the accounting/priority/scheduling system simply not accounted for those CPUs? Apparently not. I ran four tests:

1. #SBATCH --cpus-per-task=1
2. srun -n 1 -c 1 <command>
3. #SBATCH --cpus-per-task=1 AND srun -n 1 -c 1 <command>
4. setting the environment variable SLURM_CPUS_PER_TASK=1

None of these returned any values for S:C:T (squeue still shows *:*:*). I didn't continue with further permutations because I was getting the feeling that this wasn't the problem.

Now I'm at a loss. Is using Slurm with threads as CPUs the problem, i.e. is it simply not designed to work that way?

So the question remains: how do I effectively stop people from running more than X CPUs' worth of jobs simultaneously? Or, alternatively, what have I done wrong in setting up QOS such that this can happen? The relevant configuration details are below.

slurm.conf defines:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU
AccountingStorageEnforce=qos
NodeName=stpr-res-compute[01-02] CPUs=40 RealMemory=385000 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
NodeName=papr-res-compute[01-09] CPUs=40 RealMemory=385000 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN

NOTES: we chose QOS because MaxTRESPerUser isn't available on the Account object, which would otherwise have let us use "limits". Assigning GrpTRES on a per-association basis would require touching/managing each association; not impossible, but clunky compared with putting a QOS on the partitions.

sacctmgr defines all human users as belonging to QOS normal:

$ sacctmgr show qos format=Name,Priority,PreemptMode,UsageFactor,MaxTRESPerUser
      Name   Priority PreemptMode UsageFactor     MaxTRESPU
---------- ---------- ----------- ----------- -------------
    normal         10     cluster    1.000000        cpu=90
firstclass        100     cluster    1.000000

sinfo shows:

$ sinfo -o "%18n %9P %.11T %.4c %.8z %.6m %C"
HOSTNAMES          PARTITION       STATE CPUS    S:C:T MEMORY CPUS(A/I/O/T)
papr-res-compute08 pipeline         idle   40   2:10:2 385000 0/40/0/40
papr-res-compute09 pipeline         idle   40   2:10:2 385000 0/40/0/40
papr-res-compute08 bcl2fastq        idle   40   2:10:2 385000 0/40/0/40
papr-res-compute08 pathology        idle   40   2:10:2 385000 0/40/0/40
papr-res-compute09 pathology        idle   40   2:10:2 385000 0/40/0/40
papr-res-compute02 prod*           mixed   40   2:10:2 385000 36/4/0/40
papr-res-compute03 prod*           mixed   40   2:10:2 385000 36/4/0/40
papr-res-compute04 prod*           mixed   40   2:10:2 385000 36/4/0/40
papr-res-compute05 prod*           mixed   40   2:10:2 385000 36/4/0/40
papr-res-compute01 prod*       allocated   40   2:10:2 385000 40/0/0/40
papr-res-compute06 prod*       allocated   40   2:10:2 385000 40/0/0/40
papr-res-compute07 prod*       allocated   40   2:10:2 385000 40/0/0/40
stpr-res-compute01 debug            idle   40   2:10:2 385000 0/40/0/40
stpr-res-compute02 debug            idle   40   2:10:2 385000 0/40/0/40
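In case it helps with diagnosis, this is how I have been confirming that a running job really is attached to the QOS and has its CPUs accounted (a sketch; scontrol and sacctmgr are the standard tools here, the grep pattern is just my way of pulling out the relevant fields, and the job ID is one from the listing above):

$ scontrol show job 193417 | grep -Eo 'QOS=[^ ]+|NumCPUs=[^ ]+'
QOS=normal
NumCPUs=8

$ sacctmgr show assoc where user=kamarasi format=User,Account,QOS

Both show QOS normal, consistent with the squeue output, so the jobs do appear to be subject to the cpu=90 limit; it just isn't being enforced.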