I started a thread on understanding QOS, but quickly realised I had made a
fundamental error in my configuration. I fixed that problem last week.
(ref:
https://groups.google.com/forum/#!msg/slurm-devel/dqL30WwmrmU/SoOMHmRVDAAJ )

Despite these changes, the issue remains, so I would like to ask again,
with more background information and more analysis.


Desired scenario: any one user can only ever have running jobs adding up to
90 CPUs at a time. They can submit requests for more than this, but their
running jobs will max out at 90 CPUs and the rest of their jobs will wait
in the queue. A CPU is defined here as a thread, on systems that have 2
sockets, each with 10 cores, each core with 2 threads (ie, cat
/proc/cpuinfo on any node reports 40 CPUs, so we configured Slurm to use
40 CPUs per node).
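
For reference, the way we have tried to express that 90-CPU cap is as
MaxTRESPerUser on the "normal" QOS (the sacctmgr output further down shows
it); the command we used was along the lines of:

sacctmgr modify qos where name=normal set MaxTRESPerUser=cpu=90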

Current scenario: users are getting every CPU they have requested, blocking
other users from the partitions.

Our users are able to use 40 CPUs per node, so we know that every thread is
available as a consumable resource, as we wanted.

When I use sinfo -o %C, the per-CPU utilization figures confirm that the
thread is being used as the CPU measure.
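
(%C reports CPU counts as Allocated/Idle/Other/Total, so a fully busy node
shows all 40 threads as allocated; on one of the fully allocated prod nodes
it looks something like:

$ sinfo -n papr-res-compute01 -o "%n %C"
HOSTNAMES CPUS(A/I/O/T)
papr-res-compute01 40/0/0/40

which matches the full sinfo output at the end of this mail.)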

Yet, as noted above, when I run squeue, I see that users have jobs
running with more than 90 CPUs in total.

Here is a squeue listing that shows allocated CPUs. Note that both users
with running jobs have more than 90 CPUs (threads) each:

$ squeue -o"%.4C %8q %.8i %.9P %.8j %.8u %.8T %.10M %.9l"
CPUS QOS         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI
   8 normal     193424      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
   8 normal     193423      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
   8 normal     193422      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00

  20 normal     189360      prod MuVd_WGS lij@pete  RUNNING   23:49:15 6-00:00:00
  20 normal     189353      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
  20 normal     189354      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
  20 normal     189356      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
  20 normal     189358      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
   8 normal     193417      prod    Halo3 kamarasi  RUNNING       0:01 1-00:00:00
   8 normal     193416      prod    Halo3 kamarasi  RUNNING       0:18 1-00:00:00
   8 normal     193415      prod    Halo3 kamarasi  RUNNING       0:19 1-00:00:00
   8 normal     193414      prod    Halo3 kamarasi  RUNNING       0:47 1-00:00:00
   8 normal     193413      prod    Halo3 kamarasi  RUNNING       2:08 1-00:00:00
   8 normal     193412      prod    Halo3 kamarasi  RUNNING       2:09 1-00:00:00
   8 normal     193411      prod    Halo3 kamarasi  RUNNING       3:24 1-00:00:00
   8 normal     193410      prod    Halo3 kamarasi  RUNNING       5:04 1-00:00:00
   8 normal     193409      prod    Halo3 kamarasi  RUNNING       5:06 1-00:00:00
   8 normal     193408      prod    Halo3 kamarasi  RUNNING       7:40 1-00:00:00
   8 normal     193407      prod    Halo3 kamarasi  RUNNING      10:48 1-00:00:00
   8 normal     193406      prod    Halo3 kamarasi  RUNNING      10:50 1-00:00:00
   8 normal     193405      prod    Halo3 kamarasi  RUNNING      11:34 1-00:00:00
   8 normal     193404      prod    Halo3 kamarasi  RUNNING      12:00 1-00:00:00
   8 normal     193403      prod    Halo3 kamarasi  RUNNING      12:10 1-00:00:00
   8 normal     193402      prod    Halo3 kamarasi  RUNNING      12:21 1-00:00:00
   8 normal     193401      prod    Halo3 kamarasi  RUNNING      12:40 1-00:00:00
   8 normal     193400      prod    Halo3 kamarasi  RUNNING      17:02 1-00:00:00
   8 normal     193399      prod    Halo3 kamarasi  RUNNING      21:03 1-00:00:00
   8 normal     193396      prod    Halo3 kamarasi  RUNNING      22:01 1-00:00:00
   8 normal     193394      prod    Halo3 kamarasi  RUNNING      23:40 1-00:00:00
   8 normal     193393      prod    Halo3 kamarasi  RUNNING      25:21 1-00:00:00
   8 normal     193390      prod    Halo3 kamarasi  RUNNING      25:58 1-00:00:00
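
Adding those up from the listing: kamarasi has 23 running jobs at 8 CPUs
each, ie 23 x 8 = 184 CPUs, and lij@pete has 5 running jobs at 20 CPUs
each, ie 5 x 20 = 100 CPUs. Both are well past the intended 90-CPU cap.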


Yet when I run squeue with a format that shows Sockets/Cores/Threads as S:C:T:
$ squeue -o "%z %q %.8i %.9P %.8j %.8u %.8T %.10M %.9l"

S:C:T QOS    JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI
*:*:* normal   193441      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
*:*:* normal   193440      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
*:*:* normal   193439      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
....

ie, no CPUs ("threads") have been requested?

How can this be?
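
My best guess is that %z reports the requested S:C:T constraint (which we
never specify, hence *:*:*) rather than the CPUs actually allocated; to see
what a job really got, I believe the check is something like

$ scontrol show job 193417 | grep -E 'NumCPUs|TRES'

and then looking at NumCPUs= and TRES=cpu=... in the output.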

The sbatch files in question look like

 #!/bin/bash
 #SBATCH --nodes=1
 #SBATCH --ntasks=8
srun -n 1 <command>

and

 #!/bin/bash
 #SBATCH --nodes=1
 #SBATCH --ntasks=20
srun -n 1 <command>

Ah. Is this the problem? Neither user has requested any CPUs, only tasks.
The docs for sbatch and srun don't mention a way to explicitly ask for
threads-as-CPUs, but there is --cpus-per-task, which we've never used
because the default is 1, which is what we wanted. So has the
accounting/priority/scheduling system simply not accounted for that?

Nope. I tried four tests:

1. #SBATCH --cpus-per-task=1
2. srun -n 1 -c 1 <command>
3. #SBATCH --cpus-per-task=1 AND srun -n 1 -c 1 <command>
4. Setting the environment variable SLURM_CPUS_PER_TASK=1

None of them returned any values for S:C:T. I didn't continue with further
permutations because I was getting the feeling that this wasn't the problem.
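
For the record, test 3 looked roughly like this (the actual command is
elided as before):

 #!/bin/bash
 #SBATCH --nodes=1
 #SBATCH --ntasks=8
 #SBATCH --cpus-per-task=1
 srun -n 1 -c 1 <command>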

Now I'm at a loss. Is using Slurm with threads as CPUs itself the problem,
ie is it just not designed to work that way?

So, the question remains: how do I effectively limit people to running no
more than X CPUs' worth of jobs simultaneously? Or, alternatively, what
have I done wrong in setting up QOS that allows this to happen?

The relevant configuration details are below.

slurm.conf defines:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU

AccountingStorageEnforce=qos

NodeName=stpr-res-compute[01-02] CPUs=40 RealMemory=385000 Sockets=2
CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
NodeName=papr-res-compute[01-09] CPUs=40 RealMemory=385000 Sockets=2
CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
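
In case it's useful, these values can be cross-checked against what
slurmctld actually loaded with something like:

$ scontrol show config | grep -E 'SelectType|AccountingStorageEnforce'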

NOTES: we chose QOS because MaxTRESPerUser isn't available on the Account
object, which would otherwise have let us use "limits". Assigning GrpTRES
on a per-association basis would require touching/managing each
association individually. Not impossible, but clunky compared to using QOS
on partitions.
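
For comparison, the per-association alternative would have looked something
like the following, repeated for every user association (with <username>
standing in for each real user):

sacctmgr modify user where name=<username> set GrpTRES=cpu=90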

sacctmgr defines:

All human users belong to the QOS "normal":

sacctmgr show qos format=Name,Priority,PreemptMode,UsageFactor,MaxTRESPerUser
      Name   Priority PreemptMode UsageFactor     MaxTRESPU
---------- ---------- ----------- ----------- -------------
    normal         10     cluster    1.000000        cpu=90
firstclass        100     cluster    1.000000


sinfo shows:

$ sinfo -o "%18n %9P %.11T %.4c %.8z %.6m %C"
HOSTNAMES          PARTITION       STATE CPUS    S:C:T MEMORY CPUS(A/I/O/T)
papr-res-compute08 pipeline         idle   40   2:10:2 385000 0/40/0/40
papr-res-compute09 pipeline         idle   40   2:10:2 385000 0/40/0/40
papr-res-compute08 bcl2fastq        idle   40   2:10:2 385000 0/40/0/40
papr-res-compute08 pathology        idle   40   2:10:2 385000 0/40/0/40
papr-res-compute09 pathology        idle   40   2:10:2 385000 0/40/0/40
papr-res-compute02 prod*           mixed   40   2:10:2 385000 36/4/0/40
papr-res-compute03 prod*           mixed   40   2:10:2 385000 36/4/0/40
papr-res-compute04 prod*           mixed   40   2:10:2 385000 36/4/0/40
papr-res-compute05 prod*           mixed   40   2:10:2 385000 36/4/0/40
papr-res-compute01 prod*       allocated   40   2:10:2 385000 40/0/0/40
papr-res-compute06 prod*       allocated   40   2:10:2 385000 40/0/0/40
papr-res-compute07 prod*       allocated   40   2:10:2 385000 40/0/0/40
stpr-res-compute01 debug            idle   40   2:10:2 385000 0/40/0/40
stpr-res-compute02 debug            idle   40   2:10:2 385000 0/40/0/40



------
The most dangerous phrase in the language is, "We've always done it this
way."

- Grace Hopper
