Hi Lachlan,

You mentioned your slurm.conf has:
AccountingStorageEnforce=qos

The "qos" restriction only enforces that a user is authorized to use a
particular qos (in the qos string of the association in the slurm
database).  To enforce limits, you need to also use limits.  If you want to
prevent partial jobs from running and potentially being killed when a
resource runs out (only applicable for certain limits), you might also
consider setting "safe", e.g.,

AccountingStorageEnforce=limits,safe,qos

http://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStorageEnforce
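
In case it's useful, a minimal sketch of rolling that change out (this
assumes a systemd-managed slurmctld and is illustrative only; adjust for
your deployment):

  # slurm.conf
  AccountingStorageEnforce=limits,safe,qos

  # pick the change up on the controller; a full slurmctld restart is the
  # safe option for this parameter ("scontrol reconfigure" may not apply it)
  systemctl restart slurmctld

  # confirm what the controller is actually enforcing
  scontrol show config | grep AccountingStorageEnforce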

I hope that helps,
Doug

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacob...@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


On Sun, Oct 2, 2016 at 9:08 PM, Lachlan Musicman <data...@gmail.com> wrote:

> I started a thread on understanding QOS, but quickly realised I had made
> a fundamental error in my configuration. I fixed that problem last week.
> (ref: https://groups.google.com/forum/#!msg/slurm-devel/dqL30WwmrmU/SoOMHmRVDAAJ )
>
> Despite these changes, the issue remains, so I would like to ask again,
> with more background information and more analysis.
>
>
> Desired scenario: any one user can only ever have running jobs adding up
> to 90 CPUs at a time. They can submit requests for more than this, but
> their running jobs will max out at 90 CPUs and the rest of their jobs
> will wait in the queue. A "CPU" here means a thread, on systems with 2
> sockets, each with 10 cores, each core with 2 threads (i.e., cat
> /proc/cpuinfo on any node reports 40 CPUs, so we configured Slurm to use
> all 40).
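>
> (For reference, slurmd -C on a node prints the hardware line as Slurm
> detects it; the output below is illustrative rather than copied from one
> of our nodes:
>
>   $ slurmd -C
>   NodeName=papr-res-compute01 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=385000
> )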
>
> Current scenario: users are getting every CPU they have requested,
> blocking other users from the partitions.
>
> Our users are able to use 40 CPUs per node, so we know that every thread
> is available as a consumable resource, as we wanted.
>
> When I use sinfo -o %C, the per-CPU utilization figures reflect that the
> thread is being used as the CPU measure.
>
> Yet, as noted above, when I do an squeue, I see that users have jobs
> running with more than 90 CPUs in total.
>
> Here is an squeue that shows allocated CPUs. Note that both users with
> running jobs have more than 90 CPUs (threads) each:
>
> $ squeue -o"%.4C %8q %.8i %.9P %.8j %.8u %.8T %.10M %.9l"
> CPUS QOS         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI
>    8 normal     193424      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>    8 normal     193423      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>    8 normal     193422      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
>
>   20 normal     189360      prod MuVd_WGS lij@pete  RUNNING   23:49:15 6-00:00:00
>   20 normal     189353      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>   20 normal     189354      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>   20 normal     189356      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>   20 normal     189358      prod MuVd_WGS lij@pete  RUNNING 4-18:43:26 6-00:00:00
>    8 normal     193417      prod    Halo3 kamarasi  RUNNING       0:01 1-00:00:00
>    8 normal     193416      prod    Halo3 kamarasi  RUNNING       0:18 1-00:00:00
>    8 normal     193415      prod    Halo3 kamarasi  RUNNING       0:19 1-00:00:00
>    8 normal     193414      prod    Halo3 kamarasi  RUNNING       0:47 1-00:00:00
>    8 normal     193413      prod    Halo3 kamarasi  RUNNING       2:08 1-00:00:00
>    8 normal     193412      prod    Halo3 kamarasi  RUNNING       2:09 1-00:00:00
>    8 normal     193411      prod    Halo3 kamarasi  RUNNING       3:24 1-00:00:00
>    8 normal     193410      prod    Halo3 kamarasi  RUNNING       5:04 1-00:00:00
>    8 normal     193409      prod    Halo3 kamarasi  RUNNING       5:06 1-00:00:00
>    8 normal     193408      prod    Halo3 kamarasi  RUNNING       7:40 1-00:00:00
>    8 normal     193407      prod    Halo3 kamarasi  RUNNING      10:48 1-00:00:00
>    8 normal     193406      prod    Halo3 kamarasi  RUNNING      10:50 1-00:00:00
>    8 normal     193405      prod    Halo3 kamarasi  RUNNING      11:34 1-00:00:00
>    8 normal     193404      prod    Halo3 kamarasi  RUNNING      12:00 1-00:00:00
>    8 normal     193403      prod    Halo3 kamarasi  RUNNING      12:10 1-00:00:00
>    8 normal     193402      prod    Halo3 kamarasi  RUNNING      12:21 1-00:00:00
>    8 normal     193401      prod    Halo3 kamarasi  RUNNING      12:40 1-00:00:00
>    8 normal     193400      prod    Halo3 kamarasi  RUNNING      17:02 1-00:00:00
>    8 normal     193399      prod    Halo3 kamarasi  RUNNING      21:03 1-00:00:00
>    8 normal     193396      prod    Halo3 kamarasi  RUNNING      22:01 1-00:00:00
>    8 normal     193394      prod    Halo3 kamarasi  RUNNING      23:40 1-00:00:00
>    8 normal     193393      prod    Halo3 kamarasi  RUNNING      25:21 1-00:00:00
>    8 normal     193390      prod    Halo3 kamarasi  RUNNING      25:58 1-00:00:00
>
>
> Yet when I run squeue that shows Sockets/Cores/Threads as S/C/T:
> squeue -o "%z %q %.8i %.9P %.8j %.8u %.8T %.10M %.9l"
>
> S:C:T QOS    JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI
> *:*:* normal   193441      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
> *:*:* normal   193440      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
> *:*:* normal   193439      prod    Halo3 kamarasi  PENDING       0:00 1-00:00:00
> ....
>
> i.e., no CPUs ("threads") have been requested?
>
> How can this be?
>
> The sbatch files in question look like:
>
>   #!/bin/bash
>   #SBATCH --nodes=1
>   #SBATCH --ntasks=8
>   srun -n 1 <command>
>
> and
>
>   #!/bin/bash
>   #SBATCH --nodes=1
>   #SBATCH --ntasks=20
>   srun -n 1 <command>
>
> Ah. Is this the problem? Neither user has requested any CPUs, only tasks.
> The docs for sbatch and srun don't mention a way to explicitly ask for
> threads-as-CPUs, but there is a --cpus-per-task option which we've never
> used, because the default is 1, which is what we wanted. So has the
> accounting/priority/scheduling system simply not accounted for that?
>
> Nope. I ran four tests with the following:
>
> 1. #SBATCH --cpus-per-task=1
> 2. srun -n 1 -c 1 <command>
> 3. #SBATCH --cpus-per-task=1 AND srun -n 1 -c 1 <command>
> 4. Setting the environment variable SLURM_CPUS_PER_TASK=1
>
> None of them returned any values for S:C:T. I didn't continue with the
> permutations because I got the feeling that this wasn't the problem.
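>
> (As a cross-check, something like the following, with a real job ID in
> place of the <jobid> placeholder, shows what the controller actually
> allocated to a job:
>
>   scontrol show job <jobid> | grep -E 'NumNodes|NumCPUs|NumTasks'
>   sacct -j <jobid> --format=JobID,ReqCPUS,AllocCPUS,State
> )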
>
> Now I'm at a loss. Is it that using SLURM with threads as CPUs is the
> problem - it's not designed to work like that?
>
> So, the question remains: how do I effectively limit people from running
> more than X CPUs' worth of jobs simultaneously? Or, alternatively, what
> have I done wrong in setting up QOS such that this can happen?
>
> The relevant configuration details are below.
>
> Slurm conf defines:
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU
>
> AccountingStorageEnforce=qos
>
> NodeName=stpr-res-compute[01-02] CPUs=40 RealMemory=385000 Sockets=2
> CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
> NodeName=papr-res-compute[01-09] CPUs=40 RealMemory=385000 Sockets=2
> CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
>
> NOTES: we chose QOS because MaxTRESPerUser isn't available on the Account
> object, which would otherwise have let us rely on "limits". Assigning
> GrpTRES on a per-association basis would require touching/managing each
> association (a sketch of that alternative is below). Not impossible, but
> clunky compared to using a QOS on the partitions.
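>
> For illustration only (the user and account names here are placeholders),
> the per-association alternative would be something along the lines of:
>
>   sacctmgr modify user where name=alice account=prod set GrpTRES=cpu=90
>
> repeated for every user/account association, which is what we wanted to
> avoid.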
>
> sacctmgr defines:
>
> All human users belong to QOS normal, sacctmgr show qos:
>
> sacctmgr show qos format=Name,Priority,PreemptMode,UsageFactor,MaxTRESPerUser
>       Name   Priority PreemptMode UsageFactor     MaxTRESPU
> ---------- ---------- ----------- ----------- -------------
>     normal         10     cluster    1.000000        cpu=90
> firstclass        100     cluster    1.000000
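>
> (That cpu=90 value would have been set with something along the lines of
> the following, shown here just for context:
>
>   sacctmgr modify qos where name=normal set MaxTRESPerUser=cpu=90
> )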
>
>
> sinfo shows:
>
> $ sinfo -o "%18n %9P %.11T %.4c %.8z %.6m %C"
> HOSTNAMES          PARTITION       STATE CPUS    S:C:T MEMORY CPUS(A/I/O/T)
> papr-res-compute08 pipeline         idle   40   2:10:2 385000 0/40/0/40
> papr-res-compute09 pipeline         idle   40   2:10:2 385000 0/40/0/40
> papr-res-compute08 bcl2fastq        idle   40   2:10:2 385000 0/40/0/40
> papr-res-compute08 pathology        idle   40   2:10:2 385000 0/40/0/40
> papr-res-compute09 pathology        idle   40   2:10:2 385000 0/40/0/40
> papr-res-compute02 prod*           mixed   40   2:10:2 385000 36/4/0/40
> papr-res-compute03 prod*           mixed   40   2:10:2 385000 36/4/0/40
> papr-res-compute04 prod*           mixed   40   2:10:2 385000 36/4/0/40
> papr-res-compute05 prod*           mixed   40   2:10:2 385000 36/4/0/40
> papr-res-compute01 prod*       allocated   40   2:10:2 385000 40/0/0/40
> papr-res-compute06 prod*       allocated   40   2:10:2 385000 40/0/0/40
> papr-res-compute07 prod*       allocated   40   2:10:2 385000 40/0/0/40
> stpr-res-compute01 debug            idle   40   2:10:2 385000 0/40/0/40
> stpr-res-compute02 debug            idle   40   2:10:2 385000 0/40/0/40
>
>
>
> ------
> The most dangerous phrase in the language is, "We've always done it this
> way."
>
> - Grace Hopper
>
