Hi,

After some fun incidents with accidental monopolization of the cluster, we
decided to enforce some QOS.

I have read the documentation. So far, the only related thing I had done
during setup was to assign "share" values when I created each
association.

The cluster had a QOS called normal.

I adjusted normal to have MaxTRESPerUser=cpu=90 and created a new QOS called
firstclass for a special set of associations that need better access.
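
The changes described above correspond roughly to the following sacctmgr
calls. This is only a sketch: the exact invocations are an assumption, and
`run` is a hypothetical helper that prints each command instead of
executing it unless DRY_RUN=0 is set.

```shell
# Hypothetical helper: print the command when DRY_RUN is 1 (the default),
# execute it otherwise.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Cap each user of the existing "normal" QOS at 90 CPUs.
run sacctmgr modify qos normal set MaxTRESPerUser=cpu=90

# Create the higher-priority QOS, then set its priority.
run sacctmgr add qos firstclass
run sacctmgr modify qos firstclass set Priority=100
```

With the default DRY_RUN=1 nothing is changed on the cluster; the script
only echoes what it would run.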

sacctmgr show qos format=Name,Priority,PreemptMode,UsageFactor,MaxTRESPerUser

      Name   Priority PreemptMode UsageFactor     MaxTRESPU
---------- ---------- ----------- ----------- -------------
    normal         10     cluster    1.000000        cpu=90
firstclass        100     cluster    1.000000


sacctmgr show assoc format=Cluster,Account,User,Partition,QOS

   Cluster    Account       User  Partition                  QOS
---------- ---------- ---------- ---------- --------------------
  rosalind        dev                                     normal
  rosalind        dev  test_user      debug               normal
  rosalind        dev  test_user  bcl2fastq               normal
  rosalind   pipeline                                 firstclass
  rosalind  pathology                                 firstclass
  rosalind  pathology     bioinf   pipeline           firstclass
  rosalind  pathology     bioinf  bcl2fastq           firstclass
  rosalind  pathology     bioinf  pathology           firstclass
  rosalind   research                                 firstclass
  rosalind   research     bioinf  pathology           firstclass
  rosalind   research     bioinf   pipeline           firstclass
  rosalind   research     bioinf  bcl2fastq           firstclass
  rosalind   reynolds                                     normal
  rosalind   reynolds ysun@pete+       prod               normal
  rosalind   reynolds ysun@pete+      debug               normal
  rosalind      users                                     normal
  rosalind       bacg                                     normal
  rosalind       bacg akumar@pe+      debug               normal
  rosalind       bacg akumar@pe+       prod               normal
  rosalind       bacg apapenfus+      debug               normal
  rosalind       bacg apapenfus+       prod               normal
  rosalind       bacg dgoode@pe+      debug               normal
  rosalind       bacg dgoode@pe+       prod               normal
  rosalind       bacg ivergara@+      debug               normal
  rosalind       bacg ivergara@+       prod               normal
  rosalind       bacg jmarkham@+      debug               normal

etc


Then I assigned firstclass to the associations that need it (on different
partitions), adjusted slurm.conf accordingly (confirmed that
PriorityType was multifactor, set PriorityWeightQOS=1000), distributed it to
all nodes, restarted slurmctld and ran scontrol reconfigure.
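
A quick way to confirm that the distributed file really carries the two
settings mentioned above is sketched below. The function name is
hypothetical and the path in the example comment is an assumption; the
pattern only looks at the two keys named in this post.

```shell
# Hypothetical helper: print the PriorityType and PriorityWeightQOS lines
# found in a slurm.conf-style file, skipping commented-out lines.
# Keys are matched case-insensitively.
show_priority_settings() {
    grep -E -i '^[[:space:]]*(PriorityType|PriorityWeightQOS)[[:space:]]*=' "$1"
}

# Example (the path is an assumption; use wherever slurm.conf actually lives):
# show_priority_settings /etc/slurm/slurm.conf
```

This only inspects the file on disk. `scontrol show config` prints the
values the running slurmctld has actually loaded, which is the more
direct check after a restart.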



Yet the CPU limit doesn't seem to have taken effect: almost immediately,
someone used more than 90 CPUs.

What have I done wrong? I re-read the documentation this morning, but I can't
see anything that might prevent the QOS from being applied, except perhaps
a QOS hierarchy issue. However, I've only set up the two QOSes, and they
apply to distinct associations and partitions.

cheers
L.



------
The most dangerous phrase in the language is, "We've always done it this
way."

- Grace Hopper
