Hi,

After some fun incidents with accidental monopolization of the cluster, we decided to enforce some QOS.
I read the documentation. So far, the only thing I've done in the setup that comes close is assigning "share" values when I set up each association. The cluster already had a QOS called normal. I adjusted normal to have MaxTRESPerUser=cpu=90 and created a new QOS called firstclass for a special set of associations that need better access.

sacctmgr show qos format=Name,Priority,PreemptMode,UsageFactor,MaxTRESPerUser
      Name   Priority PreemptMode UsageFactor     MaxTRESPU
---------- ---------- ----------- ----------- -------------
    normal         10     cluster    1.000000        cpu=90
firstclass        100     cluster    1.000000

sacctmgr show assoc format=Cluster,Account,User,Partition,QOS
   Cluster    Account       User  Partition                  QOS
---------- ---------- ---------- ---------- --------------------
  rosalind        dev                                     normal
  rosalind        dev  test_user      debug               normal
  rosalind        dev  test_user  bcl2fastq               normal
  rosalind   pipeline                                 firstclass
  rosalind  pathology                                 firstclass
  rosalind  pathology     bioinf   pipeline           firstclass
  rosalind  pathology     bioinf  bcl2fastq           firstclass
  rosalind  pathology     bioinf  pathology           firstclass
  rosalind   research                                 firstclass
  rosalind   research     bioinf  pathology           firstclass
  rosalind   research     bioinf   pipeline           firstclass
  rosalind   research     bioinf  bcl2fastq           firstclass
  rosalind   reynolds                                     normal
  rosalind   reynolds ysun@pete+       prod               normal
  rosalind   reynolds ysun@pete+      debug               normal
  rosalind      users                                     normal
  rosalind       bacg                                     normal
  rosalind       bacg akumar@pe+      debug               normal
  rosalind       bacg akumar@pe+       prod               normal
  rosalind       bacg apapenfus+      debug               normal
  rosalind       bacg apapenfus+       prod               normal
  rosalind       bacg dgoode@pe+      debug               normal
  rosalind       bacg dgoode@pe+       prod               normal
  rosalind       bacg ivergara@+      debug               normal
  rosalind       bacg ivergara@+       prod               normal
  rosalind       bacg jmarkham@+      debug               normal
etc.

Then I assigned firstclass to the associations that need it (on different partitions), adjusted slurm.conf accordingly (confirmed that PriorityType was multifactor, set PriorityWeightQOS=1000), distributed it to all nodes, restarted slurmctld, and ran scontrol reconfigure.
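
For reference, the priority settings described above would look roughly like this in slurm.conf. This is a sketch reconstructed from the description, not a copy of the actual file. The commented-out AccountingStorageEnforce line is not mentioned anywhere above; it is included only as an assumption worth checking, since the slurm.conf documentation ties enforcement of association and QOS limits to that option including "limits".

```
# slurm.conf (sketch; only the two priority values below come from the post)
PriorityType=priority/multifactor
PriorityWeightQOS=1000

# Assumption, not confirmed for this cluster: limits such as MaxTRESPerUser
# are only enforced when AccountingStorageEnforce includes "limits".
#AccountingStorageEnforce=limits,qos
```

The values currently in effect can be inspected with scontrol show config.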
Yet the CPU limit doesn't seem to have taken effect: almost immediately, someone used more than 90 CPUs.

What have I done wrong? I re-read the documentation this morning, but I can't see anything that might be preventing the QOS from being applied, except perhaps a QOS hierarchy issue. However, I've only set up the two QOSes, and they apply to distinct associations and partitions.

cheers
L.

------
The most dangerous phrase in the language is, "We've always done it this way."
- Grace Hopper