Hi again,

Let me back up and explain what we are trying to do; maybe there’s a better way to do it...

We have three partitions set up in Slurm currently:

- ‘batch’ : this is the regular everyday partition folks can use to submit jobs; it is set as the default partition, and has a 5-hour maximum job runtime limit.
- ‘long’ : this partition is designed to be used for long-running jobs; there is no max job time-limit set, but we want to restrict how many jobs (and/or maybe CPUs) a given user can run (use) concurrently.
- ‘scavenger’ : this partition is designed to be used for low-priority (most probably long-running) jobs; there is no max job time-limit set, but any job submitted to the prior two partitions that needs resources in use by scavenger jobs should “bump” those jobs, which will go back into the queue to be re-run.

Here are their definitions in slurm.conf:

    # PARTITIONS
    PartitionName=batch Nodes=[nodelist] Default=YES DefMemPerCPU=2048 DefaultTime=01:00:00 MaxTime=05:00:00 PriorityTier=100 PreemptMode=off State=UP
    PartitionName=long Nodes=[nodelist] Default=NO DefMemPerCPU=2048 DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=100 PreemptMode=off State=UP
    PartitionName=scavenger Nodes=[nodelist] Default=NO DefMemPerCPU=2048 DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=10 PreemptMode=requeue State=UP

Considering the ‘long’ partition, what is the best way to limit how many jobs a user can have running on it concurrently, or to limit the number of CPUs used? As can be seen from my prior post, we are utilizing job accounting via slurmdbd.

Thanks,
Will

From: Will Dennis
Sent: Friday, March 10, 2017 1:56 PM
To: slurm-dev
Cc: Lyn Gerner
Subject: Re: [slurm-dev] MaxJobs on association not being respected

I currently have this set in slurm.conf as:

    AccountingStorageEnforce=limits

On Mar 10, 2017, at 1:53 PM, Lyn Gerner <schedulerqu...@gmail.com> wrote:

Hey Will,

Check to make sure you have selected the correct value for AccountingStorageEnforce. Sounds like it may be that.

Best of luck,
Lyn

---------- Forwarded message ----------
From: Will Dennis <wden...@nec-labs.com>
Date: Fri, Mar 10, 2017 at 8:30 AM
Subject: [slurm-dev] MaxJobs on association not being respected
To: slurm-dev <slurm-dev@schedmd.com>

Hi all,

Generally new to Slurm here, so please forgive any ignorance...

We have a test cluster (three compute nodes) running Slurm 16.05.4, with the ‘multifactor’ scheduler in use. We have set up slurmdbd, and have set up associations for the users on partitions of the cluster, as follows:

    [root@ml43 ~]# sacctmgr show associations
       Cluster    Account       User  Partition     Share MaxJobs                  QOS
    ---------- ---------- ---------- ---------- --------- ------- --------------------
    ml-cluster       root                                       1                       normal
    ml-cluster       root       root                    1                       normal
    ml-cluster         ml                                       1                       normal
    ml-cluster         ml       alex  scavenger         1                       normal
    ml-cluster         ml       alex      batch         1                       normal
    ml-cluster         ml       alex       long         1       1               normal
    ml-cluster         ml       iain  scavenger         1                       normal
    ml-cluster         ml       iain      batch         1                       normal
    ml-cluster         ml       iain       long         1                       normal

(Columns with no values set are omitted: GrpJobs, GrpTRES, GrpSubmit, GrpWall, GrpTRESMins, MaxTRES, MaxTRESPerNode, MaxSubmit, MaxWall, MaxTRESMins, Def QOS, GrpTRESRunMin.)

As you may notice, we have set up a “MaxJobs” limit of “1” for the ‘alex’ user on the ‘long’ partition. What we want to do is enforce a maximum of one job running at a time per user for the ‘long’ partition.

However, when the user ‘alex’ submitted a number of jobs to this partition, several of them ran at the same time (the one pending job is waiting on resources, not on the limit):

    [root@ml43 ~]# squeue
    JOBID PARTITION   NAME USER ST TIME NODES NODELIST(REASON)
      324      long tmp.sh alex PD 0:00     1 (Resources)
      321      long tmp.sh alex  R 1:56     1 ml46
      323      long tmp.sh alex  R 0:33     1 ml53
      322      long tmp.sh alex  R 0:36     1 ml48

From the output of “sshare” we verified the right queue got the jobs:

    [root@ml43 ~]# sshare -am
    Account              User       Partition     RawShares  NormShares    RawUsage  EffectvUsage  FairShare
    -------------------- ---------- ------------ ---------- ----------- ----------- ------------- ----------
    root                                                       1.000000        7977      1.000000   0.500000
    root                 root                             1    0.500000           0      0.000000   1.000000
    ml                                                    1    0.500000        7977      1.000000   0.250000
    ml                   alex       scavenger             1    0.083333           0      0.166667   0.250000
    ml                   alex       batch                 1    0.083333           0      0.166667   0.250000
    ml                   alex       long                  1    0.083333        7977      1.000000   0.000244
    ml                   iain       scavenger             1    0.083333           0      0.166667   0.250000
    ml                   iain       batch                 1    0.083333           0      0.166667   0.250000
    ml                   iain       long                  1    0.083333           0      0.166667   0.250000

Why doesn’t the “MaxJobs” limit prevent the running of more than one job at a time for this user?

Thanks,
Will
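
For anyone wanting to check this symptom without eyeballing the listing, here is a minimal sketch that counts running jobs per user and partition from squeue output and flags anything over a given limit. It assumes squeue's default column order as in the listing above (partition in column 2, user in column 4, state in column 5) and job names without spaces. The function names count_running_jobs and over_limit are hypothetical helpers, not part of Slurm.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration, not Slurm commands.

# Read default-format squeue output on stdin and print one line per
# user/partition pair: "<user> <partition> <number of running jobs>".
# Assumes column 2 = PARTITION, column 4 = USER, column 5 = ST, and that
# job names contain no spaces (otherwise the columns would shift).
count_running_jobs() {
    awk 'NR > 1 && $5 == "R" { n[$4 " " $2]++ }
         END { for (k in n) print k, n[k] }' | sort
}

# Print only the user/partition pairs running more than LIMIT jobs.
# LIMIT is a plain number supplied by the caller, e.g. the MaxJobs value.
over_limit() {
    limit=$1
    count_running_jobs | awk -v limit="$limit" '$3 > limit'
}

# Intended use on a cluster (not run here):
#   squeue | over_limit 1
```

Fed the squeue listing above, over_limit 1 would report "alex long 3": three running jobs where the association's MaxJobs is 1. This only confirms that the limit is not being applied; it says nothing about why.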