Hi again,

Let me back up and explain what we are trying to do, maybe there’s a better way 
to do it...

We have three partitions set up in Slurm currently:

- ‘batch’ :  the regular everyday partition folks use to submit jobs; it is the default partition, and has a 5-hour maximum job runtime limit.
- ‘long’ :  intended for long-running jobs; there is no maximum runtime set, but we want to restrict how many jobs (and/or perhaps how many CPUs) a given user can run (use) in it concurrently.
- ‘scavenger’ :  intended for low-priority (most probably long-running) jobs; there is no maximum runtime set, but any job submitted to either of the other two partitions that needs resources in use by scavenger jobs should “bump” those jobs, which then go back into the queue to be re-run.

Here are their definitions in slurm.conf:

# PARTITIONS
PartitionName=batch Nodes=[nodelist] Default=YES DefMemPerCPU=2048 DefaultTime=01:00:00 MaxTime=05:00:00 PriorityTier=100 PreemptMode=off State=UP
PartitionName=long Nodes=[nodelist] Default=NO DefMemPerCPU=2048 DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=100 PreemptMode=off State=UP
PartitionName=scavenger Nodes=[nodelist] Default=NO DefMemPerCPU=2048 DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=10 PreemptMode=requeue State=UP

Considering the ‘long’ partition, what is the best way to limit how many jobs a given user can run in it concurrently, and/or how many CPUs a user’s jobs there can use at once?
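
For concreteness, here’s the sort of thing we’ve been sketching (untested; the QOS name is just a placeholder, and I’m assuming our Slurm version supports attaching a QOS to a partition): hang the per-user limits off a QOS and reference it from the partition:

# create a QOS carrying per-user limits (example values)
sacctmgr add qos longlimits
sacctmgr modify qos longlimits set MaxJobsPerUser=1 MaxTRESPerUser=cpu=16

# reference it from the partition definition in slurm.conf
PartitionName=long Nodes=[nodelist] Default=NO DefMemPerCPU=2048 DefaultTime=1-00:00:00 MaxTime=UNLIMITED PriorityTier=100 PreemptMode=off QOS=longlimits State=UP

# and enforce QOS limits as well as association limits
AccountingStorageEnforce=limits,qos

But we’re not sure whether that’s preferable to per-association MaxJobs/MaxTRES limits like the ones we tried below.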

As can be seen from my prior post (below), we are using job accounting via slurmdbd.

Thanks,
Will



From: Will Dennis
Sent: Friday, March 10, 2017 1:56 PM
To: slurm-dev
Cc: Lyn Gerner
Subject: Re: [slurm-dev] MaxJobs on association not being respected

I currently have this set in slurm.conf as:

AccountingStorageEnforce=limits
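
(My reading of the slurm.conf docs is that “limits” implies “associations”, so association-level limits such as MaxJobs should be covered by this setting; QOS-based limits would presumably need “qos” added as well, e.g. AccountingStorageEnforce=limits,qos.)
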
On Mar 10, 2017, at 1:53 PM, Lyn Gerner <schedulerqu...@gmail.com> wrote:

Hey Will,

Check to make sure you have selected the correct value for AccountingStorageEnforce. It sounds like that may be the issue.

Best of luck,
Lyn

---------- Forwarded message ----------
From: Will Dennis <wden...@nec-labs.com>
Date: Fri, Mar 10, 2017 at 8:30 AM
Subject: [slurm-dev] MaxJobs on association not being respected
To: slurm-dev <slurm-dev@schedmd.com>


Hi all,


Generally new to Slurm here, so please forgive any ignorance...


We have a test cluster (three compute nodes) running Slurm 16.05.4, with the ‘multifactor’ priority plugin in use. We have set up slurmdbd, and have created associations for the users on partitions of the cluster, as follows:


[root@ml43 ~]# sacctmgr show associations

   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
ml-cluster       root                               1                                                                                                                                                                 normal
ml-cluster       root       root                    1                                                                                                                                                                 normal
ml-cluster         ml                               1                                                                                                                                                                 normal
ml-cluster         ml       alex  scavenger         1                                                                                                                                                                 normal
ml-cluster         ml       alex      batch         1                                                                                                                                                                 normal
ml-cluster         ml       alex       long         1                                                                 1                                                                                               normal
ml-cluster         ml       iain  scavenger         1                                                                                                                                                                 normal
ml-cluster         ml       iain      batch         1                                                                                                                                                                 normal
ml-cluster         ml       iain       long         1                                                                                                                                                                 normal


As you may notice, we have set a “MaxJobs” limit of “1” for the ‘alex’ user on the ‘long’ partition. What we want is to enforce a maximum of one running job at a time per user on the ‘long’ partition. However, when user ‘alex’ submitted a number of jobs to this partition, all of them ran:

[root@ml43 ~]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               324      long   tmp.sh     alex PD       0:00      1 (Resources)
               321      long   tmp.sh     alex  R       1:56      1 ml46
               323      long   tmp.sh     alex  R       0:33      1 ml53
               322      long   tmp.sh     alex  R       0:36      1 ml48
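
(My expectation was that, with the limit enforced, the extra jobs would instead sit pending with a reason like “AssocMaxJobsLimit” rather than “Resources”.)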

From the output of “sshare” we verified the jobs were charged to the correct association:

[root@ml43 ~]# sshare -am
             Account       User    Partition  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ------------ ---------- ----------- ----------- ------------- ----------
root                                                       1.000000        7977      1.000000   0.500000
 root                      root                       1    0.500000           0      0.000000   1.000000
 ml                                                   1    0.500000        7977      1.000000   0.250000
  ml                       alex    scavenger          1    0.083333           0      0.166667   0.250000
  ml                       alex        batch          1    0.083333           0      0.166667   0.250000
  ml                       alex         long          1    0.083333        7977      1.000000   0.000244
  ml                       iain    scavenger          1    0.083333           0      0.166667   0.250000
  ml                       iain        batch          1    0.083333           0      0.166667   0.250000
  ml                       iain         long          1    0.083333           0      0.166667   0.250000

Why doesn’t the “MaxJobs” limit prevent the running of more than one job at a 
time for this user?
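
For completeness, the live setting and the stored limit can be double-checked with (format fields are just the ones relevant here):

scontrol show config | grep AccountingStorageEnforce
sacctmgr show assoc where user=alex format=cluster,account,user,partition,maxjobs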

Thanks,
Will