See below...

> On Apr 24, 2023, at 1:55 PM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:
>
> On 24-04-2023 18:33, Hoot Thompson wrote:
>> In my reading of the Slurm documentation, it seems that exceeding the limits
>> set in GrpTRESMins should result in terminating a running job. However, in
>> testing this, the 'current value' of the GrpTRESMins only updates upon job
>> completion and is not updated as the job progresses. Therefore jobs aren't
>> being stopped. On the positive side, no new jobs are started if the limit is
>> exceeded. Here's the documentation that is confusing me...
>
> I think the job's resource usage will only be added to the Slurm database upon
> job completion. I believe that Slurm doesn't update the resource usage
> continually as you seem to expect.
>
>> If any limit is reached, all running jobs with that TRES in this group will
>> be killed, and no new jobs will be allowed to run.
>> Perhaps there is a setting or misconfiguration on my part.
>
> The sacctmgr manual page states:
>
>> GrpTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
>> The total number of TRES minutes that can possibly be used by past, present
>> and future jobs running from this association and its children. To clear a
>> previously set value use the modify command with a new value of -1 for each
>> TRES id.
>> NOTE: This limit is not enforced if set on the root association of a
>> cluster. So even though it may appear in sacctmgr output, it will not be
>> enforced.
>> ALSO NOTE: This limit only applies when using the Priority Multifactor
>> plugin. The time is decayed using the value of PriorityDecayHalfLife or
>> PriorityUsageResetPeriod as set in the slurm.conf. When this limit is
>> reached all associated jobs running will be killed and all future jobs
>> submitted with associations in the group will be delayed until they are able
>> to run inside the limit.
>
> Can you please confirm that you have configured the "Priority Multifactor"
> plugin?
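As an aside, the TRES-minutes bookkeeping the manual page describes is just "allocated TRES x wall-clock minutes" summed per association. A minimal sketch (illustrative only, not Slurm source; function names are my own), assuming cpu is the only TRES being limited and that, as observed, only completed jobs are charged to the database:

```python
# Sketch of GrpTRESMins=cpu accounting. Not Slurm code; the function
# names and the enforcement model are assumptions for illustration.

def cpu_minutes(ncpus: int, elapsed_minutes: float) -> float:
    """TRES-minutes consumed by one job: allocated CPUs x wall-clock minutes."""
    return ncpus * elapsed_minutes

def over_limit(recorded_usage: float, limit: float) -> bool:
    """Limit check against usage recorded in the accounting database."""
    return recorded_usage >= limit

# Association limit: GrpTRESMins=cpu=100
limit = 100.0

# A finished 4-CPU, 20-minute job has already charged 80 cpu-minutes.
completed = cpu_minutes(4, 20)   # 80.0

# A running 4-CPU job 30 minutes in would push the total to 200 cpu-minutes,
# but if its usage is only recorded at completion, the check still passes:
running = cpu_minutes(4, 30)     # 120.0

print(over_limit(completed, limit))            # False: running job keeps going
print(over_limit(completed + running, limit))  # True: over once recorded
```

This matches the behavior reported in this thread: new submissions are blocked once recorded usage exceeds the limit, but a job already running keeps accruing time.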
Here are the relevant items from slurm.conf:
# Activate the Multifactor Job Priority Plugin with decay
PriorityType=priority/multifactor
# apply no decay
PriorityDecayHalfLife=0
# reset usage after 1 month
PriorityUsageResetPeriod=MONTHLY
# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO
# The job's age factor reaches 1.0 after waiting in the
# queue for 2 weeks.
PriorityMaxAge=14-0
# This next group determines the weighting of each of the
# components of the Multifactor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor

> Your jobs should not be able to start if the user's GrpTRESMins has been
> exceeded. Hence they won't be killed!

Yes, this works fine.

> Can you explain step by step what you observe? It may be that the above
> documentation of killing jobs is in error, in which case we should make a bug
> report to SchedMD.

I set the GrpTRESMins limit to a very small number and then ran a sleep job
that exceeded the limit. The job continued to run past the limit until I
killed it. It was the only job in the queue. And if it makes any difference,
this testing is being done in AWS on a ParallelCluster.

> /Ole
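One thing worth noting about the config above: the manual page says usage is decayed via PriorityDecayHalfLife, and this cluster sets it to 0 (no decay), so recorded usage only drops at the MONTHLY reset. A hedged sketch of the half-life formula as I understand it (illustrative arithmetic only, not Slurm source):

```python
# Exponential half-life decay of recorded usage, as described for
# PriorityDecayHalfLife. Assumption for illustration: 0 disables decay,
# matching the "apply no decay" comment in the slurm.conf above.

def decayed_usage(usage: float, minutes_elapsed: float,
                  half_life_minutes: float) -> float:
    """Remaining recorded usage after minutes_elapsed of decay."""
    if half_life_minutes <= 0:
        return usage  # decay disabled
    return usage * 0.5 ** (minutes_elapsed / half_life_minutes)

print(decayed_usage(100.0, 0, 1440))     # 100.0: no time has passed
print(decayed_usage(100.0, 1440, 1440))  # 50.0: one half-life (1 day) elapsed
print(decayed_usage(100.0, 1440, 0))     # 100.0: decay disabled
```

With decay disabled, a tiny GrpTRESMins limit stays exceeded until the monthly reset, which is consistent with new jobs being blocked in the test described above.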