Did anyone ever answer this? We're having a similar issue, though we're on
16.05. This used to work for us, but now the GrpCPUMins limit doesn't seem to
be enforced at all. I'm not sure when it stopped working, but it must be
related to some config change we made.
Here are a couple of relevant lines from the slurm.conf:
AccountingStorageEnforce=associations,limits,qos,safe
AccountingStorageType=accounting_storage/slurmdbd
In our setup, each allocation account has a corresponding QOS, and the
GrpCPUMins limit is set on that QOS. Here's an example where I set the
limit very low for testing:
$ sacctmgr show qos pwiegand format=Name,GrpTRESMins
Name GrpTRESMins
---------- -------------
pwiegand cpu=5820
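For reference, the QOS and its limit were created with sacctmgr along these
lines (reconstructed from memory, so the exact invocations may differ
slightly):
$ sacctmgr add qos pwiegand
$ sacctmgr modify qos where name=pwiegand set GrpTRESMins=cpu=5820
$ sacctmgr modify account where name=pwiegand set QOS=pwiegand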
But the limit is not enforced, and users regularly use more resources than
they are allocated. Here is an example for the above user:
$ sreport cluster AccountUtilizationByUser start=2016-09-01T00:00:00 \
    end=2016-10-01T00:00:00 Accounts=pwiegand -t Minutes Format=Account,Login,Used
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2016-09-01T00:00:00 - 2016-09-01T12:59:59
(46800 secs)
Use reported in TRES Minutes
--------------------------------------------------------------------------------
Account Login Used
--------------- --------- --------
pwiegand 6262
pwiegand pwiegand 6262
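So the report shows 6262 TRES-minutes used against the cpu=5820 cap, yet new
jobs keep starting. To see what slurmctld itself thinks the QOS usage is, I
have been dumping its internal state (scontrol's assoc_mgr support is 15.08+,
so the exact flags below may need adjusting):
$ scontrol show assoc_mgr qos=pwiegand flags=qos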
Is there something obvious I am doing wrong?
Thanks,
Paul.
> On May 31, 2016, at 10:49, Jens Svalgaard Kohrt <[email protected]> wrote:
>
> Hi
>
> We've just updated our Slurm installation from 14.11 to 15.08.11 and have run
> into the problem that GrpCPUMins/GrpTRESMins no longer seem to be enforced.
> Has anybody else had this problem?
>
> Reading the release notes for 15.08.11, nothing in particular springs to mind.
> Do we need to do something special to get GrpCPUMins/GrpTRESMins
> enforcement set up correctly?
>
> The documentation <http://slurm.schedmd.com/resource_limits.html> mentions
> both GrpCPUMins and GrpTRESMins, but I guess that GrpCPUMins is only there
> for historical reasons...
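>
> (As far as I can tell, since 15.08 GrpCPUMins is just an alias: setting it
> updates the cpu field of GrpTRESMins, so for example
> [root@slurm1 ~]# sacctmgr modify account where name=sdutest01_slim set GrpCPUMins=1440
> ends up stored as GrpTRESMins=cpu=1440. Please correct me if that is wrong.)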
>
> An example:
> I can submit a four-node (96-core), 24-hour job even when the account has
> less than 1440 CPU-minutes left, i.e., less than one node-hour.
> Previously the job would have been rejected or left pending with
> Reason=AssocGrpCPUMinsLimit.
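> (For scale: 96 CPUs x 24 hours x 60 = 138,240 CPU-minutes requested, against
> fewer than 1,440 remaining.)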
>
> [root@slurm1 slurm]# scontrol show job 326248
> JobId=326248 JobName=run.sh
> UserId=svalle(6003) GroupId=sdu(3010)
> Priority=7411 Nice=0 Account=sdutest01_slim QOS=slim WCKey=*
> JobState=PENDING Reason=Priority Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
> SubmitTime=2016-05-31T15:43:00 EligibleTime=2016-05-31T15:43:00
> StartTime=2016-05-31T20:44:58 EndTime=Unknown
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> Partition=slim AllocNode:Sid=fe2:439
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=(null) SchedNodeList=s51p[21,24-25,28]
> NumNodes=4-4 NumCPUs=96 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=96,node=4
> Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
> MinCPUsNode=24 MinMemoryNode=0 MinTmpDiskNode=0
> Features=(null) Gres=(null) Reservation=(null)
> Shared=0 Contiguous=0 Licenses=(null) Network=(null)
> Command=/gpfs/gss1/work/sdutest01/svalle/run.sh
> WorkDir=/gpfs/gss1/work/sdutest01/svalle
> StdErr=/gpfs/gss1/work/sdutest01/svalle/slurm-326248.out
> StdIn=/dev/null
> StdOut=/gpfs/gss1/work/sdutest01/svalle/slurm-326248.out
> Power= SICP=0
>
>
>
> [root@slurm1 slurm]# sshare -u svalle -l
> Account              User       RawShares  NormShares     RawUsage  NormUsage  EffectvUsage  FairShare     LevelFS  GrpTRESMins  TRESRunMins
> -------------------- ---------- ---------- ----------- ------------ ---------- ------------- ---------- ---------- ------------ ------------------------------
> root                                         0.000000   14149970808               1.000000                                      cpu=14155926,mem=0,energy=0,n+
> ...
> sdutest01_slim                           1   0.000000           384   0.000000    0.000000               12.275540  cpu=1440     cpu=0,mem=0,energy=0,node=0
>  sdutest01_slim      svalle           100   0.142857           384   0.000000    1.000000    0.739785    0.142857               cpu=0,mem=0,energy=0,node=0
> ...
>
>
> [root@slurm1 tmp]# scontrol show config | grep Account
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = associations,limits,qos,safe,wckeys
> AccountingStorageHost = slurm1
> AccountingStorageLoc = N/A
> AccountingStoragePort = 6819
> AccountingStorageTRES = cpu,mem,energy,node
> AccountingStorageType = accounting_storage/slurmdbd
> AccountingStorageUser = N/A
> AccountingStoreJobComment = Yes
>
> [root@slurm1 slurm]# sacctmgr show qos
> Name       Priority  GraceTime  Preempt  PreemptMode  Flags  UsageThres  UsageFactor  GrpTRES  GrpTRESMins  GrpTRESRunMin  GrpJobs  GrpSubmit  GrpWall  MaxTRES  MaxTRESPerNode  MaxTRESMins  MaxWall  MaxTRESPU  MaxJobsPU  MaxSubmitPU  MinTRES
> ---------- --------- ---------- -------- ------------ ------ ----------- ------------ -------- ------------ -------------- -------- ---------- -------- -------- --------------- ------------ -------- ---------- ---------- ------------ --------
> ...
> slim       0         00:00:00            cluster                            1.000000
>
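> Note that the 'slim' QOS itself has no GrpTRESMins set; our cpu=1440 limit
> sits on the account association, as the dump below shows. I suppose we could
> also try putting the limit on the QOS directly (untested):
>
> [root@slurm1 slurm]# sacctmgr modify qos where name=slim set GrpTRESMins=cpu=1440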
>
> [root@slurm1 ~]# sacctmgr dump deic
> ...
> Cluster - 'deic':Fairshare=1:QOS='normal'
> Parent - 'root'
> ...
> Account -
> 'sdutest01_slim':Description='sdutest01_slim':Organization='sdutest01':Fairshare=1:GrpTRESMins=cpu=1440:QOS='slim'
> ...
> Parent - 'sdutest01_slim'
> User - 'svalle':DefaultAccount='sysops_workq':Fairshare=100:QOS='+slim'
> ...
>
>
> Thanks,
>
> Jens Svalgaard Kohrt
> DeIC Nationale HPC Center, SDU
> Syddansk Universitet
> sdu.dk/staff/jesk