Did anyone ever answer this?  We're having a similar issue, though we're on 
16.05.  This used to work for us, but now the GrpCPUMins limit doesn't seem to 
be enforced at all.  I'm not sure when it stopped working, but it must be 
related to some config change we made.

Here are a couple of relevant lines from the slurm.conf:

AccountingStorageEnforce=associations,limits,qos,safe
AccountingStorageType=accounting_storage/slurmdbd


In our setup, each allocation account has a corresponding QOS, and the 
GrpCPUMins value is limited in the QOS.  Here's an example where I set the 
limit very low for testing:

$ sacctmgr show qos pwiegand format=Name,GrpTRESMins
      Name   GrpTRESMins 
---------- ------------- 
  pwiegand      cpu=5820 
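
For context, here is what that test limit works out to in more familiar units
(just a sketch; the 4-core job size below is an illustrative assumption, not
something from our config):

```python
# Convert the QOS GrpTRESMins limit (cpu=5820, i.e. 5820 cpu-minutes)
# into cpu-hours. The 4-core job below is a made-up example size.
limit_cpu_minutes = 5820

cpu_hours = limit_cpu_minutes / 60
print(cpu_hours)                       # 97.0 cpu-hours in total

# A hypothetical 4-core job could run roughly this many wall-clock
# hours before the group exhausts the limit:
print(limit_cpu_minutes / (4 * 60))    # 24.25
```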


But the limit is not enforced at all, and users regularly use more 
resources than they are allocated.  Here is an example for the above user:

$ sreport cluster AccountUtilizationByUser  start=2016-09-01T00:00:00  
end=2016-10-01T00:00:00 Accounts=pwiegand  -t Minutes  Format=Account,Login,Used
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2016-09-01T00:00:00 - 2016-09-01T12:59:59 
(46800 secs)
Use reported in TRES Minutes
--------------------------------------------------------------------------------
        Account     Login     Used 
--------------- --------- -------- 
       pwiegand               6262 
       pwiegand  pwiegand     6262 
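
To spell out the mismatch (numbers copied from the sacctmgr and sreport
output above; the expectation in the comment is my reading of the docs, not
observed behavior):

```python
# September usage already exceeds the QOS GrpTRESMins limit, so with
# AccountingStorageEnforce including "limits" and "safe" I would expect
# new jobs in this account to be held (e.g. AssocGrpCPUMinsLimit)
# rather than started.
grp_tres_mins = 5820   # cpu= limit on the pwiegand QOS
used_minutes = 6262    # sreport AccountUtilizationByUser, TRES minutes

over = used_minutes - grp_tres_mins
print(f"usage exceeds the limit by {over} cpu-minutes")
assert used_minutes > grp_tres_mins
```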


Is there something obvious I am doing wrong?

Thanks,
Paul.


> On May 31, 2016, at 10:49, Jens Svalgaard Kohrt <[email protected]> wrote:
> 
> Hi
> 
> We’ve just updated our slurm installation from 14.11 to 15.08.11 and have 
> run into the problem that GrpCPUMins/GrpTRESMins no longer seem to be enforced.
> Has anybody else had this problem?
> 
> Reading the release notes for 15.08.11, nothing in particular springs to mind.
> Do we need to do something special to get GrpCPUMins/GrpTRESMins 
> enforcement set up correctly?
> 
> The documentation <http://slurm.schedmd.com/resource_limits.html> mentions 
> both GrpCPUMins and GrpTRESMins, but I guess that GrpCPUMins is only there 
> for historical reasons...
> 
> An example:
> I can submit a four-node = 96-core, 24-hour job, even when the account 
> has less than 1440 cpu-minutes left, i.e., less than one node-hour.
> Previously the job would have been rejected or left in the 
> jobstate=AssocGrpCPUMinsLimit
> 
> [root@slurm1 slurm]#scontrol show job 326248
> JobId=326248 JobName=run.sh
>    UserId=svalle(6003) GroupId=sdu(3010)
>    Priority=7411 Nice=0 Account=sdutest01_slim QOS=slim WCKey=*
>    JobState=PENDING Reason=Priority Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
>    SubmitTime=2016-05-31T15:43:00 EligibleTime=2016-05-31T15:43:00
>    StartTime=2016-05-31T20:44:58 EndTime=Unknown
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    Partition=slim AllocNode:Sid=fe2:439
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null) SchedNodeList=s51p[21,24-25,28]
>    NumNodes=4-4 NumCPUs=96 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=96,node=4
>    Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
>    MinCPUsNode=24 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) Gres=(null) Reservation=(null)
>    Shared=0 Contiguous=0 Licenses=(null) Network=(null)
>    Command=/gpfs/gss1/work/sdutest01/svalle/run.sh
>    WorkDir=/gpfs/gss1/work/sdutest01/svalle
>    StdErr=/gpfs/gss1/work/sdutest01/svalle/slurm-326248.out
>    StdIn=/dev/null
>    StdOut=/gpfs/gss1/work/sdutest01/svalle/slurm-326248.out
>    Power= SICP=0
> 
> 
> 
> [root@slurm1 slurm]# sshare -u svalle -l
>              Account       User  RawShares  NormShares    RawUsage   
> NormUsage  EffectvUsage  FairShare    LevelFS                    GrpTRESMins  
>                   TRESRunMins
> -------------------- ---------- ---------- ----------- ----------- 
> ----------- ------------- ---------- ---------- 
> ------------------------------ ------------------------------
> root                                          0.000000 14149970808            
>       1.000000                                                      
> cpu=14155926,mem=0,energy=0,n+
> 
> ...
>  sdutest01_slim                          1    0.000000         384    
> 0.000000      0.000000             12.275540                       cpu=1440   
>  cpu=0,mem=0,energy=0,node=0
>   sdutest01_slim         svalle        100    0.142857         384    
> 0.000000      1.000000   0.739785   0.142857                                  
>  cpu=0,mem=0,energy=0,node=0
> ...
> 
> 
> [root@slurm1 tmp]# scontrol show config | grep Account
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = associations,limits,qos,safe,wckeys
> AccountingStorageHost   = slurm1
> AccountingStorageLoc    = N/A
> AccountingStoragePort   = 6819
> AccountingStorageTRES   = cpu,mem,energy,node
> AccountingStorageType   = accounting_storage/slurmdbd
> AccountingStorageUser   = N/A
> AccountingStoreJobComment = Yes
> 
> [root@slurm1 slurm]# sacctmgr show qos
>       Name   Priority  GraceTime    Preempt PreemptMode                       
>              Flags UsageThres UsageFactor       GrpTRES   GrpTRESMins 
> GrpTRESRunMin GrpJobs GrpSubmit     GrpWall       MaxTRES MaxTRESPerNode   
> MaxTRESMins     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU       MinTRES
> ---------- ---------- ---------- ---------- ----------- 
> ---------------------------------------- ---------- ----------- ------------- 
> ------------- ------------- ------- --------- ----------- ------------- 
> -------------- ------------- ----------- ------------- --------- ----------- 
> -------------
> ...
>       slim          0   00:00:00                cluster                       
>                                  1.000000
> 
> 
> [root@slurm1 ~]# sacctmgr dump deic
> ...
> Cluster - 'deic':Fairshare=1:QOS='normal'
> Parent - 'root'
> ...
> Account - 
> 'sdutest01_slim':Description='sdutest01_slim':Organization='sdutest01':Fairshare=1:GrpTRESMins=cpu=1440:QOS='slim'
> ...
> Parent - 'sdutest01_slim'
> User - 'svalle':DefaultAccount='sysops_workq':Fairshare=100:QOS='+slim'
> ...
> 
> 
> Thanks,
> 
> Jens Svalgaard Kohrt
> DeIC Nationale HPC Center, SDU
> Syddansk Universitet
> sdu.dk/staff/jesk