Thanks for reporting this, Chris. Here is a patch (a9fa0d91f7d7ee05fd3aca4616db364f68ee0624) that fixes this.
Thanks,
Danny

On 09/10/12 10:55, Chris Scheller wrote:
> Andy Wettstein wrote on Sep 10, 09:01:02:
>> On Sat, Sep 08, 2012 at 06:22:03PM -0600, Chris Scheller wrote:
>>> Andy Wettstein wrote on Sep 7, 14:33:05:
>>>> Hi,
>>>>
>>>> I'm seeing an issue with the QoS limits not being enforced. I am using
>>>> Slurm 2.4. On the normal QoS I've got MaxCPUsPerUser=1024 and
>>>> MaxNodesPerUser=64. Those are the only limits besides MaxWall. There is
>>> I believe those are per-job limits. You want to use the GrpCPUs and
>>> GrpNodes options instead.
>> That's not my understanding from the manual. From what I can tell:
>> MaxNodes and MaxCPUs are enforced per job,
>> MaxNodesPerUser and MaxCPUsPerUser are enforced per user, and
>> GrpNodes and GrpCPUs are enforced for the QoS as a whole.
> True, unless you apply the GrpCPUs/GrpNodes limits at the user association
> level. I do this to limit the total number of cores a single user can
> use across all their jobs. Kind of annoying to have to apply it at the
> user level, but it has the intended effect.
>
>> AccountingStorageEnforce=limits,qos is set in the slurm.conf.
>>
>> I was just now able to understand how to reproduce this. It looks like I
>> can exceed the per-user limits as long as my current jobs are under the
>> limits and the next one to start exceeds them.
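For reference, per-user QoS limits like the ones described above are normally set with sacctmgr. A sketch, assuming a QoS named `normal` already exists (the values match those in the report; adjust names to your site):

```shell
# Set the per-user limits on the "normal" QoS (Slurm 2.4-era option names).
sacctmgr modify qos normal set MaxCPUsPerUser=1024 MaxNodesPerUser=64

# Verify the limits took effect.
sacctmgr show qos normal format=Name,MaxCPUsPerUser,MaxNodesPerUser
```

For these limits to be enforced at all, slurm.conf must also contain `AccountingStorageEnforce=limits,qos`, as noted in the thread.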
>>
>> This will help understand the problem, I think:
>>
>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>> [$] <> sbatch -N 63 hello1.sh
>> Submitted batch job 1732073
>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>> [$] <> sbatch -N 2 hello1.sh
>> Submitted batch job 1732074
>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>> [$] <> sbatch -N 2 hello1.sh
>> Submitted batch job 1732075
>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>> [$] <> squeue -u wettstein
>>   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>> 1732075    sandyb hello1.s wettstei PD  0:00     2 (QOSResourceLimit)
>> 1732073    sandyb hello1.s wettstei  R  0:08    63 midway[043-044,046-047,050,053-074,077-093,095,097,102-103,105-112,115,119-124]
>> 1732074    sandyb hello1.s wettstei  R  0:04     2 midway[043-044]
>>
>> The second job started and I was able to exceed the MaxNodesPerUser=64
>> limit. The third job didn't start because I was already over the limit.
>> It seems like the limit checking might not be taking into account the
>> number of nodes requested for the job that is being started.
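The reproduction above is consistent with the limit check looking only at a user's current usage and ignoring the candidate job's own request. A minimal sketch of the two behaviors (hypothetical function names, not actual Slurm source):

```python
def can_start(job_nodes, user_running_nodes, max_nodes_per_user):
    """Correct check: count the candidate job's request before starting it."""
    return user_running_nodes + job_nodes <= max_nodes_per_user


def buggy_can_start(job_nodes, user_running_nodes, max_nodes_per_user):
    """Behavior described in the report: only current usage is compared."""
    return user_running_nodes <= max_nodes_per_user


# Reproduction from the thread, with MaxNodesPerUser=64:
# 63 nodes already running, second job asks for 2 more.
print(buggy_can_start(2, 63, 64))  # True  -> job 1732074 starts, user at 65
print(can_start(2, 63, 64))        # False -> correct check would deny it

# Third job: user is now at 65 nodes, so even the buggy check denies it,
# which matches the (QOSResourceLimit) pending reason on job 1732075.
print(buggy_can_start(2, 65, 64))  # False
```

This is why the user could exceed the limit by exactly one job: the first check that fails is the one performed after the limit has already been crossed.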
