Danny Auble wrote on Sep, 11 13:00:52:
> Thanks for reporting this Chris.  Here is a patch
> (a9fa0d91f7d7ee05fd3aca4616db364f68ee0624) that fixes this.
I'm not sure I would have called that a bug.  In my reading of the man
page, MaxNodes/MaxCPUs work as advertised, as do GrpNodes/GrpCPUs.

> Thanks,
> Danny
>
> On 09/10/12 10:55, Chris Scheller wrote:
>> Andy Wettstein wrote on Sep, 10 09:01:02:
>>> On Sat, Sep 08, 2012 at 06:22:03PM -0600, Chris Scheller wrote:
>>>> Andy Wettstein wrote on Sep, 07 14:33:05:
>>>>> Hi,
>>>>>
>>>>> I'm seeing an issue with the QoS limits not being enforced.  I am
>>>>> using Slurm 2.4.  On the normal QoS I've got MaxCPUsPerUser=1024 and
>>>>> MaxNodesPerUser=64.  Those are the only limits besides MaxWall.  There is
>>>> I believe those are per-job limits.  You want to use the GrpCPUs and
>>>> GrpNodes options instead.
>>> That's not my understanding from the manual.  From what I can tell:
>>> MaxNodes and MaxCPUs are enforced per job,
>>> MaxNodesPerUser and MaxCPUsPerUser are enforced per user, and
>>> GrpNodes and GrpCPUs are enforced for the QoS as a whole.
>> True, unless you apply GrpCPUs/GrpNodes at the user association
>> level.  I do this to limit the total number of cores a single user can
>> use over all their jobs.  It's kind of annoying to have to apply it at
>> the user level, but it has the intended effect.
>>
>>> AccountingStorageEnforce=limits,qos is set in the slurm.conf.
>>>
>>> I was just now able to understand how to reproduce this.  It looks
>>> like I can exceed the per-user limits as long as my current jobs are
>>> under the limits and the next one to start exceeds them.
>>>
>>> This will help understand the problem I think:
>>>
>>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>>> [$] <> sbatch -N 63 hello1.sh
>>> Submitted batch job 1732073
>>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>>> [$] <> sbatch -N 2 hello1.sh
>>> Submitted batch job 1732074
>>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>>> [$] <> sbatch -N 2 hello1.sh
>>> Submitted batch job 1732075
>>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>>> [$] <> squeue -u wettstein
>>>   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>>> 1732075    sandyb hello1.s wettstei PD  0:00     2 (QOSResourceLimit)
>>> 1732073    sandyb hello1.s wettstei  R  0:08    63 midway[043-044,046-047,050,053-074,077-093,095,097,102-103,105-112,115,119-124]
>>> 1732074    sandyb hello1.s wettstei  R  0:04     2 midway[043-044]
>>>
>>> The second job started and I was able to exceed the MaxNodesPerUser=64
>>> limit.  The third job didn't start because I was already over the
>>> limit.  It seems like the limit checking might not be taking into
>>> account the number of nodes requested for the job that is being
>>> started.

--
Chris Scheller
Unix System Administrator
Department of Biostatistics
School of Public Health
University of Michigan
Phone: (734) 615-7439
Office: M4218
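[Editorial note: not Slurm source code, just an illustration.  The squeue output above is consistent with an admission check that compares only the user's current node usage against MaxNodesPerUser, ignoring what the candidate job itself requests.  A minimal shell sketch of the two checks (function names hypothetical):]

```shell
max_nodes_per_user=64

# Buggy check: only looks at nodes already in use by the user's
# running jobs, ignoring the candidate job's own request.
buggy_can_start() {  # $1 = nodes currently in use, $2 = nodes requested
    [ "$1" -lt "$max_nodes_per_user" ] && echo yes || echo no
}

# Fixed check: counts the candidate job's request toward the limit.
fixed_can_start() {  # $1 = nodes currently in use, $2 = nodes requested
    [ $(( $1 + $2 )) -le "$max_nodes_per_user" ] && echo yes || echo no
}

# 63 nodes running, 2 more requested (total 65 > 64):
buggy_can_start 63 2    # prints "yes" -> job 1732074 starts, limit exceeded
fixed_can_start 63 2    # prints "no"  -> job would stay pending

# With 65 nodes already in use, even the buggy check blocks the next job:
buggy_can_start 65 2    # prints "no"  -> job 1732075 pending (QOSResourceLimit)
```

This matches the reproduction: the 2-node job slips through while usage is still under 64, and only the third job is held with QOSResourceLimit.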

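[Editorial note: for reference, a sketch of how the limits discussed in the thread are set with sacctmgr.  The QoS name "normal" and user "wettstein" come from the thread; exact option names and behavior vary by Slurm version, and these commands require a running slurmdbd, so treat this as illustrative configuration rather than verified 2.4 syntax.]

```shell
# Per-user caps attached to the QoS, as described in the first message:
sacctmgr modify qos normal set MaxCPUsPerUser=1024 MaxNodesPerUser=64

# Alternative mentioned in the thread: apply Grp* limits at the user
# association level to cap one user's total across all running jobs:
sacctmgr modify user wettstein set GrpCPUs=1024 GrpNodes=64

# Either way, limits are only enforced when slurm.conf contains:
#   AccountingStorageEnforce=limits,qos
```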