[slurm-dev] Re: QoS limit issues

Chris Scheller Mon, 10 Sep 2012 10:51:07 -0700

Andy Wettstein wrote on Sep, 10 09:01:02:
> 
> On Sat, Sep 08, 2012 at 06:22:03PM -0600, Chris Scheller wrote:
> > 
> > Andy Wettstein wrote on Sep, 07 14:33:05:
> > > 
> > > Hi,
> > > 
> > > I'm seeing an issue with the QoS limits not being enforced. I am using
> > > Slurm 2.4. On the normal QoS I've got MaxCPUsPerUser=1024 and
> > > MaxNodesPerUser=64. Those are the only limits besides MaxWall. There is
> > 
> > I believe those are per job limits. You want to use the GrpCPUs and
> > GrpNodes options instead.
> 
> That's not my understanding from the manual. From what I can tell:
>   MaxNodes and MaxCPUs are enforced per job
>   MaxNodesPerUser and MaxCPUsPerUser are enforced per user
>   GrpNodes and GrpCPUs are enforced for the QoS as a whole


True, unless you apply GrpCPUs/GrpNodes at the user association
level. I do this to limit the total number of cores a single user can
use across all of their jobs. It's kind of annoying to have to set it
at the user level, but it has the intended effect.

> 
> AccountingStorageEnforce=limits,qos is set in the slurm.conf.
> 
> I was just now able to work out how to reproduce this. It looks like I
> can exceed the per-user limits as long as my running jobs are under the
> limits, even if the next job to start pushes me over them.
> 
> I think this will help illustrate the problem:
> 
> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
> [$] <> sbatch -N 63 hello1.sh
> Submitted batch job 1732073
> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
> [$] <> sbatch -N 2 hello1.sh
> Submitted batch job 1732074
> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
> [$] <> sbatch -N 2 hello1.sh
> Submitted batch job 1732075
> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
> [$] <> squeue -u wettstein
>   JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
> 1732075    sandyb hello1.s wettstei  PD       0:00      2 (QOSResourceLimit)
> 1732073    sandyb hello1.s wettstei   R       0:08     63 midway[043-044,046-047,050,053-074,077-093,095,097,102-103,105-112,115,119-124]
> 1732074    sandyb hello1.s wettstei   R       0:04      2 midway[043-044]
> 
> 
> The second job started and I was able to exceed the MaxNodesPerUser=64
> limit. The third job didn't start because I was already over the limit.
> It seems like the limit checking might not be taking into account the
> number of nodes requested for the job that is being started.
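
To make that hypothesis concrete, here is a minimal Python sketch of the
two ways a per-user node limit could be checked. It is only a toy model,
not Slurm's scheduler code: the function names and the exact boundary
comparison are assumptions made for illustration. The limit (64) and the
job sizes (63, 2, 2) are taken from the session quoted above.

```python
# Toy model of a per-user node limit check. NOT Slurm's actual code;
# all names here are hypothetical and exist only for illustration.
MAX_NODES_PER_USER = 64  # MaxNodesPerUser on the normal QoS


def check_ignoring_request(nodes_in_use, nodes_requested, limit=MAX_NODES_PER_USER):
    """Suspected behavior: only nodes already in use are compared to the limit.

    The strict '<' is a guess; the quoted session cannot distinguish it
    from '<='.
    """
    return nodes_in_use < limit


def check_counting_request(nodes_in_use, nodes_requested, limit=MAX_NODES_PER_USER):
    """Expected behavior: the new job's request is added before comparing."""
    return nodes_in_use + nodes_requested <= limit


def simulate(requests, check):
    """Try to start each job in order; return (job states, nodes in use)."""
    nodes_in_use = 0
    states = []
    for nodes_requested in requests:
        if check(nodes_in_use, nodes_requested):
            nodes_in_use += nodes_requested
            states.append("R")   # running
        else:
            states.append("PD")  # pending
    return states, nodes_in_use


if __name__ == "__main__":
    jobs = [63, 2, 2]  # node counts from the three sbatch calls
    print(simulate(jobs, check_ignoring_request))
    print(simulate(jobs, check_counting_request))
```

With the first check the toy model gives R, R, PD and ends with 65 nodes
in use, which matches the squeue output. With the second check both
2-node jobs would stay pending and usage would stop at 63 nodes.
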
-- 
Chris Scheller
Unix System Administrator
Department of Biostatistics
School of Public Health
University of Michigan
Phone: (734) 615-7439
Office: M4218
