Danny Auble wrote on Sep, 11 13:00:52:
> Thanks for reporting this Chris.  Here is a patch
> (a9fa0d91f7d7ee05fd3aca4616db364f68ee0624) that fixes this.
I'm not sure I would have called that a bug.  In my reading of the man
page, MaxNodes/MaxCPUs work as advertised, as do GrpNodes/GrpCPUs.

> Thanks,
> Danny
>
> On 09/10/12 10:55, Chris Scheller wrote:
>> Andy Wettstein wrote on Sep, 10 09:01:02:
>>> On Sat, Sep 08, 2012 at 06:22:03PM -0600, Chris Scheller wrote:
>>>> Andy Wettstein wrote on Sep, 07 14:33:05:
>>>>> Hi,
>>>>>
>>>>> I'm seeing an issue with the QoS limits not being enforced.  I am
>>>>> using Slurm 2.4.  On the normal QoS I've got MaxCPUsPerUser=1024 and
>>>>> MaxNodesPerUser=64.  Those are the only limits besides MaxWall.  There is
>>>> I believe those are per-job limits.  You want to use the GrpCPUs and
>>>> GrpNodes options instead.
>>> That's not my understanding from the manual.  From what I can tell:
>>> MaxNodes and MaxCPUs are enforced per job,
>>> MaxNodesPerUser and MaxCPUsPerUser are enforced per user, and
>>> GrpNodes and GrpCPUs are enforced for the QoS as a whole.
>> True, unless you apply GrpCPUs/GrpNodes at the user association
>> level.  I do this to limit the total number of cores a single user can
>> use over all their jobs.  It's kind of annoying to have to apply it at
>> the user level, but it has the intended effect.
>>
>>> AccountingStorageEnforce=limits,qos is set in the slurm.conf.
>>>
>>> I was just now able to understand how to reproduce this.  It looks
>>> like I can exceed the per-user limits as long as my current jobs are
>>> under the limits and the next one to start exceeds them.
>>>
>>> This will help understand the problem I think:
>>>
>>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>>> [$] <> sbatch -N 63 hello1.sh
>>> Submitted batch job 1732073
>>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>>> [$] <> sbatch -N 2 hello1.sh
>>> Submitted batch job 1732074
>>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>>> [$] <> sbatch -N 2 hello1.sh
>>> Submitted batch job 1732075
>>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>>> [$] <> squeue -u wettstein
>>>   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>>> 1732075    sandyb hello1.s wettstei PD  0:00     2 (QOSResourceLimit)
>>> 1732073    sandyb hello1.s wettstei  R  0:08    63 midway[043-044,046-047,050,053-074,077-093,095,097,102-103,105-112,115,119-124]
>>> 1732074    sandyb hello1.s wettstei  R  0:04     2 midway[043-044]
>>>
>>> The second job started and I was able to exceed the MaxNodesPerUser=64
>>> limit.  The third job didn't start because I was already over the
>>> limit.  It seems like the limit checking might not be taking into
>>> account the number of nodes requested for the job that is being
>>> started.

--
Chris Scheller
Unix System Administrator
Department of Biostatistics
School of Public Health
University of Michigan
Phone: (734) 615-7439
Office: M4218
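[Editorial note: not Slurm source code, just an illustration.  The squeue output above is consistent with an admission check that compares only the user's current node usage against MaxNodesPerUser, ignoring what the candidate job itself requests.  A minimal shell sketch of the two checks (function names hypothetical):]

```shell
max_nodes_per_user=64

# Buggy check: only looks at nodes already in use by the user's
# running jobs, ignoring the candidate job's own request.
buggy_can_start() {  # $1 = nodes currently in use, $2 = nodes requested
    [ "$1" -lt "$max_nodes_per_user" ] && echo yes || echo no
}

# Fixed check: counts the candidate job's request toward the limit.
fixed_can_start() {  # $1 = nodes currently in use, $2 = nodes requested
    [ $(( $1 + $2 )) -le "$max_nodes_per_user" ] && echo yes || echo no
}

# 63 nodes running, 2 more requested (total 65 > 64):
buggy_can_start 63 2    # prints "yes" -> job 1732074 starts, limit exceeded
fixed_can_start 63 2    # prints "no"  -> job would stay pending

# With 65 nodes already in use, even the buggy check blocks the next job:
buggy_can_start 65 2    # prints "no"  -> job 1732075 pending (QOSResourceLimit)
```

This matches the reproduction: the 2-node job slips through while usage is still under 64, and only the third job is held with QOSResourceLimit.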

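[Editorial note: for reference, a sketch of how the limits discussed in the thread are set with sacctmgr.  The QoS name "normal" and user "wettstein" come from the thread; exact option names and behavior vary by Slurm version, and these commands require a running slurmdbd, so treat this as illustrative configuration rather than verified 2.4 syntax.]

```shell
# Per-user caps attached to the QoS, as described in the first message:
sacctmgr modify qos normal set MaxCPUsPerUser=1024 MaxNodesPerUser=64

# Alternative mentioned in the thread: apply Grp* limits at the user
# association level to cap one user's total across all running jobs:
sacctmgr modify user wettstein set GrpCPUs=1024 GrpNodes=64

# Either way, limits are only enforced when slurm.conf contains:
#   AccountingStorageEnforce=limits,qos
```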