Thanks for reporting this, Chris. Here is a patch (a9fa0d91f7d7ee05fd3aca4616db364f68ee0624) that fixes this.
Thanks,
Danny

On 09/10/12 10:55, Chris Scheller wrote:
> Andy Wettstein wrote on Sep 10, 09:01:02:
>> On Sat, Sep 08, 2012 at 06:22:03PM -0600, Chris Scheller wrote:
>>> Andy Wettstein wrote on Sep 7, 14:33:05:
>>>> Hi,
>>>>
>>>> I'm seeing an issue with the QoS limits not being enforced. I am using
>>>> Slurm 2.4. On the normal QoS I've got MaxCPUsPerUser=1024 and
>>>> MaxNodesPerUser=64. Those are the only limits besides MaxWall. There is
>>> I believe those are per-job limits. You want to use the GrpCPUs and
>>> GrpNodes options instead.
>> That's not my understanding from the manual. From what I can tell:
>> MaxNodes and MaxCPUs are enforced per job,
>> MaxNodesPerUser and MaxCPUsPerUser are enforced per user, and
>> GrpNodes and GrpCPUs are enforced for the QoS as a whole.
> True, unless you apply the GrpCPUs/GrpNodes limits at the user association
> level. I do this to limit the total number of cores a single user can
> use across all their jobs. Kind of annoying to have to apply it at the
> user level, but it has the intended effect.
>
>> AccountingStorageEnforce=limits,qos is set in the slurm.conf.
>>
>> I was just now able to understand how to reproduce this. It looks like I
>> can exceed the per-user limits as long as my current jobs are under the
>> limits and the next one to start exceeds them.
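For reference, per-user QoS limits like the ones described above are normally set with sacctmgr. A sketch, assuming a QoS named `normal` already exists (the values match those in the report; adjust names to your site):

```shell
# Set the per-user limits on the "normal" QoS (Slurm 2.4-era option names).
sacctmgr modify qos normal set MaxCPUsPerUser=1024 MaxNodesPerUser=64

# Verify the limits took effect.
sacctmgr show qos normal format=Name,MaxCPUsPerUser,MaxNodesPerUser
```

For these limits to be enforced at all, slurm.conf must also contain `AccountingStorageEnforce=limits,qos`, as noted in the thread.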
>>
>> This will help understand the problem, I think:
>>
>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>> [$] <> sbatch -N 63 hello1.sh
>> Submitted batch job 1732073
>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>> [$] <> sbatch -N 2 hello1.sh
>> Submitted batch job 1732074
>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>> [$] <> sbatch -N 2 hello1.sh
>> Submitted batch job 1732075
>> [wettstein@midway-login2] - [~/mpi] - [Mon Sep 10, 09:16]
>> [$] <> squeue -u wettstein
>>   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>> 1732075    sandyb hello1.s wettstei PD  0:00     2 (QOSResourceLimit)
>> 1732073    sandyb hello1.s wettstei  R  0:08    63 midway[043-044,046-047,050,053-074,077-093,095,097,102-103,105-112,115,119-124]
>> 1732074    sandyb hello1.s wettstei  R  0:04     2 midway[043-044]
>>
>> The second job started and I was able to exceed the MaxNodesPerUser=64
>> limit. The third job didn't start because I was already over the limit.
>> It seems like the limit checking might not be taking into account the
>> number of nodes requested for the job that is being started.
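The reproduction above is consistent with the limit check looking only at a user's current usage and ignoring the candidate job's own request. A minimal sketch of the two behaviors (hypothetical function names, not actual Slurm source):

```python
def can_start(job_nodes, user_running_nodes, max_nodes_per_user):
    """Correct check: count the candidate job's request before starting it."""
    return user_running_nodes + job_nodes <= max_nodes_per_user


def buggy_can_start(job_nodes, user_running_nodes, max_nodes_per_user):
    """Behavior described in the report: only current usage is compared."""
    return user_running_nodes <= max_nodes_per_user


# Reproduction from the thread, with MaxNodesPerUser=64:
# 63 nodes already running, second job asks for 2 more.
print(buggy_can_start(2, 63, 64))  # True  -> job 1732074 starts, user at 65
print(can_start(2, 63, 64))        # False -> correct check would deny it

# Third job: user is now at 65 nodes, so even the buggy check denies it,
# which matches the (QOSResourceLimit) pending reason on job 1732075.
print(buggy_can_start(2, 65, 64))  # False
```

This is why the user could exceed the limit by exactly one job: the first check that fails is the one performed after the limit has already been crossed.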
