Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

2020-05-08 Thread Killian Murphy
Hi Thomas.

The output you provided from sacctmgr doesn't look quite right to me. There
is a field count mismatch between the header line and the rows, and I'm not
seeing some fields that I would expect to see, particularly MaxTRESPU
(MaxTRESPerUser) - I don't think this is a Slurm version difference, as I'm
on 18.08.4. Apologies if I'm missing something obvious there!
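
(It might be easier to check the limit with an explicit field list, something
along the lines of

sacctmgr show qos format=Name,Priority,MaxTRESPU -p

though the exact field names can vary a little between sacctmgr versions.)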

Do you have AccountingStorageTRES (slurm.conf) set to track GPUs?
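
(For example, something along these lines in slurm.conf, with the exact TRES
list being site-specific:

AccountingStorageTRES=gres/gpu

followed by a restart of slurmctld, if I remember right, so that gres/gpu is a
tracked TRES that the QoS limits can act on.)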

Killian





Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

2020-05-07 Thread Theis, Thomas
Hello Killian,

Unfortunately, after setting the partition configuration to include the QoS,
restarting the service, and verifying with sacctmgr, I still have the same
issue.


Thomas Theis


Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

2020-05-07 Thread Killian Murphy
Hi Thomas.

With that partition configuration, I suspect jobs are going through the
partition without the QoS 'normal', which is what restricts the number of GPUs
per user.

You may find that reconfiguring the partition to have a QoS of 'normal'
will result in the GPU limit being applied, as intended. This is set in the
partition configuration in slurm.conf.
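
Roughly, something like this in slurm.conf, with the node list and other options
here purely illustrative:

PartitionName=PART1 Nodes=node[1-11] QOS=normal State=UP ...

followed by an 'scontrol reconfigure' (or a slurmctld restart) so the partition
picks up the QoS.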

Killian

Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

2020-05-07 Thread Theis, Thomas
Here are the outputs:
sacctmgr show qos -p

Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GtPA|MinTRES|
normal|1|00:00:00||cluster|||1.00|gres/gpu=2||gres/gpu=2|||
now|100|00:00:00||cluster|||1.00||
high|10|00:00:00||cluster|||1.00||

scontrol show part

PartitionName=PART1
   AllowGroups=trace_unix_group AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO 
MaxCPUsPerNode=UNLIMITED
   Nodes=node1,node2,node3,node4,….   PriorityJobFactor=1 PriorityTier=1 
RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=236 TotalNodes=11 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

Thomas Theis


Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

2020-05-06 Thread Sean Crosby
Do you have other limits set? QoS limits are hierarchical, and a partition QoS
in particular can override other QoS limits.

What's the output of

sacctmgr show qos -p

and

scontrol show part

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia




Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

2020-05-06 Thread Theis, Thomas
I still have the same issue after updating the user and the QoS.
The command I am using:
'sacctmgr modify qos normal set MaxTRESPerUser=gres/gpu=2'
I restarted the services. Unfortunately I am still able to saturate the cluster
with jobs.

We have a cluster of 10 nodes, each with 4 GPUs, for a total of 40 GPUs. Each
node is identical in software, OS, Slurm version, etc. I am trying to limit each
user to using only 2 out of the 40 GPUs across the entire cluster or partition,
as an intentional bottleneck so no one can saturate the cluster.

I.e. the desired outcome would be: person A submits 100 jobs; 2 run, 98 are
pending, and 38 GPUs sit idle. Once the 2 running jobs finish, 2 more run and 96
are pending, with 38 GPUs still idle.



Thomas Theis



Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

2020-05-05 Thread Chris Samuel
On Tuesday, 5 May 2020 3:48:22 PM PDT Sean Crosby wrote:

> sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

Also don't forget you need to tell Slurm to enforce QOS limits with:

AccountingStorageEnforce=safe,qos

in your Slurm configuration ("safe" is good to set, and turns on enforcement of 
other restrictions around associations too).  See:

https://slurm.schedmd.com/resource_limits.html
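
You can check what the running controller currently has with something like:

scontrol show config | grep AccountingStorageEnforce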

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

2020-05-05 Thread Sean Crosby
Hi Thomas,

That value should be

sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4
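
(Note the limit only applies to jobs that actually run under that QoS, e.g.
submitted with something like

#SBATCH --qos=gpujobs

or with 'gpujobs' attached as the partition QoS, as Killian described.)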

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Wed, 6 May 2020 at 04:53, Theis, Thomas 
wrote:

>
> Hey Killian,
>
>
>
> I tried to limit the number of GPUs a user can run on at a time by adding
> MaxTRESPerUser = gres:gpu4 to both the user and the QoS. I restarted the slurm
> control daemon, and unfortunately I am still able to run on all the GPUs in
> the partition. Any other ideas?
>
>
>
> Thomas Theis
>
> From: slurm-users  On Behalf Of Killian Murphy
> Sent: Thursday, April 23, 2020 1:33 PM
> To: Slurm User Community List 
> Subject: Re: [slurm-users] Limit the number of GPUS per user per partition
>
>
>
>
> Hi Thomas.
>
>
>
> We limit the maximum number of GPUs a user can have allocated in a
> partition through the MaxTRESPerUser field of a QoS for GPU jobs, which is
> set as the partition QoS on our GPU partition. I.E:
>
>
>
> We have a QOS `gpujobs` that sets MaxTRESPerUser => gres/gpu:4 to limit
> total number of allocated GPUs to 4, and set the GPU partition QoS to the
> `gpujobs` QoS.
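>
> Roughly, with the QoS and partition names here purely illustrative:
>
> sacctmgr add qos gpujobs
> sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4
> # and in slurm.conf: PartitionName=gpu ... QOS=gpujobs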
>
>
>
> There is a section in the Slurm documentation on the 'Resource Limits'
> page entitled 'QOS specific limits supported' (
> https://slurm.schedmd.com/resource_limits.html) that details some care
> needed when using this kind of limit setting with typed GRES. Although it
> seems like you are trying to do something with generic GRES, it's worth a
> read!
>
>
>
> Killian
>
>
>
>
>
>
>
> On Thu, 23 Apr 2020 at 18:19, Theis, Thomas 
> wrote:
>
> Hi everyone,
>
> This is my first message here. I am trying to find a good way, or multiple
> ways, to limit the usage of jobs per node or the use of GPUs per node, without
> blocking a user from submitting them.
>
>
>
> Example: we have 10 nodes, each with 4 GPUs, in a partition. We allow a team
> of 6 people to submit jobs to any or all of the nodes. One job per GPU;
> thus we can hold a total of 40 jobs concurrently in the partition.
>
> At the moment each user usually submits 50-100 jobs at once, taking up
> all GPUs, and all other users have to wait in pending.
>
>
>
> What I am trying to set up is to allow all users to submit as many jobs as
> they wish, but only run on 1 of the 4 GPUs per node, or on some number out
> of the total 40 GPUs across the entire partition. We are using Slurm 18.08.3.
>
>
>
> This is roughly our Slurm script:
>
>
>
> #SBATCH --job-name=Name # Job name
>
> #SBATCH --mem=5gb # Job memory request
>
> #SBATCH --ntasks=1
>
> #SBATCH --gres=gpu:1
>
> #SBATCH --partition=PART1
>
> #SBATCH --time=200:00:00   # Time limit hrs:min:sec
>
> #SBATCH --output=job_%j.log # Standard output and error log
>
> #SBATCH --nodes=1
>
> #SBATCH --qos=high
>
>
>
> srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c
> "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm -e
> SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT --name
> $SLURM_JOB_ID do_job.sh"
>
>
>
> Thomas Theis
>
>
>
>
>
>
> --
>
> Killian Murphy
>
> Research Software Engineer
>
>
>
> Wolfson Atmospheric Chemistry Laboratories
> University of York
> Heslington
> York
> YO10 5DD
> +44 (0)1904 32 4753
>
> e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm
>