Re: [slurm-users] Longer queuing times for larger jobs

2020-02-12 Thread Chris Samuel

On 5/2/20 1:44 pm, Antony Cleave wrote:

Hi, from what you are describing, it sounds like jobs are backfilling in
front of the large jobs and stopping them from starting.


We use a feature that SchedMD implemented for us called 
"bf_min_prio_reserve", which lets you set a priority threshold below 
which Slurm won't make a forward reservation for a job (so such a job can 
only start if it can run right now without delaying other jobs).


https://slurm.schedmd.com/slurm.conf.html#OPT_bf_min_prio_reserve

So if you can arrange your local priority system so that large jobs are 
over that threshold and smaller jobs are below it (or whatever suits 
your use case) then you should have a way to let these large jobs get a 
reliable start time without smaller jobs pushing them back in time.
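
As a rough sketch only (the threshold and weights below are made-up 
numbers, not a recommendation), the slurm.conf side of that idea looks 
something like:

  # Jobs with priority below the threshold get no forward reservation;
  # they can only run if they fit immediately without delaying anyone.
  SchedulerType=sched/backfill
  SchedulerParameters=bf_min_prio_reserve=100000,bf_continue

  # Let job size be the factor that lifts large jobs over the threshold.
  PriorityType=priority/multifactor
  PriorityWeightJobSize=500000
  PriorityWeightFairshare=100000
  PriorityWeightAge=10000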


There's some useful background from the bug where this was implemented:

https://bugs.schedmd.com/show_bug.cgi?id=2565

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Longer queuing times for larger jobs

2020-02-12 Thread Loris Bennett
Loris Bennett  writes:

> Hello David,
>
> David Baker  writes:
>
>> Hello,
>>
>> I've taken a very good look at our cluster, however as yet not made
>> any significant changes. The one change that I did make was to
>> increase the "jobsizeweight". That's now our dominant parameter and it
>> does ensure that our largest jobs (> 20 nodes) are making it to the
>> top of the sprio listing which is what we want to see.
>>
>> These large jobs aren't making any progress despite the priority
>> lift. I additionally decreased the nice value of the job that sparked
>> this discussion. That is (looking at sprio), there is a 32-node job
>> with a very high priority...
>>
>> JOBID   PARTITION  USER     PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
>> 280919  batch      mep1c10  1275481   40   59827      415655   0          0    -40
>>
>> That job has been sitting in the queue for well over a week and it is
>> disconcerting that we never see nodes becoming idle in order to
>> service these large jobs. Nodes do become idle and then get scooped by
>> jobs started by backfill. Looking at the slurmctld logs I see that the
>> vast majority of jobs are being started via backfill -- including, for
>> example, a 24 node job. I see very few jobs allocated by the
>> scheduler. That is, messages like sched: Allocate JobId=296915 are few
>> and far between and I never see any of the large jobs being allocated
>> in the batch queue.
>>
>> Surely, this is not correct, however does anyone have any advice on
>> what to check, please?
>
> Have you looked at what 'sprio' says?  I usually want to see the list
> sorted by priority and so call it like this:
>
>   sprio -l -S "%Y"

This should be

  sprio -l -S "Y"

[snip (242 lines)]

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] Longer queuing times for larger jobs

2020-02-05 Thread Antony Cleave
Hi, from what you are describing, it sounds like jobs are backfilling in
front of the large jobs and stopping them from starting.

You probably need to tweak your backfill window in SchedulerParameters in
slurm.conf; see here:

*bf_window=#* The number of minutes into the future to look when considering
jobs to schedule. Higher values result in more overhead and less
responsiveness. A value at least as long as the highest allowed time limit
is generally advisable to prevent job starvation. In order to limit the
amount of data managed by the backfill scheduler, if the value of
*bf_window* is increased, then it is generally advisable to also increase
*bf_resolution*. This option applies only to *SchedulerType=sched/backfill*.
Default: 1440 (1 day), Min: 1, Max: 43200 (30 days).
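
A quick way to check whether the backfill scheduler is actually keeping up
after a change like that (assuming sdiag is available on your system) is
its "Backfilling stats" section:

  sdiag            # look for "Backfilling stats": total backfilled jobs, cycle times
  sdiag --reset    # reset the counters before/after an experiment (needs admin)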

On Tue, 4 Feb 2020, 10:43 David Baker,  wrote:

> Hello,
>
> Thank you very much again for your comments and the details of your slurm
> configuration. All the information is really useful. We are working on our
> cluster right now and making some appropriate changes. We'll see how we get
> on over the next 24 hours or so.
>
> Best regards,
> David
> --
> *From:* slurm-users  on behalf of
> Renfro, Michael 
> *Sent:* 31 January 2020 22:08
> *To:* Slurm User Community List 
> *Subject:* Re: [slurm-users] Longer queuing times for larger jobs
>
> Slurm 19.05 now, though all these settings were in effect on 17.02 until
> quite recently. If I get some detail wrong below, I hope someone will
> correct me. But this is our current working state. We’ve been able to
> schedule 10-20k jobs per month since late 2017, and we successfully
> scheduled 320k jobs over December and January (largely due to one user
> using some form of automated submission for very short jobs).
>
> Basic scheduler setup:
>
> As I’d said previously, we prioritize on fairshare almost exclusively.
> Most of our jobs (molecular dynamics, CFD) end up in a single batch
> partition, since GPU and big-memory jobs have other partitions.
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=14-0
> PriorityWeightFairshare=10
> PriorityWeightAge=1000
> PriorityWeightPartition=1
> PriorityWeightJobSize=1000
> PriorityMaxAge=1-0
>
> TRES limits:
>
> We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser
> set grptresrunmin=cpu=144 — there might be a way of doing this at a
> higher accounting level, but it works as is.
>
> We also force QoS=gpu in each GPU partition’s definition in slurm.conf,
> and set MaxJobsPerUser equal to our total GPU count. That helps prevent
> users from queue-stuffing the GPUs even if they stay well below the 1000
> CPU-day TRES limit above.
>
> Backfill:
>
>   SchedulerType=sched/backfill
>
> SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200
>
> Can’t remember where I found the backfill guidance, but:
>
> - bf_window is set to our maximum job length (30 days) and bf_resolution
> is set to 1.5 days. Most of our users’ jobs are well over 1 day.
> - We have had users who didn’t use job arrays, and submitted a ton of
> small jobs at once, thus bf_max_job_user gives the scheduler a chance to
> start up to 80 jobs per user each cycle. This also prompted us to increase
> default_queue_depth, so the backfill scheduler would examine more jobs each
> cycle.
> - bf_continue should let the backfill scheduler continue where it left off
> if it gets interrupted, instead of having to start from scratch each time.
>
> I can guarantee you that our backfilling was sub-par until we tuned these
> parameters (or at least a few users could find a way to submit so many jobs
> that the backfill couldn’t keep up, even when we had idle resources for
> their very short jobs).
>
> > On Jan 31, 2020, at 3:01 PM, David Baker  wrote:
> >
> > External Email Warning
> > This email originated from outside the university. Please use caution
> when opening attachments, clicking links, or responding to requests.
> > Hello,
> >
> > Thank you for your detailed reply. That’s all very useful. I managed to
> mistype our cluster size: there are actually 450 standard 40-core compute
> nodes. What you say is interesting, and so it concerns me that things are
> so bad at the moment.
> >
> > I wondered if you could please give me some more details of how you use
> TRES to throttle user activity. We have applied some limits to throttle
> users, however perhaps not enough or not well enough. So the details of
> what you do would be really appreciated, please.
> >
> > In addition, we do use backfill, ho

Re: [slurm-users] Longer queuing times for larger jobs

2020-02-05 Thread Loris Bennett
Hello David,

David Baker  writes:

> Hello,
>
> I've taken a very good look at our cluster, however as yet not made
> any significant changes. The one change that I did make was to
> increase the "jobsizeweight". That's now our dominant parameter and it
> does ensure that our largest jobs (> 20 nodes) are making it to the
> top of the sprio listing which is what we want to see.
>
> These large jobs aren't making any progress despite the priority
> lift. I additionally decreased the nice value of the job that sparked
> this discussion. That is (looking at sprio), there is a 32-node job
> with a very high priority...
>
> JOBID   PARTITION  USER     PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
> 280919  batch      mep1c10  1275481   40   59827      415655   0          0    -40
>
> That job has been sitting in the queue for well over a week and it is
> disconcerting that we never see nodes becoming idle in order to
> service these large jobs. Nodes do become idle and then get scooped by
> jobs started by backfill. Looking at the slurmctld logs I see that the
> vast majority of jobs are being started via backfill -- including, for
> example, a 24 node job. I see very few jobs allocated by the
> scheduler. That is, messages like sched: Allocate JobId=296915 are few
> and far between and I never see any of the large jobs being allocated
> in the batch queue.
>
> Surely, this is not correct, however does anyone have any advice on
> what to check, please?

Have you looked at what 'sprio' says?  I usually want to see the list
sorted by priority and so call it like this:

  sprio -l -S "%Y"

If you run

 scontrol show job 

is the entry 'NodeList' ever anything other than '(null)'?
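
For example, something along these lines (using the job id quoted above;
exact field names can vary a little between Slurm versions):

  sprio -l -S "Y"      # priority components, sorted on the priority column
  scontrol show job 280919 | grep -E 'JobState|NodeList|StartTime'

For a pending job, NodeList stays '(null)' until the scheduler has actually
picked nodes for it, and StartTime shows the expected start once a
reservation exists.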

Cheers,

Loris

> Best regards,
> David
> --------------
> From: slurm-users  on behalf of 
> Killian Murphy 
> Sent: 04 February 2020 10:48
> To: Slurm User Community List 
> Subject: Re: [slurm-users] Longer queuing times for larger jobs 
>  
> Hi David. 
>
> I'd love to hear back about the changes that you make and how they affect the 
> performance of your scheduler.
>
> Any chance you could let us know how things go?
>
> Killian
>
> On Tue, 4 Feb 2020 at 10:43, David Baker  wrote:
>
>  Hello,
>
>  Thank you very much again for your comments and the details of your slurm 
> configuration. All the information is really useful. We are working on our 
> cluster right now and making some appropriate changes.
>  We'll see how we get on over the next 24 hours or so.
>
>  Best regards,
>  David
> --------------------------
>  From: slurm-users  on behalf of 
> Renfro, Michael 
>  Sent: 31 January 2020 22:08
>  To: Slurm User Community List 
>  Subject: Re: [slurm-users] Longer queuing times for larger jobs 
>   
>  Slurm 19.05 now, though all these settings were in effect on 17.02 until 
> quite recently. If I get some detail wrong below, I hope someone will correct 
> me. But this is our current working state. We’ve been able to
>  schedule 10-20k jobs per month since late 2017, and we successfully 
> scheduled 320k jobs over December and January (largely due to one user using 
> some form of automated submission for very short jobs).
>
>  Basic scheduler setup:
>
>  As I’d said previously, we prioritize on fairshare almost exclusively. Most 
> of our jobs (molecular dynamics, CFD) end up in a single batch partition, 
> since GPU and big-memory jobs have other partitions.
>
>  SelectType=select/cons_res
>  SelectTypeParameters=CR_Core_Memory
>  PriorityType=priority/multifactor
>  PriorityDecayHalfLife=14-0
>  PriorityWeightFairshare=10
>  PriorityWeightAge=1000
>  PriorityWeightPartition=1
>  PriorityWeightJobSize=1000
>  PriorityMaxAge=1-0
>
>  TRES limits:
>
>  We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser set 
> grptresrunmin=cpu=144 — there might be a way of doing this at a higher 
> accounting level, but it works as is.
>
>  We also force QoS=gpu in each GPU partition’s definition in slurm.conf, and 
> set MaxJobsPerUser equal to our total GPU count. That helps prevent users 
> from queue-stuffing the GPUs even if they stay well below
>  the 1000 CPU-day TRES limit above.
>
>  Backfill:
>
>SchedulerType=sched/backfill
>

Re: [slurm-users] Longer queuing times for larger jobs

2020-02-04 Thread David Baker
Hello,

I've taken a very good look at our cluster, however as yet not made any 
significant changes. The one change that I did make was to increase the 
"jobsizeweight". That's now our dominant parameter and it does ensure that our 
largest jobs (> 20 nodes) are making it to the top of the sprio listing which 
is what we want to see.

These large jobs aren't making any progress despite the priority lift. I 
additionally decreased the nice value of the job that sparked this discussion. 
That is (looking at sprio), there is a 32-node job with a very high 
priority...

JOBID   PARTITION  USER     PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
280919  batch      mep1c10  1275481   40   59827      415655   0          0    -40

That job has been sitting in the queue for well over a week and it is 
disconcerting that we never see nodes becoming idle in order to service these 
large jobs. Nodes do become idle and then get scooped by jobs started by 
backfill. Looking at the slurmctld logs I see that the  vast majority of jobs 
are being started via backfill -- including, for example, a 24 node job. I see 
very few jobs allocated by the scheduler. That is, messages like sched: 
Allocate JobId=296915 are few and far between and I never see any of the large 
jobs being allocated in the batch queue.

Surely, this is not correct, however does anyone have any advice on what to 
check, please?

Best regards,
David

From: slurm-users  on behalf of Killian 
Murphy 
Sent: 04 February 2020 10:48
To: Slurm User Community List 
Subject: Re: [slurm-users] Longer queuing times for larger jobs

Hi David.

I'd love to hear back about the changes that you make and how they affect the 
performance of your scheduler.

Any chance you could let us know how things go?

Killian

On Tue, 4 Feb 2020 at 10:43, David Baker 
mailto:d.j.ba...@soton.ac.uk>> wrote:
Hello,

Thank you very much again for your comments and the details of your slurm 
configuration. All the information is really useful. We are working on our 
cluster right now and making some appropriate changes. We'll see how we get on 
over the next 24 hours or so.

Best regards,
David

From: slurm-users 
mailto:slurm-users-boun...@lists.schedmd.com>>
 on behalf of Renfro, Michael mailto:ren...@tntech.edu>>
Sent: 31 January 2020 22:08
To: Slurm User Community List 
mailto:slurm-users@lists.schedmd.com>>
Subject: Re: [slurm-users] Longer queuing times for larger jobs

Slurm 19.05 now, though all these settings were in effect on 17.02 until quite 
recently. If I get some detail wrong below, I hope someone will correct me. But 
this is our current working state. We’ve been able to schedule 10-20k jobs per 
month since late 2017, and we successfully scheduled 320k jobs over December 
and January (largely due to one user using some form of automated submission 
for very short jobs).

Basic scheduler setup:

As I’d said previously, we prioritize on fairshare almost exclusively. Most of 
our jobs (molecular dynamics, CFD) end up in a single batch partition, since 
GPU and big-memory jobs have other partitions.

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=10
PriorityWeightAge=1000
PriorityWeightPartition=1
PriorityWeightJobSize=1000
PriorityMaxAge=1-0

TRES limits:

We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser set 
grptresrunmin=cpu=144 — there might be a way of doing this at a higher 
accounting level, but it works as is.

We also force QoS=gpu in each GPU partition’s definition in slurm.conf, and set 
MaxJobsPerUser equal to our total GPU count. That helps prevent users from 
queue-stuffing the GPUs even if they stay well below the 1000 CPU-day TRES 
limit above.

Backfill:

  SchedulerType=sched/backfill
  
SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200

Can’t remember where I found the backfill guidance, but:

- bf_window is set to our maximum job length (30 days) and bf_resolution is set 
to 1.5 days. Most of our users’ jobs are well over 1 day.
- We have had users who didn’t use job arrays, and submitted a ton of small 
jobs at once, thus bf_max_job_user gives the scheduler a chance to start up to 
80 jobs per user each cycle. This also prompted us to increase 
default_queue_depth, so the backfill scheduler would examine more jobs each 
cycle.
- bf_continue should let the backfill scheduler continue where it left off if 
it gets interrupted, instead of having to start from scratch each time.

I can guarantee you that our backfilling was sub-par until we tuned these 
parameters (or at least a few users could find a way to submit so many jobs 
that the backfill couldn’t keep up, even when we had 

Re: [slurm-users] Longer queuing times for larger jobs

2020-02-04 Thread Killian Murphy
Hi David.

I'd love to hear back about the changes that you make and how they affect
the performance of your scheduler.

Any chance you could let us know how things go?

Killian

On Tue, 4 Feb 2020 at 10:43, David Baker  wrote:

> Hello,
>
> Thank you very much again for your comments and the details of your slurm
> configuration. All the information is really useful. We are working on our
> cluster right now and making some appropriate changes. We'll see how we get
> on over the next 24 hours or so.
>
> Best regards,
> David
> --
> *From:* slurm-users  on behalf of
> Renfro, Michael 
> *Sent:* 31 January 2020 22:08
> *To:* Slurm User Community List 
> *Subject:* Re: [slurm-users] Longer queuing times for larger jobs
>
> Slurm 19.05 now, though all these settings were in effect on 17.02 until
> quite recently. If I get some detail wrong below, I hope someone will
> correct me. But this is our current working state. We’ve been able to
> schedule 10-20k jobs per month since late 2017, and we successfully
> scheduled 320k jobs over December and January (largely due to one user
> using some form of automated submission for very short jobs).
>
> Basic scheduler setup:
>
> As I’d said previously, we prioritize on fairshare almost exclusively.
> Most of our jobs (molecular dynamics, CFD) end up in a single batch
> partition, since GPU and big-memory jobs have other partitions.
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=14-0
> PriorityWeightFairshare=10
> PriorityWeightAge=1000
> PriorityWeightPartition=1
> PriorityWeightJobSize=1000
> PriorityMaxAge=1-0
>
> TRES limits:
>
> We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser
> set grptresrunmin=cpu=144 — there might be a way of doing this at a
> higher accounting level, but it works as is.
>
> We also force QoS=gpu in each GPU partition’s definition in slurm.conf,
> and set MaxJobsPerUser equal to our total GPU count. That helps prevent
> users from queue-stuffing the GPUs even if they stay well below the 1000
> CPU-day TRES limit above.
>
> Backfill:
>
>   SchedulerType=sched/backfill
>
> SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200
>
> Can’t remember where I found the backfill guidance, but:
>
> - bf_window is set to our maximum job length (30 days) and bf_resolution
> is set to 1.5 days. Most of our users’ jobs are well over 1 day.
> - We have had users who didn’t use job arrays, and submitted a ton of
> small jobs at once, thus bf_max_job_user gives the scheduler a chance to
> start up to 80 jobs per user each cycle. This also prompted us to increase
> default_queue_depth, so the backfill scheduler would examine more jobs each
> cycle.
> - bf_continue should let the backfill scheduler continue where it left off
> if it gets interrupted, instead of having to start from scratch each time.
>
> I can guarantee you that our backfilling was sub-par until we tuned these
> parameters (or at least a few users could find a way to submit so many jobs
> that the backfill couldn’t keep up, even when we had idle resources for
> their very short jobs).
>
> > On Jan 31, 2020, at 3:01 PM, David Baker  wrote:
> >
> > External Email Warning
> > This email originated from outside the university. Please use caution
> when opening attachments, clicking links, or responding to requests.
> > Hello,
> >
> > Thank you for your detailed reply. That’s all very useful. I managed to
> mistype our cluster size: there are actually 450 standard 40-core compute
> nodes. What you say is interesting, and so it concerns me that things are
> so bad at the moment.
> >
> > I wondered if you could please give me some more details of how you use
> TRES to throttle user activity. We have applied some limits to throttle
> users, however perhaps not enough or not well enough. So the details of
> what you do would be really appreciated, please.
> >
> > In addition, we do use backfill, however we rarely see nodes being freed
> up in the cluster to make way for high priority work which again concerns
> me. If you could please share your backfill configuration then that would
> be appreciated, please.
> >
> > Finally, which version of Slurm are you running? We are using an early
> release of v18.
> >
> > Best regards,
> > David
> >
> > From: slurm-users  on behalf of
> Renfro, Michael 
> > Sent: 31 January 2020 17:23:05
> > To: Slurm User Community List 
> > Subject: Re: [slurm-users] Longer queuing t

Re: [slurm-users] Longer queuing times for larger jobs

2020-02-04 Thread David Baker
Hello,

Thank you very much again for your comments and the details of your slurm 
configuration. All the information is really useful. We are working on our 
cluster right now and making some appropriate changes. We'll see how we get on 
over the next 24 hours or so.

Best regards,
David

From: slurm-users  on behalf of Renfro, 
Michael 
Sent: 31 January 2020 22:08
To: Slurm User Community List 
Subject: Re: [slurm-users] Longer queuing times for larger jobs

Slurm 19.05 now, though all these settings were in effect on 17.02 until quite 
recently. If I get some detail wrong below, I hope someone will correct me. But 
this is our current working state. We’ve been able to schedule 10-20k jobs per 
month since late 2017, and we successfully scheduled 320k jobs over December 
and January (largely due to one user using some form of automated submission 
for very short jobs).

Basic scheduler setup:

As I’d said previously, we prioritize on fairshare almost exclusively. Most of 
our jobs (molecular dynamics, CFD) end up in a single batch partition, since 
GPU and big-memory jobs have other partitions.

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=10
PriorityWeightAge=1000
PriorityWeightPartition=1
PriorityWeightJobSize=1000
PriorityMaxAge=1-0

TRES limits:

We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser set 
grptresrunmin=cpu=144 — there might be a way of doing this at a higher 
accounting level, but it works as is.

We also force QoS=gpu in each GPU partition’s definition in slurm.conf, and set 
MaxJobsPerUser equal to our total GPU count. That helps prevent users from 
queue-stuffing the GPUs even if they stay well below the 1000 CPU-day TRES 
limit above.

Backfill:

  SchedulerType=sched/backfill
  
SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200

Can’t remember where I found the backfill guidance, but:

- bf_window is set to our maximum job length (30 days) and bf_resolution is set 
to 1.5 days. Most of our users’ jobs are well over 1 day.
- We have had users who didn’t use job arrays, and submitted a ton of small 
jobs at once, thus bf_max_job_user gives the scheduler a chance to start up to 
80 jobs per user each cycle. This also prompted us to increase 
default_queue_depth, so the backfill scheduler would examine more jobs each 
cycle.
- bf_continue should let the backfill scheduler continue where it left off if 
it gets interrupted, instead of having to start from scratch each time.

I can guarantee you that our backfilling was sub-par until we tuned these 
parameters (or at least a few users could find a way to submit so many jobs 
that the backfill couldn’t keep up, even when we had idle resources for their 
very short jobs).

> On Jan 31, 2020, at 3:01 PM, David Baker  wrote:
>
> External Email Warning
> This email originated from outside the university. Please use caution when 
> opening attachments, clicking links, or responding to requests.
> Hello,
>
> Thank you for your detailed reply. That’s all very useful. I managed to 
> mistype our cluster size: there are actually 450 standard 40-core compute 
> nodes. What you say is interesting, and so it concerns me that things are 
> so bad at the moment.
>
> I wondered if you could please give me some more details of how you use TRES 
> to throttle user activity. We have applied some limits to throttle users, 
> however perhaps not enough or not well enough. So the details of what you do 
> would be really appreciated, please.
>
> In addition, we do use backfill, however we rarely see nodes being freed up 
> in the cluster to make way for high priority work which again concerns me. If 
> you could please share your backfill configuration then that would be 
> appreciated, please.
>
> Finally, which version of Slurm are you running? We are using an early 
> release of v18.
>
> Best regards,
> David
>
> From: slurm-users  on behalf of 
> Renfro, Michael 
> Sent: 31 January 2020 17:23:05
> To: Slurm User Community List 
> Subject: Re: [slurm-users] Longer queuing times for larger jobs
>
> I missed reading what size your cluster was at first, but found it on a 
> second read. Our cluster and typical maximum job size scales about the same 
> way, though (our users’ typical job size is anywhere from a few cores up to 
> 10% of our core count).
>
> There are several recommendations to separate your priority weights by an 
> order of magnitude or so. Our weights are dominated by fairshare, and we 
> effectively ignore all other factors.
>
> We also put TRES limits on by default, so that users can’t queue-stuff beyond 
> a certain limit (any jobs totaling under around 1 cluster-

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Renfro, Michael
Slurm 19.05 now, though all these settings were in effect on 17.02 until quite 
recently. If I get some detail wrong below, I hope someone will correct me. But 
this is our current working state. We’ve been able to schedule 10-20k jobs per 
month since late 2017, and we successfully scheduled 320k jobs over December 
and January (largely due to one user using some form of automated submission 
for very short jobs).

Basic scheduler setup:

As I’d said previously, we prioritize on fairshare almost exclusively. Most of 
our jobs (molecular dynamics, CFD) end up in a single batch partition, since 
GPU and big-memory jobs have other partitions.

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=10
PriorityWeightAge=1000
PriorityWeightPartition=1
PriorityWeightJobSize=1000
PriorityMaxAge=1-0

TRES limits:

We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser set 
grptresrunmin=cpu=144 — there might be a way of doing this at a higher 
accounting level, but it works as is.
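
GrpTRESRunMin is counted in TRES-minutes, so 1000 CPU-days comes out to
1000 x 1440 = 1,440,000 CPU-minutes. With a placeholder user name, the full
pair of commands is along the lines of:

  sacctmgr modify user someuser set grptresrunmin=cpu=1440000
  sacctmgr show assoc where user=someuser   # confirm the limit is in place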

We also force QoS=gpu in each GPU partition’s definition in slurm.conf, and set 
MaxJobsPerUser equal to our total GPU count. That helps prevent users from 
queue-stuffing the GPUs even if they stay well below the 1000 CPU-day TRES 
limit above.
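
As a rough sketch of that pairing (the partition name, node names and GPU
count below are placeholders):

  # slurm.conf: tie the partition to the gpu QOS
  PartitionName=gpu Nodes=gpu[001-010] QOS=gpu State=UP

  # QOS side: cap running jobs per user at the total GPU count (say 40)
  sacctmgr modify qos gpu set MaxJobsPerUser=40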

Backfill:

  SchedulerType=sched/backfill
  
SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200

Can’t remember where I found the backfill guidance, but:

- bf_window is set to our maximum job length (30 days) and bf_resolution is set 
to 1.5 days. Most of our users’ jobs are well over 1 day.
- We have had users who didn’t use job arrays, and submitted a ton of small 
jobs at once, thus bf_max_job_user gives the scheduler a chance to start up to 
80 jobs per user each cycle. This also prompted us to increase 
default_queue_depth, so the backfill scheduler would examine more jobs each 
cycle.
- bf_continue should let the backfill scheduler continue where it left off if 
it gets interrupted, instead of having to start from scratch each time.

I can guarantee you that our backfilling was sub-par until we tuned these 
parameters (or at least a few users could find a way to submit so many jobs 
that the backfill couldn’t keep up, even when we had idle resources for their 
very short jobs).

> On Jan 31, 2020, at 3:01 PM, David Baker  wrote:
> 
> External Email Warning
> This email originated from outside the university. Please use caution when 
> opening attachments, clicking links, or responding to requests.
> Hello,
> 
> Thank you for your detailed reply. That’s all very useful. I managed to 
> mistype our cluster size: there are actually 450 standard 40-core compute 
> nodes. What you say is interesting, and so it concerns me that things are 
> so bad at the moment.
> 
> I wondered if you could please give me some more details of how you use TRES 
> to throttle user activity. We have applied some limits to throttle users, 
> however perhaps not enough or not well enough. So the details of what you do 
> would be really appreciated, please.
> 
> In addition, we do use backfill, however we rarely see nodes being freed up 
> in the cluster to make way for high priority work which again concerns me. If 
> you could please share your backfill configuration then that would be 
> appreciated, please.
> 
> Finally, which version of Slurm are you running? We are using an early 
> release of v18.
> 
> Best regards,
> David
> 
> From: slurm-users  on behalf of 
> Renfro, Michael 
> Sent: 31 January 2020 17:23:05
> To: Slurm User Community List 
> Subject: Re: [slurm-users] Longer queuing times for larger jobs
>  
> I missed reading what size your cluster was at first, but found it on a 
> second read. Our cluster and typical maximum job size scales about the same 
> way, though (our users’ typical job size is anywhere from a few cores up to 
> 10% of our core count).
> 
> There are several recommendations to separate your priority weights by an 
> order of magnitude or so. Our weights are dominated by fairshare, and we 
> effectively ignore all other factors.
> 
> We also put TRES limits on by default, so that users can’t queue-stuff beyond 
> a certain limit (any jobs totaling under around 1 cluster-day can be in a 
> running or queued state, and anything past that is ignored until their 
> running jobs burn off some of their time). This allows other users’ jobs to 
> have a chance to run if resources are available, even if they were submitted 
> well after the heavy users’ blocked jobs.
> 
> We also make extensive use of the backfill scheduler to run small, short jobs 
> earlier than their queue time might allow, if and only if they don’t delay 
> other jo

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread David Baker
Hello,

Thank you for your detailed reply. That’s all very useful. I managed to mistype 
our cluster size: there are actually 450 standard 40-core compute nodes. What 
you say is interesting, and so it concerns me that things are so bad at the 
moment.

I wondered if you could please give me some more details of how you use TRES to 
throttle user activity. We have applied some limits to throttle users, however 
perhaps not enough or not well enough. So the details of what you do would be 
really appreciated, please.

In addition, we do use backfill, however we rarely see nodes being freed up in 
the cluster to make way for high priority work which again concerns me. If you 
could please share your backfill configuration then that would be appreciated, 
please.

Finally, which version of Slurm are you running? We are using an early release 
of v18.

Best regards,
David


From: slurm-users  on behalf of Renfro, 
Michael 
Sent: 31 January 2020 17:23:05
To: Slurm User Community List 
Subject: Re: [slurm-users] Longer queuing times for larger jobs

I missed reading what size your cluster was at first, but found it on a second 
read. Our cluster and typical maximum job size scales about the same way, 
though (our users’ typical job size is anywhere from a few cores up to 10% of 
our core count).

There are several recommendations to separate your priority weights by an order 
of magnitude or so. Our weights are dominated by fairshare, and we effectively 
ignore all other factors.

We also put TRES limits on by default, so that users can’t queue-stuff beyond a 
certain limit (any jobs totaling under around 1 cluster-day can be in a running 
or queued state, and anything past that is ignored until their running jobs 
burn off some of their time). This allows other users’ jobs to have a chance to 
run if resources are available, even if they were submitted well after the 
heavy users’ blocked jobs.

We also make extensive use of the backfill scheduler to run small, short jobs 
earlier than their queue time might allow, if and only if they don’t delay 
other jobs. If a particularly large job is about to run, we can see the nodes 
gradually empty out, which opens up lots of capacity for very short jobs.

Overall, our average wait times since September 2017 haven’t exceeded 90 hours 
for any job size, and I’m pretty sure a *lot* of that wait is due to a few 
heavy users submitting large numbers of jobs far beyond the TRES limit. Even 
our jobs of 5-10% cluster size have average start times of 60 hours or less 
(and we've managed under 48 hours for those size jobs for all but 2 months of 
that period), but those larger jobs tend to be run by our lighter users, and 
they get a major improvement to their queue time due to being far below their 
fairshare target.

We’ve been running at >50% capacity since May 2018, and >60% capacity since 
December 2018, and >80% capacity since February 2019. So our wait times aren’t 
due to having a ton of spare capacity for extended periods of time.

Not sure how much of that will help immediately, but it may give you some ideas.

> On Jan 31, 2020, at 10:14 AM, David Baker  wrote:
>
> External Email Warning
> This email originated from outside the university. Please use caution when 
> opening attachments, clicking links, or responding to requests.
> Hello,
>
> Thank you for your reply. In answer to Mike's questions...
>
> Our serial partition nodes are partially shared by the high memory partition. 
> That is, the partitions overlap partially -- shared nodes move one way or 
> another depending upon demand. Jobs requesting up to and including 20 cores 
> are routed to the serial queue. The serial nodes are shared resources. In 
> other words, jobs from different users can share the nodes. The maximum time 
> for serial jobs is 60 hours.
>
> Over time there hasn't been any particular change in the times that users are 
> requesting. Likewise, I'm convinced that the overall job size spread is the 
> same over time. What has changed is the increase in the number of smaller 
> jobs. That is, one-node jobs that are exclusive (can't be routed to the 
> serial queue) or that require more than 20 cores, and also jobs requesting up 
> to 10/15 nodes (let's say). The user base has increased dramatically over the 
> last 6 months or so.
>
> This overpopulation is leading to the delay in scheduling the larger jobs. 
> Given the size of the cluster, we may need to make decisions regarding which 
> types of jobs we allow to "dominate" the system: the larger jobs at the 
> expense of the small fry, for example. However, that is a difficult decision 
> that means that someone has got to wait longer for results.
>
> Best regards,
> David
> From: slurm-users  on behalf of 
> Renfro, Michael 
> Sent: 31 Janu

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Renfro, Michael
I missed reading what size your cluster was at first, but found it on a second 
read. Our cluster and typical maximum job size scales about the same way, 
though (our users’ typical job size is anywhere from a few cores up to 10% of 
our core count).

There are several recommendations to separate your priority weights by an order 
of magnitude or so. Our weights are dominated by fairshare, and we effectively 
ignore all other factors.
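
Purely to illustrate the spacing idea (the numbers here are invented
placeholders, not our actual settings):

  PriorityWeightFairshare=100000   # dominant factor
  PriorityWeightJobSize=10000
  PriorityWeightQOS=10000
  PriorityWeightAge=1000
  PriorityWeightPartition=100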

We also put TRES limits on by default, so that users can’t queue-stuff beyond a 
certain limit (any jobs totaling under around 1 cluster-day can be in a running 
or queued state, and anything past that is ignored until their running jobs 
burn off some of their time). This allows other users’ jobs to have a chance to 
run if resources are available, even if they were submitted well after the 
heavy users’ blocked jobs.

We also make extensive use of the backfill scheduler to run small, short jobs 
earlier than their queue time might allow, if and only if they don’t delay 
other jobs. If a particularly large job is about to run, we can see the nodes 
gradually empty out, which opens up lots of capacity for very short jobs.

Overall, our average wait times since September 2017 haven’t exceeded 90 hours 
for any job size, and I’m pretty sure a *lot* of that wait is due to a few 
heavy users submitting large numbers of jobs far beyond the TRES limit. Even 
our jobs of 5-10% cluster size have average start times of 60 hours or less 
(and we've managed under 48 hours for those size jobs for all but 2 months of 
that period), but those larger jobs tend to be run by our lighter users, and 
they get a major improvement to their queue time due to being far below their 
fairshare target.

We’ve been running at >50% capacity since May 2018, and >60% capacity since 
December 2018, and >80% capacity since February 2019. So our wait times aren’t 
due to having a ton of spare capacity for extended periods of time.

Not sure how much of that will help immediately, but it may give you some ideas.

> On Jan 31, 2020, at 10:14 AM, David Baker  wrote:
> 
> External Email Warning
> This email originated from outside the university. Please use caution when 
> opening attachments, clicking links, or responding to requests.
> Hello,
> 
> Thank you for your reply. In answer to Mike's questions...
> 
> Our serial partition nodes are partially shared by the high memory partition. 
> That is, the partitions overlap partially -- shared nodes move one way or 
> another depending upon demand. Jobs requesting up to and including 20 cores 
> are routed to the serial queue. The serial nodes are shared resources. In 
> other words, jobs from different users can share the nodes. The maximum time 
> for serial jobs is 60 hours. 
> 
> Over time there hasn't been any particular change in the times that users are 
> requesting. Likewise, I'm convinced that the overall job size spread is the 
> same over time. What has changed is the increase in the number of smaller 
> jobs. That is, one-node jobs that are exclusive (can't be routed to the 
> serial queue) or that require more than 20 cores, and also jobs requesting up 
> to 10/15 nodes (let's say). The user base has increased dramatically over the 
> last 6 months or so. 
> 
> This overpopulation is leading to the delay in scheduling the larger jobs. 
> Given the size of the cluster, we may need to make decisions regarding which 
> types of jobs we allow to "dominate" the system: the larger jobs at the 
> expense of the small fry, for example. However, that is a difficult decision 
> that means that someone has got to wait longer for results.
> 
> Best regards,
> David
> From: slurm-users  on behalf of 
> Renfro, Michael 
> Sent: 31 January 2020 13:27
> To: Slurm User Community List 
> Subject: Re: [slurm-users] Longer queuing times for larger jobs
>  
> Greetings, fellow general university resource administrator.
> 
> Couple things come to mind from my experience:
> 
> 1) does your serial partition share nodes with the other non-serial 
> partitions?
> 
> 2) what’s your maximum job time allowed, for serial (if the previous answer 
> was “yes”) and non-serial partitions? Are your users submitting particularly 
> longer jobs compared to earlier?
> 
> 3) are you using the backfill scheduler at all?
> 
> --
> Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services
> 931 372-3601  / Tennessee Tech University
> 
>> On Jan 31, 2020, at 6:23 AM, David Baker  wrote:
>> 
>> Hello,
>> 
>> Our SLURM cluster is relatively small. We have 350 standard compute nodes 
>> each with 40 cores. The largest job that users  can run on the partition is 
>> one requesting 32 nodes. Our cluster is a genera

Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread David Baker
Hello,

Thank you for your reply. In answer to Mike's questions...

Our serial partition nodes are partially shared by the high memory partition. 
That is, the partitions overlap partially -- shared nodes move one way or 
another depending upon demand. Jobs requesting up to and including 20 cores are 
routed to the serial queue. The serial nodes are shared resources. In other 
words, jobs from different users can share the nodes. The maximum time for 
serial jobs is 60 hours.

Over time there hasn't been any particular change in the times that users are 
requesting. Likewise, I'm convinced that the overall job size spread is the same 
over time. What has changed is the increase in the number of smaller jobs. That 
is, one-node jobs that are exclusive (can't be routed to the serial queue) or 
that require more than 20 cores, and also jobs requesting up to 10/15 nodes 
(let's say). The user base has increased dramatically over the last 6 months or 
so.

This overpopulation is leading to the delay in scheduling the larger jobs. 
Given the size of the cluster, we may need to make decisions regarding which 
types of jobs we allow to "dominate" the system: the larger jobs at the expense 
of the small fry, for example. However, that is a difficult decision that means 
that someone has got to wait longer for results.

Best regards,
David

From: slurm-users  on behalf of Renfro, 
Michael 
Sent: 31 January 2020 13:27
To: Slurm User Community List 
Subject: Re: [slurm-users] Longer queuing times for larger jobs

Greetings, fellow general university resource administrator.

Couple things come to mind from my experience:

1) does your serial partition share nodes with the other non-serial partitions?

2) what’s your maximum job time allowed, for serial (if the previous answer was 
“yes”) and non-serial partitions? Are your users submitting particularly longer 
jobs compared to earlier?

3) are you using the backfill scheduler at all?

--
Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services
931 372-3601  / Tennessee Tech University

On Jan 31, 2020, at 6:23 AM, David Baker  wrote:

Hello,

Our SLURM cluster is relatively small. We have 350 standard compute nodes each 
with 40 cores. The largest job that users  can run on the partition is one 
requesting 32 nodes. Our cluster is a general university research resource and 
so there are many different sizes of jobs ranging from single core jobs, that 
get routed to a serial partition via the job-submit.lua, through to jobs 
requesting 32 nodes. When we first started the service, 32 node jobs were 
typically taking in the region of 2 days to schedule -- recently queuing times 
have started to get out of hand. Our setup is essentially...

PriorityFavorSmall=NO
FairShareDampeningFactor=5
PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0

PriorityWeightAge=40
PriorityWeightPartition=1000
PriorityWeightJobSize=50
PriorityWeightQOS=100
PriorityMaxAge=7-0

To try to reduce the queuing times for our bigger jobs should we potentially 
increase the PriorityWeightJobSize factor in the first instance to bump up the 
priority of such jobs? Or should we potentially define a set of QOSs which we 
assign to jobs in our job_submit.lua depending on the size of the job. In other 
words, let's say there is a large QOS that gives the largest jobs a higher 
priority and also limits how many of those jobs a single user can submit?

Your advice would be appreciated, please. At the moment these large jobs are 
not accruing a sufficiently high priority to rise above the other jobs in the 
cluster.

Best regards,
David


Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Renfro, Michael
Greetings, fellow general university resource administrator.

Couple things come to mind from my experience:

1) does your serial partition share nodes with the other non-serial partitions?

2) what’s your maximum job time allowed, for serial (if the previous answer was 
“yes”) and non-serial partitions? Are your users submitting particularly longer 
jobs compared to earlier?

3) are you using the backfill scheduler at all?

--
Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services
931 372-3601  / Tennessee Tech University

On Jan 31, 2020, at 6:23 AM, David Baker  wrote:

Hello,

Our SLURM cluster is relatively small. We have 350 standard compute nodes each 
with 40 cores. The largest job that users  can run on the partition is one 
requesting 32 nodes. Our cluster is a general university research resource and 
so there are many different sizes of jobs ranging from single core jobs, that 
get routed to a serial partition via the job-submit.lua, through to jobs 
requesting 32 nodes. When we first started the service, 32 node jobs were 
typically taking in the region of 2 days to schedule -- recently queuing times 
have started to get out of hand. Our setup is essentially...

PriorityFavorSmall=NO
FairShareDampeningFactor=5
PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0

PriorityWeightAge=40
PriorityWeightPartition=1000
PriorityWeightJobSize=50
PriorityWeightQOS=100
PriorityMaxAge=7-0

To try to reduce the queuing times for our bigger jobs should we potentially 
increase the PriorityWeightJobSize factor in the first instance to bump up the 
priority of such jobs? Or should we potentially define a set of QOSs which we 
assign to jobs in our job_submit.lua depending on the size of the job. In other 
words, let's say there is a large QOS that gives the largest jobs a higher 
priority and also limits how many of those jobs a single user can submit?

Your advice would be appreciated, please. At the moment these large jobs are 
not accruing a sufficiently high priority to rise above the other jobs in the 
cluster.

Best regards,
David


Re: [slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread Loris Bennett
Hi David,
David Baker  writes:

> Hello,
>
> Our SLURM cluster is relatively small. We have 350 standard compute
> nodes each with 40 cores. The largest job that users can run on the
> partition is one requesting 32 nodes. Our cluster is a general
> university research resource and so there are many different sizes of
> jobs ranging from single core jobs, that get routed to a serial
> partition via the job-submit.lua, through to jobs requesting 32
> nodes. When we first started the service, 32 node jobs were typically
> taking in the region of 2 days to schedule -- recently queuing times
> have started to get out of hand. Our setup is essentially...
>
> PriorityFavorSmall=NO 
> FairShareDampeningFactor=5
> PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=7-0
>
> PriorityWeightAge=40
> PriorityWeightPartition=1000
> PriorityWeightJobSize=50
> PriorityWeightQOS=100
> PriorityMaxAge=7-0
>
> To try to reduce the queuing times for our bigger jobs should we
> potentially increase the PriorityWeightJobSize factor in the first
> instance to bump up the priority of such jobs? Or should we
> potentially define a set of QOSs which we assign to jobs in our
> job_submit.lua depending on the size of the job. In other words, let's
> say there is large QOS that give the largest jobs a higher priority,
> and also limits how many of those jobs that a single user can submit?
>
> Your advice would be appreciated, please. At the moment these large
> jobs are not accruing a sufficiently high priority to rise above the
> other jobs in the cluster.

We have always gone for the weighting approach, rather than the QOS
routing one.  I have always thought that QOS routing potentially takes
away some of the users' freedom unnecessarily.  What if someone wants
to submit a large number of 32-node jobs and is perfectly happy to wait
a (long) while?  We have QOSs with higher priorities, but with
restricted MaxWall, MaxJobs, MaxSubmit, MaxTRESPU, and users have to
request them explicitly.
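
As a rough sketch of that kind of QOS (every name and limit below is just a
placeholder):

  sacctmgr add qos hipri
  sacctmgr modify qos hipri set Priority=10000 MaxWall=2-00:00:00 \
      MaxJobsPerUser=4 MaxSubmitJobsPerUser=8 MaxTRESPerUser=cpu=1280
  # grant access where needed, e.g.: sacctmgr modify user someuser set qos+=hipri

Users then have to ask for it explicitly, e.g. with "sbatch --qos=hipri".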

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



[slurm-users] Longer queuing times for larger jobs

2020-01-31 Thread David Baker
Hello,

Our SLURM cluster is relatively small. We have 350 standard compute nodes each 
with 40 cores. The largest job that users  can run on the partition is one 
requesting 32 nodes. Our cluster is a general university research resource and 
so there are many different sizes of jobs ranging from single core jobs, that 
get routed to a serial partition via the job-submit.lua, through to jobs 
requesting 32 nodes. When we first started the service, 32 node jobs were 
typically taking in the region of 2 days to schedule -- recently queuing times 
have started to get out of hand. Our setup is essentially...

PriorityFavorSmall=NO
FairShareDampeningFactor=5
PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0

PriorityWeightAge=40
PriorityWeightPartition=1000
PriorityWeightJobSize=50
PriorityWeightQOS=100
PriorityMaxAge=7-0

To try to reduce the queuing times for our bigger jobs should we potentially 
increase the PriorityWeightJobSize factor in the first instance to bump up the 
priority of such jobs? Or should we potentially define a set of QOSs which we 
assign to jobs in our job_submit.lua depending on the size of the job. In other 
words, let's say there is a large QOS that gives the largest jobs a higher 
priority and also limits how many of those jobs a single user can submit?
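
Something along these lines in job_submit.lua is the sort of thing I have in
mind (the QOS name, the 16-node threshold and the upper sanity bound are
placeholders; the two entry points are the standard ones for the Lua plugin):

  -- route sufficiently large jobs to a "large" QOS
  function slurm_job_submit(job_desc, part_list, submit_uid)
      local n = job_desc.min_nodes
      -- unset numeric fields show up as huge sentinel values, hence the upper bound
      if n ~= nil and n >= 16 and n <= 1024 then
          job_desc.qos = "large"
      end
      return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
      return slurm.SUCCESS
  end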

Your advice would be appreciated, please. At the moment these large jobs are 
not accruing a sufficiently high priority to rise above the other jobs in the 
cluster.

Best regards,
David