Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Paul Edmon
We've been using a backfill priority partition for people doing HTC 
work.  We have requeue set so that jobs from the high priority 
partitions can take over.


You can do this for your interactive nodes as well if you want. We 
dedicate hardware to interactive work and use partition-based QOSes to 
limit usage.
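
Roughly, the slurm.conf side of that looks something like the following 
sketch (partition names, node lists, and the QOS limits are illustrative, 
not our actual values):

    # Requeue lower-tier jobs when higher-tier partitions need the nodes
    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE

    # High-priority partitions and the backfill partition share the same nodes;
    # backfill jobs get requeued, high-priority jobs are never preempted
    PartitionName=high     Nodes=node[001-100] PriorityTier=10 PreemptMode=off
    PartitionName=backfill Nodes=node[001-100] PriorityTier=1  PreemptMode=requeue

    # Dedicated interactive hardware, limited through a partition QOS
    PartitionName=interactive Nodes=int[01-04] QOS=interactive
    # with the QOS defined once in the accounting database, e.g.:
    #   sacctmgr add qos interactive
    #   sacctmgr modify qos interactive set MaxTRESPerUser=cpu=8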


-Paul Edmon-


On 05/08/2018 10:08 AM, Renfro, Michael wrote:

That’s the first limit I placed on our cluster, and it has generally worked out 
well (never used a job limit). A single account can get 1000 CPU-days in 
whatever distribution they want. I’ve just added a root-only ‘expedited’ QOS 
for times when the cluster is mostly idle, but a few users have jobs that run 
past the TRES limit. But I really like the idea of a preemptable QOS that the 
users can put their extra jobs into on their own.






Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Renfro, Michael
That’s the first limit I placed on our cluster, and it has generally worked out 
well (never used a job limit). A single account can get 1000 CPU-days in 
whatever distribution they want. I’ve just added a root-only ‘expedited’ QOS 
for times when the cluster is mostly idle, but a few users have jobs that run 
past the TRES limit. But I really like the idea of a preemptable QOS that the 
users can put their extra jobs into on their own.
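
For reference, one way a cap like that can be expressed (the account and 
QOS names are placeholders, and this is only an approximation of the 
actual setup):

    # Cap each account's *running* work at 1000 CPU-days
    # (1000 days x 1440 minutes = 1,440,000 CPU-minutes)
    sacctmgr modify account where name=some_account set GrpTRESRunMins=cpu=1440000

    # Separate 'expedited' QOS, granted only to root's association, for the
    # occasional job that should be let through when the cluster is idle
    sacctmgr add qos expedited
    sacctmgr modify user where name=root set qos+=expedited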

-- 
Mike Renfro  / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On May 8, 2018, at 2:37 AM, Yair Yarom  wrote:
> 
> we are considering setting up maximum allowed TRES resources, and not number 
> of jobs.



Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Ole Holm Nielsen

On 05/08/2018 09:49 AM, John Hearns wrote:
Actually what IS bad is users not putting cluster resources to good use. 
You can often see jobs which are 'stalled', i.e. the nodes are reserved 
for the job, but the internal logic of the job has failed and the 
executables have not launched. Or maybe some user is running an 
interactive job and has wandered off for coffee/beer/an extended 
holiday. It is well worth scanning for stalled jobs and terminating them.


I agree, and the way I monitor our cluster for jobs that do little or no 
useful work is through my small utility "pestat" available from 
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat


I run "pestat -F" many times every day to spot inefficient jobs.

If I want to list the user processes belonging to a job, I use "psjob 
<jobid>".  I notify users and possibly cancel their jobs using the 
"notifybadjob <jobid>" script.  These tools are available at 
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
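
A typical check looks something like this (the job id is just an example):

    pestat -F            # flag nodes where jobs use far less CPU than allocated
    psjob 1234567        # list the user processes belonging to job 1234567
    notifybadjob 1234567 # notify the owner of job 1234567, possibly cancelling it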


/Ole



Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread John Hearns
"Otherwise a user can have a sing le job that takes the entire cluster,
and insidesplit it up the way he wants to."
Yair, I agree. That is what I was referring to regardign interactive jobs.
Perhaps not a user reserving the entire cluster,
but a use reserving a lot of compute nodes and not making sure they are
utilised fully.

On 8 May 2018 at 09:37, Yair Yarom  wrote:

> Hi,
>
> This is what we did, not sure those are the best solutions :)
>
> ## Queue stuffing
>
> We have set PriorityWeightAge several magnitudes lower than
> PriorityWeightFairshare, and we also have PriorityMaxAge set to cap the
> age factor of older jobs. As I see it, the fairshare is far more important than age.
>
> Besides the MaxJobs that was suggested, we are considering setting up
> maximum allowed TRES resources, and not number of jobs. Otherwise a
> user can have a single job that takes the entire cluster, and inside
> split it up the way he wants to. As mentioned earlier, it will create
> an issue where jobs are pending and there are idle resources, but for
> that we have a special preempt-able "requeue" account/qos which users
> can use but the jobs there will be killed when "real" jobs arrive.
>
> ## Interactive job availability
>
> We have two partitions: short and long. They are indeed fixed where
> the short is on 100% of the cluster and the long is about 50%-80% of
> the cluster (depending on the cluster).
>
>


Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Yair Yarom
Hi,

This is what we did, not sure those are the best solutions :)

## Queue stuffing

We have set PriorityWeightAge several magnitudes lower than
PriorityWeightFairshare, and we also have PriorityMaxAge set to cap the
age factor of older jobs. As I see it, the fairshare is far more
important than age.
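
Roughly, the relevant slurm.conf lines look like this (the exact weights
here are illustrative, not our real values):

    PriorityType=priority/multifactor
    PriorityWeightFairshare=100000   # dominant factor
    PriorityWeightAge=1000           # several magnitudes lower
    PriorityMaxAge=7-0               # age factor stops growing after 7 days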

Besides the MaxJobs that was suggested, we are considering setting up
maximum allowed TRES resources, and not number of jobs. Otherwise a
user can have a single job that takes the entire cluster, and inside
split it up the way he wants to. As mentioned earlier, it will create
an issue where jobs are pending and there are idle resources, but for
that we have a special preempt-able "requeue" account/qos which users
can use but the jobs there will be killed when "real" jobs arrive.
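
The preemption side of that is roughly (modes and names illustrative):

    # slurm.conf: preempt on QOS, requeue the preempted jobs
    PreemptType=preempt/qos
    PreemptMode=REQUEUE

    # accounting: the normal QOS may preempt jobs running in the "requeue" QOS
    sacctmgr add qos requeue
    sacctmgr modify qos normal set Preempt=requeue
    # users opt in with e.g.:  sbatch --qos=requeue job.sh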

## Interactive job availability

We have two partitions: short and long. They are indeed fixed, where
the short one covers 100% of the cluster and the long one about 50%-80%
of the cluster (depending on the cluster).
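
In slurm.conf terms it's something like (node ranges and time limits
here are made up):

    PartitionName=short Nodes=node[001-100] MaxTime=1-00:00:00  Default=YES
    PartitionName=long  Nodes=node[001-070] MaxTime=14-00:00:00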



Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Ole Holm Nielsen

On 05/08/2018 08:44 AM, Bjørn-Helge Mevik wrote:

Jonathon A Anderson  writes:


## Queue stuffing


There is the bf_max_job_user SchedulerParameter, which is sort of the
"poor man's MAXIJOB"; it limits the number of jobs from each user the
backfiller will try to start on each run.  It doesn't do exactly what
you want, but at least the backfiller will not create reservations for
_all_ the queue stuffer's jobs.


Adding to this, I discuss backfilling configuration in
https://wiki.fysik.dtu.dk/niflheim/Slurm_scheduler#scheduler-configuration

The MaxJobCount limit etc. is described in
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#maxjobcount-limit

/Ole



Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Bjørn-Helge Mevik
Jonathon A Anderson  writes:

> ## Queue stuffing

There is the bf_max_job_user SchedulerParameter, which is sort of the
"poor man's MAXIJOB"; it limits the number of jobs from each user the
backfiller will try to start on each run.  It doesn't do exactly what
you want, but at least the backfiller will not create reservations for
_all_ the queue stuffer's jobs.
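
For example (the value is arbitrary):

    # slurm.conf: backfill tries to start at most 20 jobs per user per cycle
    SchedulerParameters=bf_max_job_user=20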

Also, there is the sacctmgr setting

MaxJobs=
   Maximum number of jobs each user is allowed to run at one time in this
   association.  This is overridden if set directly on a user.  Default is
   the cluster’s limit.  To clear a previously set value use the modify
   command with a new value of -1.
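
E.g.:

    # limit a user to 50 concurrently running jobs in this association
    sacctmgr modify user where name=someuser account=someaccount set MaxJobs=50
    # clear the limit again later
    sacctmgr modify user where name=someuser account=someaccount set MaxJobs=-1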

Another alternative is to get the user to use array jobs, where (s)he
can specify how many jobs are to be allowed to run at the same time, but
of course this means the user must cooperate. :)
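
In the job script the throttle looks like this (the task command is just
a placeholder):

    #!/bin/bash
    #SBATCH --array=1-1000%20    # 1000 tasks, at most 20 running at any time
    #SBATCH --time=01:00:00
    srun ./my_task "${SLURM_ARRAY_TASK_ID}"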

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-07 Thread Ryan Novosielski
One of these TRES-related ones in a QOS ought to do it:

https://slurm.schedmd.com/resource_limits.html

Your problem there, though, is that you will eventually have stuff waiting to 
run even when the system is idle. We had the same circumstance and the same 
eventual outcome.
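
Something along these lines, for instance (QOS name and numbers are placeholders):

    sacctmgr add qos htc
    sacctmgr modify qos htc set MaxTRESPerUser=cpu=256 GrpTRES=cpu=1024
    # users then submit with:  sbatch --qos=htc job.sh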

--
Ryan Novosielski - novos...@rutgers.edu
Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
Office of Advanced Research Computing - MSB C630, Newark
Rutgers, the State University of NJ

On May 8, 2018, at 00:43, Jonathon A Anderson wrote:

We have two main issues with our scheduling policy right now. The first is an 
issue that we call "queue stuffing." The second is an issue with interactive 
job availability. We aren't confused about why these issues exist, but we 
aren't sure the best way to address them.

I'd love to hear any suggestions on how other sites address these issues. 
Thanks for any advice!


## Queue stuffing

We use multifactor scheduling to provide account-based fairshare scheduling as 
well as standard fifo-style job aging. In general, this works pretty well, and 
accounts meet their scheduling targets; however, every now and again, we have a 
user who has a relatively high-throughput (not HPC) workload that they're 
willing to wait a significant period of time for. They're low-priority work, 
but they put a few thousand jobs into the queue, and just sit and wait. 
Eventually the job aging makes the jobs so high-priority, compared to the 
fairshare, that they all _as a set_ become higher-priority than the rest of the 
work on the cluster. Since they continue to age as the other jobs continue to 
age, these jobs end up monopolizing the cluster for days at a time, as their 
high volume of relatively small jobs use up a greater and greater percentage of 
the machine.

In Moab I'd address this by limiting the number of jobs the user could have 
*eligible* at any given time; but it appears that the only option for slurm is 
limiting the number of jobs a user can *submit*, which isn't as nice a user 
experience and can lead to some pathological user behaviors (like users running 
cron jobs that wake repeatedly and submit more jobs automatically).
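
(For reference, the submit-side limit I mean is the association-level 
MaxSubmitJobs, e.g. with an arbitrary value:

    sacctmgr modify user where name=someuser set MaxSubmitJobs=500

which caps pending plus running jobs rather than eligible jobs.)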


## Interactive job availability

I'm becoming increasingly convinced that holding some portion of our resource 
aside as dedicated for relatively short, small, interactive jobs is a unique 
good; but I'm not sure how best to implement it. My immediate thought was to 
use a reservation with the DAILY and REPLACE flags. I particularly like the 
idea of using the REPLACE flag here as we could keep a flexible amount of 
resources available irrespective of how much was actually being used for the 
purpose at any given time; but it doesn't appear that there's any way to limit 
the per-user use of resources *within* a reservation; so if we created such a 
reservation and granted all users access to it, any individual user would be 
capable of consuming all resources in the reservation anyway. I'd have a 
dedicated "interactive" qos or similar to put such restrictions on; but there 
doesn't appear to be a way to then limit the use of the reservation to only 
jobs with that qos. (Aside from job_submit scripts or similar. Please correct 
me if I'm wrong.)
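
(For illustration, the kind of reservation I was considering, with made-up 
times, node count, and user list:

    scontrol create reservation ReservationName=interactive \
        StartTime=08:00:00 Duration=12:00:00 NodeCnt=4 \
        Flags=DAILY,REPLACE Users=alice,bob
    # jobs would then request it with:  srun --reservation=interactive ...

but, as described above, nothing within the reservation itself keeps one of 
those users from filling it.)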

In lieu of that, I'm leaning towards having a dedicated interactive partition 
that we'd manually move some resources to; but that's a bit less flexible.