I appreciate the write-ups, thanks. Has anyone using GrpCPURunMins (or any of the similar limits, really) run into unanticipated negatives worth reporting?
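For context, GrpCPURunMins caps the total cores × remaining minutes of a group's running jobs. A minimal Python sketch of that mechanic (the function names and the limit value are illustrative, not part of Slurm):

```python
# Sketch of the GrpCPURunMins idea: a new job can start only if the
# cores * remaining-minutes of everything already running, plus its own
# cores * requested walltime, stays under the group limit.
# Function names and numbers are illustrative, not part of Slurm.

def used_cpu_run_mins(running_jobs):
    """Sum cores * remaining minutes over running jobs [(cores, mins_left), ...]."""
    return sum(cores * mins_left for cores, mins_left in running_jobs)

def job_fits(running_jobs, cores, walltime_mins, limit):
    """Would a new job fit under a GrpCPURunMins-style limit right now?"""
    return used_cpu_run_mins(running_jobs) + cores * walltime_mins <= limit

# Example with a limit of 1,000,000 core-minutes:
running = [(128, 120), (16, 60 * 24 * 30)]  # 128 cores/2h left, 16 cores/30d left
print(job_fits(running, 128, 120, 1_000_000))           # wide but short job
print(job_fits(running, 128, 60 * 24 * 60, 1_000_000))  # wide job for 2 months
```

This is what lets the limit distinguish 128 cores for 2 hours from 128 cores for 2 months: the first costs 15,360 core-minutes of "area", the second over 11 million.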
--
Ryan Novosielski - novos...@rutgers.edu
Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
Rutgers, the State University of NJ
Office of Advanced Research Computing - MSB C630, Newark

On Jul 24, 2017, at 19:10, Ryan Cox <ryan_...@byu.edu> wrote:

Corey,

We almost exclusively use GrpCPURunMins, together with 3- or 7-day walltime limits depending on the partition. For my (somewhat rambling) thoughts on the matter, see http://tech.ryancox.net/2014/04/scheduler-limit-remaining-cputime-per.html. It generally works pretty well.

We also have https://marylou.byu.edu/simulation/grpcpurunmins.php to simulate various settings, though it needs some improvement, such as a realistic maximum.

sshare -l (TRESRunMins) should have the live stats you're looking for.

Ryan

On 07/24/2017 02:39 PM, Corey Keasling wrote:

Hi Slurm-Dev,

I'm currently designing and testing what will ultimately be a small Slurm cluster of about 60 heterogeneous nodes (five different generations of hardware). Our user base is also diverse, with needs ranging from fast turnover of small sequential jobs to long-duration parallel codes (e.g., 16 cores for several months). In the past we limited users by how many cores they could allocate at any one time.
This has the drawback that no distinction is made between, say, 128 cores for 2 hours and 128 cores for 2 months. We want users to be able to run on a large portion of the cluster when it is available, while ensuring that they cannot take advantage of an idle period to start jobs that will monopolize it for weeks. Limiting by GrpCPURunMins seems like a good answer: I think of it as allocating computational area (i.e., cores*minutes) rather than just width (cores). I'd love to know if anyone has experience or thoughts on imposing limits this way.

Also, is anyone aware of a simple way to calculate remaining "area"? I can use squeue or sacct to derive how much of a limit is in use by looking at remaining walltimes and core counts, but if something built-in - or pre-existing - does this, it would be nice to know.

It's worth noting that the cluster is divided into several partitions, with most nodes belonging to more than one. This is partly political (to give groups increased priority on nodes they helped pay for) and partly practical (to make users explicitly request slow nodes rather than having their jobs silently dumped on ancient Opterons). Also, each user gets their own Account, so the QoS Grp limits apply to each person separately. Accounts would also have absolute core limits.

Thank you for your thoughts!

Corey

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
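On Corey's question about calculating in-use "area": besides sshare -l, one can sum cores × remaining walltime from squeue output. A hedged sketch, assuming squeue is invoked as `squeue -h -o "%C %L"` (CPU count and time left); the parsing helpers are mine, not part of Slurm:

```python
# Sum cores * remaining minutes from lines in squeue's "%C %L" format,
# e.g. "16 2-04:00:00" (16 cores, 2 days 4 hours left). These helpers are
# an illustrative sketch and assume well-formed time fields.

def timeleft_to_mins(s):
    """Parse Slurm's time-left format: [days-]HH:MM:SS or MM:SS."""
    days, rest = (s.split("-", 1) if "-" in s else ("0", s))
    parts = [int(p) for p in rest.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)          # pad to [hours, minutes, seconds]
    h, m, _sec = parts              # seconds ignored for this estimate
    return int(days) * 1440 + h * 60 + m

def area_in_use(squeue_lines):
    """Total cores * remaining minutes over squeue -h -o "%C %L" output lines."""
    total = 0
    for line in squeue_lines:
        cores, left = line.split()
        total += int(cores) * timeleft_to_mins(left)
    return total

# Example with sample output (in real use, feed in squeue -h -o "%C %L" lines):
sample = ["16 2-04:00:00", "128 1:30:00"]
print(area_in_use(sample))  # 16*3120 + 128*90 = 61440 core-minutes
```

Subtracting that total from the GrpCPURunMins value gives the remaining "area"; on newer Slurm releases the same live number appears in the TRESRunMins column of sshare -l, as Ryan notes above.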