Re: [slurm-users] Simple free for all cluster

Chris Samuel Sat, 10 Oct 2020 16:21:34 -0700

On Tuesday, 6 October 2020 7:53:02 AM PDT Jason Simms wrote:

> I currently don't have a MaxTime defined, because how do I know how long a
> job will take? Most jobs on my cluster require no more than 3-4 days, but
> in some cases at other campuses, I know that jobs can run for weeks. I
> suppose even setting a time limit such as 4 weeks would be overkill, but at
> least it's not infinite. I'm curious what others use as that value, and how
> you arrived at it


My journey over the last 16 years in HPC has been one of decreasing time
limits. Back in 2003, with VPAC's first Linux cluster, we had no time limits.
We then introduced a 90 day limit so we could plan quarterly maintenance (and
yes, we had users whose jobs legitimately ran longer than that, so they had
to learn to checkpoint). At VLSCI we had 30 day limits (life sciences, so many
long-running, poorly scaling jobs), then when I was at Swinburne it was a
7 day limit, and now here at NERSC we've got 2 day limits.

It really is down to what your use cases are and how much influence you have 
over your users.  It's often the HPC sysadmin's responsibility to try and find
that balance between good utilisation, effective use of the system and reaching 
the desired science/research/development outcomes.

Best of luck!
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
