Re: [slurm-users] Large job starvation on cloud cluster

2019-02-28 Thread Michael Gutteridge
It's the association (account) limit. The problem is that lower-priority jobs were backfilling around this larger job (even with the builtin scheduler), preventing it from ever running. I have found what looks like the solution: I need to switch to the builtin scheduler and add "assoc_limit_stop" to SchedulerParameters, as sketched below.
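
A minimal slurm.conf sketch of that change, assuming nothing else about the existing scheduler configuration:

    # slurm.conf (fragment, illustrative only)
    SchedulerType=sched/builtin
    # assoc_limit_stop: if a job cannot start because of an association or QOS
    # limit, do not start any lower-priority jobs in that partition behind it
    SchedulerParameters=assoc_limit_stop

Note that changing SchedulerType requires restarting slurmctld for the new scheduler to take effect.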

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-28 Thread Chris Samuel
On 28/2/19 7:29 am, Michael Gutteridge wrote:

    2221670 largenode sleeper.  me PD  N/A  1 (null)  (AssocGrpCpuLimit)

That says the job exceeds some policy limit you have set and so is not permitted to start; it looks like you've got a limit on the number of cores the association may use.
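
A hedged sketch of how one might inspect that limit; the account and user names below are placeholders, not taken from the thread:

    # Pending reason for the job in question
    squeue -j 2221670 -o "%.10i %.12P %.8u %.2t %.20r"

    # Group TRES limits (e.g. cpu=N) on the relevant association
    sacctmgr show assoc where account=somelab user=someuser \
        format=Cluster,Account,User,Partition,GrpTRES%30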

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-28 Thread Michael Gutteridge
sprio --long shows:

    JOBID    PARTITION  USER   PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE  TRES ...
    2203317  largenode  alice  110       10   0          0        0          100  0
    2203318  largenode  alice  110       10   0          0 ...
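
The per-factor weights behind those columns can be checked with something like the following (a generic sketch, not from the original message):

    # Show the configured multifactor priority weights
    sprio --weights
    scontrol show config | grep -i PriorityWeight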

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-28 Thread Michael Gutteridge
> You might want to look at the BatchStartTimeout parameter

I've got that set to 300 seconds. Every so often one node here and there won't start and gets "ResumeTimeoutExceeded", but we're not seeing those failures associated with this situation (i.e. nothing in that state in this particular partition). > wha
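
For reference, a minimal slurm.conf sketch of the power-saving timeouts under discussion; apart from BatchStartTimeout=300 mentioned above, the values and program paths are illustrative assumptions:

    # slurm.conf (cloud power-saving fragment, illustrative)
    ResumeProgram=/usr/local/sbin/slurm_resume     # hypothetical script that boots cloud nodes
    SuspendProgram=/usr/local/sbin/slurm_suspend   # hypothetical script that powers them down
    ResumeTimeout=600        # assumed value: seconds allowed for a node to power up before it is marked failed
    SuspendTime=600          # assumed value: idle seconds before a cloud node is powered down
    BatchStartTimeout=300    # from the message above: seconds to wait for a batch job's node(s) to be ready at launch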