Re: [slurm-users] Large job starvation on cloud cluster

2019-02-28 Thread Michael Gutteridge
It's the association (account) limit. The problem was that lower-priority jobs were backfilling (even with the builtin scheduler) around this larger job, preventing it from running. I have found what looks like the solution: I need to switch to the builtin scheduler and add "assoc_limit_stop" to…
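A minimal slurm.conf sketch of the change described above; sched/builtin and assoc_limit_stop are from the thread, and everything else in a real configuration is omitted:

    # slurm.conf (fragment, illustrative)
    SchedulerType=sched/builtin
    # assoc_limit_stop: once a job is blocked by an association/QOS limit,
    # do not start lower-priority jobs in that partition around it
    SchedulerParameters=assoc_limit_stop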

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-28 Thread Chris Samuel
On 28/2/19 7:29 am, Michael Gutteridge wrote: 2221670 largenode sleeper. me PD N/A 1 (null) (AssocGrpCpuLimit) That says the job exceeds some policy limit you have set and so is not permitted to start; it looks like you've got a limit on the number of cores…
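The AssocGrpCpuLimit reason corresponds to a GrpTRES=cpu limit on the job's association. A sketch of inspecting and raising that limit with sacctmgr; the account name and the value are placeholders:

    # Show the group TRES limits on the association
    sacctmgr show assoc where account=somelab format=Account,User,GrpTRES
    # Raise the group CPU cap (value illustrative)
    sacctmgr modify account where name=somelab set GrpTRES=cpu=512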

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-28 Thread Michael Gutteridge
sprio --long shows: JOBID PARTITION USER PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS NICE TRES ... 2203317 largenode alice 110 10 0 0 0 100 0 2203318 largenode alice 110 10 0 0…
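For comparison, a sketch of pulling the same per-factor breakdown for specific jobs (job IDs taken from the excerpt above):

    # Per-factor priority contributions for the two jobs shown
    sprio --long --jobs=2203317,2203318
    # The weights currently applied to each factor
    sprio --weights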

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-28 Thread Michael Gutteridge
> You might want to look at BatchStartTimeout Parameter. I've got that set to 300 seconds. Every so often one node here and there won't start and gets "ResumeTimeoutExceeded", but we're not seeing those associated with this situation (i.e. nothing in that state in this particular partition) > what…
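A sketch of the power-save timing parameters under discussion; the 300-second value is from the thread, the other values are purely illustrative:

    # slurm.conf (fragment, illustrative)
    BatchStartTimeout=300   # seconds to wait for a batch job's nodes to become usable
    ResumeTimeout=600       # seconds a powered-down node may take to resume before
                            # it is marked DOWN
    SuspendTime=300         # idle seconds before a cloud node is powered down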

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Chris Samuel
On Wednesday, 27 February 2019 1:08:56 PM PST Michael Gutteridge wrote: > Yes, we do have time limits set on partitions - 7 days maximum, 3 days default. In this case, the larger job is requesting 3 days of walltime, the smaller jobs are requesting 7. It sounds like no forward reservation is…
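A sketch of how those two walltime requests look at submission time; the script names are placeholders:

    # the larger job requests 3 days of walltime
    sbatch --time=3-00:00:00 large_job.sh
    # the smaller jobs request the 7-day partition maximum
    sbatch --time=7-00:00:00 small_job.sh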

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Thomas M. Payerle
I am not very familiar with the Slurm power saving stuff. You might want to look at the BatchStartTimeout parameter (see e.g. https://slurm.schedmd.com/power_save.html). Otherwise, what state are the Slurm power-saving powered-down nodes in when powered down? From the man pages it sounds like they should be idle…
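A sketch of checking what state the powered-down cloud nodes are actually in; the partition name is from the thread, the node name is a placeholder:

    # Power-saved nodes show a "~" suffix on their state, e.g. "idle~"
    sinfo --partition=largenode --Node --long
    # Full detail, including the reason string, for one node
    scontrol show node somenode01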

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Michael Gutteridge
> You have not provided enough information (cluster configuration, job information, etc) to diagnose what accounting policy is being violated. Yeah, sorry. I'm trying to balance the amount of information and likely skewed too concise 8-/ The partition looks like: PartitionName=largenode Allo…
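The partition definition is cut off by the archive; a quick way to dump the live settings being discussed:

    # Print the full partition definition as slurmctld sees it
    scontrol show partition largenode
    # and the scheduler settings relevant to the earlier discussion
    scontrol show config | grep -i scheduler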

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Michael Gutteridge
> …before the job that requires infinite time. > Andy

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Thomas M. Payerle
The "JobId=2210784 delayed for accounting policy is likely the key as it indicates the job is currently unable to run, so the lower priority smaller job bumps ahead of it. You have not provided enough information (cluster configuration, job information, etc) to diagnose what accounting policy is be

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Andy Riebs
…remaining nodes will be able to finish before the job that requires infinite time. Andy

[slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Michael Gutteridge
I've run into a problem with a cluster we've got in a cloud provider - hoping someone might have some advice. The problem is that I've got a circumstance where large jobs _never_ start... or more correctly, that larger jobs don't start when there are many smaller jobs in the partition. In this c…