Re: [slurm-users] SLURM docs: HTML title should be same as page title

2019-02-27 Thread Chris Samuel
On Monday, 25 February 2019 2:55:44 AM PST Patrice Peterson wrote: > Filed a bug: https://bugs.schedmd.com/show_bug.cgi?id=6573 Looks like Danny fixed it in git. https://github.com/SchedMD/slurm/commit/b1c78d9934ef461df637c57c001eb165a6b1fcc3 -- Chris Samuel : http://www.csamuel.org/ :

Re: [slurm-users] sacct end time for failed jobs

2019-02-27 Thread Chris Samuel
On Tuesday, 26 February 2019 10:03:34 AM PST Brian Andrus wrote: > One thing I have noticed is that the END field for jobs with a state of > FAILED is "Unknown", but the ELAPSED field has the time it ran. That shouldn't happen; it works fine here (and where I've used Slurm in Australia). $
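
For comparison, a minimal sacct invocation that shows both fields side by side (the job ID is just a placeholder):

    sacct -j 1234 --format=JobID,State,Start,End,Elapsed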

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Chris Samuel
On Wednesday, 27 February 2019 1:08:56 PM PST Michael Gutteridge wrote: > Yes, we do have time limits set on partitions: 7 days maximum, 3 days > default. In this case, the larger job is requesting 3 days of walltime, > and the smaller jobs are requesting 7. It sounds like no forward reservation is
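
For context, the backfill scheduler is what builds that forward reservation for a pending large job; a minimal slurm.conf sketch, with illustrative parameter values only:

    # Backfill builds future-start reservations for pending jobs
    SchedulerType=sched/backfill
    # Illustrative values: planning window (minutes), keep scanning the queue,
    # and the time resolution used when placing future reservations
    SchedulerParameters=bf_window=10080,bf_continue,bf_resolution=300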

Re: [slurm-users] Fwd: a heterogeneous job terminates unexpectedly

2019-02-27 Thread Chris Samuel
On Wednesday, 27 February 2019 5:06:37 PM PST hu...@sugon.com wrote: > I have a cluster with 9 nodes (cmbc[1530-1538]); each node has 2 > CPUs and each CPU has 32 cores, but when I submitted a heterogeneous job > twice, the second job terminated unexpectedly. Does this work if you use

[slurm-users] Fwd: a heterogeneous job terminates unexpectedly

2019-02-27 Thread hu...@sugon.com
Hi there, I have a cluster with 9 nodes (cmbc[1530-1538]); each node has 2 CPUs and each CPU has 32 cores. When I submitted a heterogeneous job twice, the second job terminated unexpectedly. This problem has been bothering me all day. The Slurm version is 18.08.5 and here is the job
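
For reference, a minimal heterogeneous ("pack") job script in the 18.08 syntax; the node and task counts here are illustrative, not taken from the report:

    #!/bin/bash
    #SBATCH --nodes=1 --ntasks=32     # first pack group
    #SBATCH packjob
    #SBATCH --nodes=2 --ntasks=64     # second pack group
    # Launch across both components of the heterogeneous job
    srun --pack-group=0,1 hostname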

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Thomas M. Payerle
I am not very familiar with the Slurm power saving stuff. You might want to look at the BatchStartTimeout parameter (see e.g. https://slurm.schedmd.com/power_save.html). Otherwise, what state are the power-saving nodes in when powered down? From the man pages it sounds like they should be
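
For reference, the power-saving knobs being discussed live in slurm.conf; a minimal sketch with illustrative values and hypothetical script paths:

    # Suspend idle cloud nodes and resume them when jobs need them
    SuspendProgram=/usr/local/sbin/node_suspend.sh   # hypothetical path
    ResumeProgram=/usr/local/sbin/node_resume.sh     # hypothetical path
    SuspendTime=600          # seconds a node must sit idle before power-down
    ResumeTimeout=600        # seconds a resumed node has to rejoin the cluster
    BatchStartTimeout=600    # extra slack before a batch launch is treated as failed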

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Michael Gutteridge
> You have not provided enough information (cluster configuration, job information, etc.) to diagnose what accounting policy is being violated. Yeah, sorry. I'm trying to balance the amount of information and likely erred on the side of too concise 8-/ The partition looks like: PartitionName=largenode
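
One quick way to dump the effective settings for that partition (output omitted here):

    scontrol show partition largenode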

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Michael Gutteridge
Yes, we do have time limits set on partitions: 7 days maximum, 3 days default. In this case, the larger job is requesting 3 days of walltime and the smaller jobs are requesting 7. Thanks M On Wed, Feb 27, 2019 at 12:41 PM Andy Riebs wrote: > Michael, are you setting time limits for the jobs?

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Thomas M. Payerle
The "JobId=2210784 delayed for accounting policy" message is likely the key, as it indicates the job is currently unable to run, so the lower-priority smaller job bumps ahead of it. You have not provided enough information (cluster configuration, job information, etc.) to diagnose what accounting policy is
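
A couple of commands that usually reveal which limit is involved (job ID taken from the message above; the sacctmgr format fields are just one reasonable selection):

    # Show the scheduler's stated reason for the pending job
    squeue -j 2210784 -o "%i %T %r"
    # Show association limits that could trigger the "accounting policy" delay
    sacctmgr show assoc format=cluster,account,user,grptres,maxtres,maxjobs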

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Andy Riebs
Michael, are you setting time limits for the jobs? That's a huge part of a scheduler's decision about whether another job can be run. For example, if a job is running with the Slurm default of "infinite," the scheduler will likely decide that jobs that will fit in the remaining nodes will be
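
For example, a job submitted with an explicit walltime gives backfill something concrete to plan around (the limit, node count, and script name are placeholders):

    sbatch --time=3-00:00:00 --nodes=18 my_large_job.sh    # hypothetical script and sizes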

[slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Michael Gutteridge
I've run into a problem with a cluster we've got in a cloud provider and am hoping someone might have some advice. The problem is that I've got a circumstance where large jobs _never_ start... or, more correctly, that larger jobs don't start when there are many smaller jobs in the partition. In this

Re: [slurm-users] Different slurm.conf for master and nodes

2019-02-27 Thread Michael Gutteridge
Hi, I don't know what version of Slurm you're using or how it may differ from the one I'm using (18.05), but here's my understanding of memory limits and what I'm seeing on our cluster. The parameter `JobAcctGatherParams=OverMemoryKill` controls whether a step is killed if it goes over the
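
For reference, a minimal sketch of the accounting-gather settings being discussed; the polling interval is illustrative:

    # Poll per-process usage so step memory can be tracked and enforced
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30
    # Kill a step whose memory use exceeds its request
    JobAcctGatherParams=OverMemoryKill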

Re: [slurm-users] Fairshare - root user

2019-02-27 Thread Antony Cleave
I think if you increase the share of mygroup to something like 999, then the share that the root user gets will drop by a factor of 1000. I'm pretty sure I've seen this before, and that's how I fixed it. Antony On Wed, 27 Feb 2019 at 13:47, Will Dennis wrote: > Looking at output of 'sshare', I see: >
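
A minimal sketch of that change with sacctmgr, assuming the account really is named mygroup:

    sacctmgr modify account name=mygroup set fairshare=999
    sshare -l    # check the resulting NormShares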

Re: [slurm-users] Fairshare - root user

2019-02-27 Thread Marcus Wagner
Hi Will, as long as you do not submit a massive number of jobs as root, there should be no problem. This is only a priority thing; root will have a fairly high priority, but it does not mean the users can only use half of your cluster. Best Marcus On 2/27/19 2:43 PM, Will Dennis wrote:

[slurm-users] Fairshare - root user

2019-02-27 Thread Will Dennis
Looking at output of 'sshare', I see:

root@myserver:~# sshare -l
             Account       User  RawShares  NormShares    RawUsage  NormUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ---------- ------------- ----------
root