Re: [slurm-users] Controlling access to idle nodes

2020-10-06 Thread Thomas M. Payerle
We use a scavenger partition, and although we do not have the policy you describe, the same setup could be used in your case. Assume you have 6 nodes (node-[0-5]) and two groups, A and B. Create partitions partA = node-[0-2], partB = node-[3-5], all = node-[0-5]. Create QoSes normal and scavenger. Allow normal Q
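
A minimal sketch of that layout, using the node and partition names above; the excerpt is cut off, so the QoS preemption rule shown here is an assumption about where the setup is headed:

    # slurm.conf -- preempt scavenger jobs via QoS
    PreemptType=preempt/qos
    PreemptMode=REQUEUE
    PartitionName=partA Nodes=node-[0-2] AllowQos=normal
    PartitionName=partB Nodes=node-[3-5] AllowQos=normal
    PartitionName=all   Nodes=node-[0-5] AllowQos=scavenger

    # create the QoSes and let normal-QoS jobs preempt scavenger jobs
    sacctmgr add qos scavenger
    sacctmgr modify qos normal set Preempt=scavenger

Group A submits normal work to partA, group B to partB, and anyone can soak up idle cycles through the all partition at scavenger QoS, getting requeued when owner work arrives.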

Re: [slurm-users] Controlling access to idle nodes

2020-10-06 Thread Paul Edmon
We set up a partition that underlies all our hardware that is preemptable by all higher priority partitions.  That way it can grab idle cycles while permitting higher priority jobs to run. This also allows users to do: #SBATCH -p primarypartition,requeuepartition So that the scheduler will se
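
A rough slurm.conf sketch of that arrangement using partition-priority preemption (the partition names come from the post; the node list and PriorityTier values are assumptions):

    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    PartitionName=primarypartition Nodes=node-[0-5] PriorityTier=2
    PartitionName=requeuepartition Nodes=node-[0-5] PriorityTier=1

A job submitted to both partitions then starts wherever it can run first, e.g.:

    #!/bin/bash
    #SBATCH -p primarypartition,requeuepartition
    #SBATCH --requeue
    srun ./my_app    # my_app is a placeholder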

Re: [slurm-users] Simple free for all cluster

2020-10-06 Thread Sebastian T Smith
Our MaxTime and DefaultTime are 14 days. Setting a high DefaultTime was a convenience for our users (and the support team) but has evolved into a mistake because it impacts backfill. Under high load we'll see small backfill jobs take over because the estimated start and end time of "DefaultTime
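
For context, both knobs are per-partition; a line like the following (partition and node names assumed) reproduces the situation described, where any job submitted without an explicit --time is scheduled as if it needed the full 14 days:

    PartitionName=batch Nodes=node-[0-5] MaxTime=14-00:00:00 DefaultTime=14-00:00:00

Lowering DefaultTime gives the backfill scheduler realistic end-time estimates for jobs that do not set --time.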

Re: [slurm-users] Controlling access to idle nodes

2020-10-06 Thread Jason Simms
Hello David, I'm still relatively new to Slurm, but one way we handle this is that for users/groups who have "bought in" to the cluster, we use a QOS to provide them preemptible access to the resources equivalent to, e.g., a set number of nodes, but not the nodes themselves. That is, in one e
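
A hedged sketch of one way to express "resources, not nodes" with a QoS (the group name, TRES amounts, and the scavenger QoS name are all assumptions):

    # cap the owner group at roughly the hardware they bought in
    sacctmgr add qos groupA GrpTRES=cpu=96,mem=768G
    # let their jobs preempt low-priority scavenger work
    sacctmgr modify qos groupA set Preempt=scavenger
    # attach the QoS to the group's account
    sacctmgr modify account groupA set QOS+=groupA

Jobs run under qos=groupA can then land on any matching node, but the group as a whole never holds more than its purchased share.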

[slurm-users] Controlling access to idle nodes

2020-10-06 Thread David Baker
Hello, I would appreciate your advice on how to deal with this situation in Slurm, please. I have a set of nodes used by 2 groups, and normally each group would have access to half the nodes. So, I could limit each group to 3 nodes, for example. I am trying to devise
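
For reference, the static half-and-half split being described would look roughly like this in slurm.conf (node, partition, and Unix group names are assumptions); the replies in this thread relax it with a preemptable overlay partition or QoS so idle nodes are not wasted:

    PartitionName=groupA Nodes=node-[0-2] AllowGroups=groupa
    PartitionName=groupB Nodes=node-[3-5] AllowGroups=groupb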

Re: [slurm-users] Simple free for all cluster

2020-10-06 Thread Jason Simms
FWIW, I define the DefaultTime as 5 minutes, which effectively means for any "real" job that users must actually define a time. It helps users get into that habit, because in the absence of a DefaultTime, most will not even bother to think critically and carefully about what time limit is actually
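
Concretely, that policy is one line in slurm.conf (the partition name, node list, and 7-day MaxTime are assumptions):

    PartitionName=batch Nodes=node-[0-5] DefaultTime=00:05:00 MaxTime=7-00:00:00

Any job that omits --time is killed after 5 minutes, so real jobs end up with an explicit, considered limit, e.g. #SBATCH --time=08:00:00.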

Re: [slurm-users] Simple free for all cluster

2020-10-06 Thread John H
Yes, I hadn't considered that! Thanks for the tip, Michael; I shall do that. John On Fri, Oct 02, 2020 at 01:49:44PM +, Renfro, Michael wrote: > Depending on the users who will be on this cluster, I'd probably adjust the > partition to have a defined, non-infinite MaxTime, and maybe a lower >

Re: [slurm-users] Segfault with 32 processes, OK with 30 ???

2020-10-06 Thread Riebs, Andy
> The problem is with a single, specific, node: str957-bl0-03 . The same > job script works if being allocated to another node, even with more > ranks (tested up to 224/4 on mtx-* nodes). Ahhh... here's where the details help. So it appears that the problem is on a single node, and probably not a
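
A quick way to confirm a single-node problem like this with standard Slurm commands (the test binary is a placeholder; str957-bl0-03 comes from the thread, and the comparison node name is assumed to be one of the working mtx-* nodes):

    # reproduce on the suspect node with the failing rank count
    srun -w str957-bl0-03 -n 32 ./mpi_test
    # same test on a known-good node for comparison
    srun -w mtx-0 -n 32 ./mpi_test
    # take the bad node out of scheduling while investigating
    scontrol update NodeName=str957-bl0-03 State=DRAIN Reason="MPI segfault at 32 ranks"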

Re: [slurm-users] Segfault with 32 processes, OK with 30 ???

2020-10-06 Thread Diego Zuccato
On 05/10/20 14:18, Riebs, Andy wrote: Thanks for considering my query. > You need to provide some hints! What we know so far: > 1. What we see here is (what looks like) an Open MPI/PMIx > backtrace. Correct. > 2. Your decision to address this to the Slurm mailing list sugges