We use a scavenger partition, and although we do not have the exact policy you
describe, the same approach could be used in your case.
Assume you have 6 nodes (node-[0-5]) and two groups A and B.
Create partitions
partA = node-[0-2]
partB = node-[3-5]
all = node-[0-5]
Create QoSes normal and scavenger.
Allow the normal QOS on partA and partB, and the scavenger QOS on the all partition.
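To make that concrete, here is a minimal sketch of how the rest could be wired up, assuming QOS-based preemption with requeue; the partition and QOS names follow the example above, and the priorities are purely illustrative:
# slurm.conf (sketch)
PreemptType=preempt/qos
PreemptMode=REQUEUE
PartitionName=partA Nodes=node-[0-2] AllowQos=normal
PartitionName=partB Nodes=node-[3-5] AllowQos=normal
PartitionName=all   Nodes=node-[0-5] AllowQos=scavenger PreemptMode=REQUEUE
# sacctmgr (sketch): let the normal QOS preempt scavenger jobs
sacctmgr add qos scavenger
sacctmgr modify qos where name=scavenger set Priority=0
sacctmgr modify qos where name=normal set Priority=100 Preempt=scavenger
Group A can then scavenge group B's idle nodes (and vice versa) with something like "sbatch -p all --qos=scavenger job.sh", and those jobs get requeued when the owning group needs the nodes back.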
We set up a partition that underlies all our hardware and is preemptable by
all higher-priority partitions. That way it can grab idle cycles while still
letting higher-priority jobs run. This also allows users to do:
#SBATCH -p primarypartition,requeuepartition
so that the scheduler will start the job in whichever of the listed partitions
offers the earliest start.
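The slurm.conf side of that layout might look roughly like the following; partition-priority preemption and the node ranges are assumptions on my part, while the partition names are just the ones from the #SBATCH line:
# slurm.conf (sketch)
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
# higher-priority partition on hardware a group "owns"; its jobs are never preempted
PartitionName=primarypartition Nodes=node-[0-2] PriorityTier=10 PreemptMode=OFF
# low-priority partition underlying all hardware; its jobs are requeued on preemption
PartitionName=requeuepartition Nodes=node-[0-5] PriorityTier=1 PreemptMode=REQUEUE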
Our MaxTime and DefaultTime are 14 days. Setting a high DefaultTime was a
convenience for our users (and the support team) but has evolved into a mistake
because it hurts backfill. Under high load we'll see small backfill jobs
take over, because the estimated start and end times of jobs that simply
inherit the 14-day DefaultTime are so pessimistic that the backfill scheduler
keeps slotting small jobs in ahead of them.
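Concretely, that corresponds to something like the following (the partition name and node list are placeholders); the point is that backfill plans around each job's time limit, so a job that never sets --time is treated as a 14-day job:
# slurm.conf (sketch): the settings described above
PartitionName=batch Nodes=node-[0-5] MaxTime=14-00:00:00 DefaultTime=14-00:00:00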
Hello David,
I'm still relatively new at Slurm, but one way we handle this is that for
users/groups who have "bought in" to the cluster, we use a QOS to provide
them preemptible access to the set of resources provided by, e.g., a set
number of nodes, but not the nodes themselves. That is, in one example, the
group's QOS is capped at the aggregate resources of the nodes they funded,
while their jobs are free to land anywhere in the shared pool.
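A rough sketch of that pattern, with deliberately made-up numbers (assume 32 cores per node and a group, groupA, that bought three nodes' worth of capacity; the QOS, account, and script names are hypothetical):
# sacctmgr (sketch): cap the group at 3 nodes' worth of cores, not 3 named nodes
sacctmgr add qos groupA_qos
sacctmgr modify qos where name=groupA_qos set GrpTRES=cpu=96
sacctmgr modify account where name=groupA set QOS+=groupA_qos
# jobs can then land on any node in the shared partition:
sbatch -p all --account=groupA --qos=groupA_qos job.sh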
Hello,
I would appreciate your advice on how to deal with this situation in Slurm,
please. I have a set of nodes used by two groups, and normally each group
would have access to half the nodes. So I could, for example, limit each group
to 3 nodes each. I am trying to devise a scheme that lets either group make
use of the other group's nodes when they are idle.
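For reference, the straightforward hard split described there could be expressed like this (the partition and Unix group names are hypothetical, and the node layout follows the six-node example elsewhere in the thread):
# slurm.conf (sketch): a rigid 3-node split per group
PartitionName=groupA Nodes=node-[0-2] AllowGroups=groupa
PartitionName=groupB Nodes=node-[3-5] AllowGroups=groupb
The replies in this thread are about relaxing that split so idle nodes do not go to waste.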
FWIW, I define the DefaultTime as 5 minutes, which effectively means that for
any "real" job users must actually define a time. It helps users get
into that habit, because in the absence of a DefaultTime, most will not
even bother to think critically and carefully about what time limit is
actually needed.
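In slurm.conf terms that is simply (partition name hypothetical):
# slurm.conf (sketch): a 5-minute default, so real jobs must request a time
PartitionName=batch Nodes=node-[0-5] DefaultTime=00:05:00
# a "real" job then has to state its limit explicitly, e.g.
#SBATCH --time=1-00:00:00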
Yes, I hadn't considered that! Thanks for the tip, Michael; I shall do that.
John
On Fri, Oct 02, 2020 at 01:49:44PM, Renfro, Michael wrote:
> Depending on the users who will be on this cluster, I'd probably adjust the
> partition to have a defined, non-infinite MaxTime, and maybe a lower
> DefaultTime as well.
>
> The problem is with a single, specific node: str957-bl0-03. The same
> job script works when allocated to another node, even with more
> ranks (tested up to 224/4 on mtx-* nodes).
Ahhh... here's where the details help. So it appears that the problem is on a
single node, and probably not a general Slurm or MPI configuration issue.
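One way to confirm that kind of single-node suspicion (a generic suggestion, not necessarily what was done here; job.sh stands in for the failing job script) is to pin a test run to the suspect node, or exclude it, and compare:
# reproduce on the suspect node only
sbatch --nodelist=str957-bl0-03 job.sh
# or keep it out of the picture while debugging
sbatch --exclude=str957-bl0-03 job.sh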
On 05/10/20 14:18, Riebs, Andy wrote:
Thanks for considering my query.
> You need to provide some hints! What we know so far:
> 1. What we see here is (what looks like) an Open MPI/PMIx backtrace.
Correct.
> 2. Your decision to address this to the Slurm mailing list suggests that
> you suspect Slurm is involved.