[slurm-users] Controlling access to idle nodes

2020-10-06 Thread David Baker
Hello,

I would appreciate your advice on how to deal with this situation in Slurm, 
please. I have a set of nodes used by 2 groups, and normally each group would 
have access to half the nodes. So, I could limit each group to 3 nodes each, 
for example. I am trying to devise a scheme that allows each group to make the 
best use of the nodes at all times. In other words, each group could 
potentially use all the nodes (assuming they are all free and the other group 
isn't using them at all).

I cannot set hard and soft limits in Slurm, and so I'm not sure how to make the 
situation flexible. Ideally, it would be good for each group to be able to use 
its own allocation and then take advantage of any idle nodes via a scavenging 
mechanism. The other group could then pre-empt the scavenger jobs and reclaim 
its nodes. I'm struggling with this since it seems like a two-way scavenger 
situation.

Could anyone please help? I have, by the way, set up partition-based 
pre-emption in the cluster. This allows the general public to scavenge nodes 
owned by research groups.

Best regards,
David




Re: [slurm-users] Controlling access to idle nodes

2020-10-06 Thread Jason Simms
Hello David,

I'm still relatively new at Slurm, but one way we handle this is that for
users/groups who have "bought in" to the cluster, we use a QOS to give them
preemptible access to the amount of resources that, e.g., a set number of
nodes would provide, but not to the nodes themselves. That is, in one example,
two researchers each have priority preemptible access to up to 52 cores in the
cluster, but those cores can come from any physical node. I set the priority
of each researcher's QOS to be equal, so that they cannot preempt each other.
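
A minimal sketch of what that kind of QOS might look like in sacctmgr, with
made-up QOS and user names and the 52-core cap purely as an illustration:

sacctmgr add qos lab_smith
sacctmgr add qos lab_jones
# cap each lab at 52 cores in total across its running jobs, with equal priority
sacctmgr modify qos lab_smith set GrpTRES=cpu=52 Priority=10
sacctmgr modify qos lab_jones set GrpTRES=cpu=52 Priority=10
# grant a (hypothetical) user access to their lab's QOS and make it their default
sacctmgr modify user alice set qos+=lab_smith defaultqos=lab_smith

The actual preemption behaviour still depends on PreemptType and on the
relative priorities of any other QOSes on the system.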

Admittedly, this works best and most simply in a situation where your nodes
are relatively homogeneous, as ours currently are. I am trying to avoid a
situation where a given physical node is restricted to a specific
researcher/group, as I want all nodes, as much as possible, to be available
to all users, precisely so that idle cycles don't go to waste. It aligns
with the general philosophy that nodes are more like cattle and less like
pets, in my opinion, so I try to treat them like a giant shared pool rather
than multiple independent, gated systems.

Anyway, I suspect other users here with more experience might have a
different, or better, approach and I look forward to hearing their thoughts
as well.

Warmest regards,
Jason

-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632


Re: [slurm-users] Controlling access to idle nodes

2020-10-06 Thread Paul Edmon
We set up a partition that underlies all our hardware and is 
preemptable by all higher-priority partitions.  That way it can grab 
idle cycles while permitting higher-priority jobs to run. This also 
allows users to do:


#SBATCH -p primarypartition,requeuepartition

This way the scheduler will select whichever partition will run their job 
sooner.  Then we rely on fairshare to adjudicate priority.
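
For context, a rough sketch of how such a floor partition might be expressed 
in slurm.conf (the node range and PriorityTier values below are placeholders 
rather than our actual settings):

# cluster-wide: partition-priority preemption, with preempted jobs requeued
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# group-owned partition; PreemptMode=OFF means its jobs are never preempted
PartitionName=primarypartition Nodes=node[01-16] PriorityTier=10 PreemptMode=OFF
# low-priority partition over the same hardware; its jobs soak up idle cycles
# and get requeued whenever a higher PriorityTier partition needs the nodes
PartitionName=requeuepartition Nodes=node[01-16] PriorityTier=1 PreemptMode=REQUEUE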


-Paul Edmon-



Re: [slurm-users] Controlling access to idle nodes

2020-10-06 Thread Thomas M. Payerle
We use a scavenger partition, and although we do not have exactly the policy
you describe, a similar setup could be used in your case.

Assume you have 6 nodes (node-[0-5]) and two groups A and B.
Create partitions
partA = node-[0-2]
partB = node-[3-5]
all = node-[0-5]

Create QoSes normal and scavenger.
Allow the normal QoS to preempt jobs running with the scavenger QoS.

In sacctmgr, give members of group A access to use partA with the normal QoS,
and give group B access to use partB with the normal QoS.
Allow both A and B to use the all partition with the scavenger QoS.
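
A rough sketch of that setup, assuming QoS-based preemption and the node names
above (the account names acctA and acctB are hypothetical):

# slurm.conf: QoS-based preemption, scavenger jobs get requeued when preempted
PreemptType=preempt/qos
PreemptMode=REQUEUE
PartitionName=partA Nodes=node-[0-2] AllowAccounts=acctA
PartitionName=partB Nodes=node-[3-5] AllowAccounts=acctB
PartitionName=all   Nodes=node-[0-5] AllowQos=scavenger

# sacctmgr: create the scavenger QoS and let normal preempt it
sacctmgr add qos scavenger
sacctmgr modify qos scavenger set Priority=10
sacctmgr modify qos normal set Priority=100 Preempt=scavenger
# give each group's account both QoSes, with normal as the default
sacctmgr modify account acctA set qos=normal,scavenger defaultqos=normal
sacctmgr modify account acctB set qos=normal,scavenger defaultqos=normal

Here partition access is enforced with AllowAccounts/AllowQos in slurm.conf
rather than purely through associations; either mechanism (or a combination)
can work.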

So members of A can launch jobs on partA with the normal QoS (you probably want
to make that their default), and similarly members of B can launch jobs on
partB with the normal QoS.
But members of A can also launch jobs on partB's nodes (via the all partition)
with the scavenger QoS, and vice versa.  If the partB nodes being used by A are
needed by B, those scavenger jobs will get preempted.
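
Under those assumptions, the user-facing side for a member of group A might
look like the following (job.sh being whatever they normally submit):

# everyday work on the group's own nodes
sbatch -p partA --qos=normal job.sh
# scavenging idle cycles on the rest of the cluster; may be preempted and requeued
sbatch -p all --qos=scavenger job.sh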

This is not automatic (users need to explicitly say they want to run jobs on
the other half of the cluster), but that is probably reasonable, because there
are some jobs one does not want preempted even if they have to wait a while
longer in the queue to ensure that.


-- 
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads    paye...@umd.edu
5825 University Research Park   (301) 405-6135
University of Maryland
College Park, MD 20740-3831


Re: [slurm-users] Controlling access to idle nodes

2020-10-08 Thread David Baker
Thank you very much for your comments. Oddly enough, I came up with the 
three-partition model myself just after I'd sent my email, so your comments 
helped to confirm that I was thinking along the right lines.

Best regards,
David

