Hi;
For the same reasons you mentioned, I don't use Slurm's power saving
features. I want to keep a certain number of nodes always powered on and
ready to run. The Slurm settings for this are very limited: only the
SuspendExcNodes and SuspendExcParts parameters exist. But SuspendExcNodes
is useless for this purpose. The nodes you list in SuspendExcNodes stay
powered on, but they will probably be busy. When these nodes are busy,
there are no idle nodes left for jobs to start instantly.
We use a cron script to power idle nodes off and on. It keeps a certain
number of idle nodes always powered on. It also chooses that number
from a prediction of the cluster load, based on the history of the
cluster load.
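As a rough illustration only (this is not the actual script; the
function names, the headroom factor and the minimum margin are all
assumptions made up for this example), the decision step of such a cron
job could be sketched in Python like this. It only decides which nodes
to switch; gathering node states and issuing the power commands is left
out.

```python
import math
from statistics import mean


def predict_margin(history, minimum=2, headroom=1.5):
    """Guess how many idle nodes to keep powered on.

    history: recent counts of nodes that became allocated per interval
    (hypothetical input; how it is collected is site specific).
    """
    if not history:
        return minimum
    # Scale the average recent demand by a safety factor, never below minimum.
    return max(minimum, math.ceil(mean(history) * headroom))


def plan_power_actions(idle_on, powered_off, target):
    """Return (nodes_to_power_on, nodes_to_power_off) to reach `target`
    idle, powered-on nodes."""
    deficit = target - len(idle_on)
    if deficit > 0:
        # Too few idle nodes: wake as many powered-off nodes as are available.
        return powered_off[:deficit], []
    if deficit < 0:
        # Too many idle nodes: keep the first `target`, power off the rest.
        return [], idle_on[target:]
    return [], []


if __name__ == "__main__":
    target = predict_margin([2, 4, 3])
    to_on, to_off = plan_power_actions(["n1", "n2"], ["n3", "n4", "n5"], target)
    print(target, to_on, to_off)
```

The two returned lists would then be handed to whatever mechanism the
site uses to power nodes on and off.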
But there is a problem with this approach: Slurm and the users cannot
tell which nodes are down for power saving and which are down for other
reasons. To solve this issue, my spart command
(https://github.com/mercanca/spart), which is used to show queue
information (free CPUs and nodes), has a feature that shows
power-saving nodes as idle.
This script was not written for publication; it is very specific to our
environment. But if you want to use it (or just get inspired), I can
share it.
Regards;
Ahmet M.
On 23.05.2022 11:03, Corentin Mercier wrote:
Hello,
I am currently trying to make energy savings on a cluster running SLURM.
I read the Power Saving guide and I found exactly what I am looking
for: SuspendTime. It allows me to shut nodes down after a certain
idle time.
However, I want to go further by keeping a small number of nodes idle
in certain partitions in order to allow small jobs to run instantly.
In short, I want to keep a reactivity margin on certain partitions.
In the documentation, I saw that it's possible to exclude given nodes
from shutting down but I want that list to be dynamic and to keep a
certain amount of nodes idle.
Here's an example:
On partition A, there should always be 5 idle nodes available to new
clients. As clients come, those idle nodes become allocated and new
nodes need to be started in order to replace them (they'll stay idle
until allocated).
I would need to wake some nodes and update the exclusion list so
they stay idle.
As someone else could have faced the same issue, I went on SLURM's
GitHub to check the available plugins there but I couldn't find any
that implement a dynamic reactivity margin.
So, is there a plugin that implements such a mechanism? Or should I
work with the Suspend/ResumeProgram scripts to update the
SuspendExcNodes list by hand?
I'd be glad to hear any other existing solution too.
Regards,
C.Mercier