Hi;

For the same reasons you mentioned, I don't use Slurm's power saving features. I want to keep a certain number of nodes always powered on and ready to run jobs. Slurm's settings for this are very limited: only the SuspendExcNodes and SuspendExcParts parameters exist. But SuspendExcNodes is useless for this purpose. The nodes you list in SuspendExcNodes are always powered on, and they will probably be busy. When those nodes are busy, there are no idle nodes left to start a job instantly.

We use a cron script to power idle nodes off and on. It keeps a certain number of idle nodes powered on at all times. It chooses that number by predicting the cluster load from the cluster's load history.
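
As a rough illustration of that logic (this is not our actual script), a minimal Python sketch might look like the following. The function names, the simple averaging predictor, and the floor of 2 nodes are all assumptions made up for the example. Actually powering nodes on or off is site-specific and is left out.

```python
from math import ceil
from typing import List, Sequence, Tuple


def predict_target_idle(demand_history: Sequence[int], floor: int = 2) -> int:
    """Hypothetical predictor: mean of recent demand samples, rounded up.

    demand_history holds how many idle nodes were consumed in each recent
    interval. The result never drops below `floor`.
    """
    if not demand_history:
        return floor
    mean = sum(demand_history) / len(demand_history)
    return max(floor, ceil(mean))


def plan_power_actions(
    idle_on: List[str], powered_off: List[str], target_idle: int
) -> Tuple[List[str], List[str]]:
    """Return (nodes_to_power_on, nodes_to_power_off).

    If fewer nodes are idle than the target, pick powered-off nodes to start.
    If more are idle than the target, pick the surplus idle nodes to stop.
    """
    shortage = target_idle - len(idle_on)
    if shortage > 0:
        return powered_off[:shortage], []
    if shortage < 0:
        return [], idle_on[target_idle:]
    return [], []


if __name__ == "__main__":
    target = predict_target_idle([3, 4, 6])
    to_start, to_stop = plan_power_actions(
        ["n1", "n2"], ["n3", "n4", "n5", "n6"], target
    )
    print("target idle nodes:", target)
    print("power on:", to_start, "power off:", to_stop)
```

A cron job would gather the two node lists from the scheduler, call these functions, and then hand the resulting lists to whatever mechanism the site uses to switch nodes on and off.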

But there is a problem with this approach: neither Slurm nor the users can tell whether a node is down for power saving or for some other reason. To solve this, my spart command (https://github.com/mercanca/spart), which is used to show queues (free CPU and node info), has a feature that shows power-saving nodes as idle.

This script was not written for publication; it is very specific to our environment. But if you want to use it (or just get inspired by it), I can share it.

Regards;

Ahmet M.



On 23.05.2022 11:03, Corentin Mercier wrote:
Hello,

I am currently trying to make energy savings on a cluster running SLURM.

I read the Power Saving guide and found exactly what I am looking for: SuspendTime. It allows me to shut nodes down after a certain idle time. However, I want to go further by keeping a small number of nodes idle in certain partitions so that small jobs can run instantly.
In short, I want to keep a reactivity margin on certain partitions.

In the documentation, I saw that it's possible to exclude given nodes from being shut down, but I want that list to be dynamic so that a certain number of nodes always stays idle.
Here's an example:
On partition A, there should always be 5 idle nodes available to new clients. As clients arrive, those idle nodes become allocated and new nodes need to be started to replace them (they'll stay idle until allocated). I would need to wake some nodes and update the exclusion list so that they stay idle.

As someone else could have faced the same issue, I went on SLURM's GitHub to check the available plugins there but I couldn't find any that implement a dynamic reactivity margin.

So, is there a plugin that implements such a mechanism? Or should I work with the Suspend/ResumeProgram scripts to update the SuspendExcNodes list by hand?
I'd be glad to hear any other existing solution too.

Regards,
C.Mercier
