So here is an example of using Slurm to reboot all nodes, 3 at a time:

    sinfo -h -o '%n' | xargs --max-procs=3 -I{} scontrol reboot {}

If you want to get fancy, make a script that does the reboot and waits for the node to be back up before exiting, and use that in place of the 'scontrol reboot' part.
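
For example, a minimal sketch of such a wrapper (the script name is made up, and the REBOOT-flag check is an assumption -- the exact node state strings vary by Slurm version, so check what your cluster actually reports):

    #!/bin/bash
    # reboot-and-wait.sh <node> -- hypothetical helper: request a reboot via
    # Slurm, then block until the node's REBOOT flag clears, so that
    #   sinfo -h -o '%n' | xargs --max-procs=3 -I{} ./reboot-and-wait.sh {}
    # keeps at most 3 nodes in flight at a time.
    set -euo pipefail
    node="$1"

    scontrol reboot "$node"

    # Wait until scontrol no longer reports a REBOOT* flag for this node
    # (adjust the pattern and the polling interval to taste).
    while scontrol show node "$node" | grep -q 'REBOOT'; do
        sleep 30
    done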

Brian Andrus

On 8/3/2022 11:47 AM, Benjamin Arntzen wrote:
At risk of being a heretic, why not something like Ansible to handle this? Slurm "should" be able to do it but feels like a bit of a weird fit for the job.
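
For the record, a rough ad-hoc sketch of that (assuming an inventory group named "compute"; -f 3 limits Ansible to three hosts at a time, and the built-in reboot module waits for each host to come back up before releasing its slot):

    ansible compute -b -f 3 -m ansible.builtin.reboot -a "reboot_timeout=1800"

Unlike 'scontrol reboot', though, this pays no attention to running jobs, so you would want to drain the nodes first (or wrap it in a playbook that does).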

------------------------------------------------------------------------
*From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Phil Chiu <whophilc...@gmail.com>
*Sent:* Wednesday, 3 August 2022, 5:51 pm
*To:* slurm-us...@schedmd.com <slurm-us...@schedmd.com>
*Subject:* [slurm-users] Rolling reboot with at most N machines down simultaneously?

Occasionally I need to reboot all the compute nodes in my system. However, I have a parallel file system which is /converged/, i.e., each compute node contributes a disk to the file system. The file system can tolerate having N nodes down simultaneously.

Therefore my problem is this - "Reboot all nodes, permitting N nodes to be rebooting simultaneously."

I have thought about the following options:

  * A mass scontrol reboot - There doesn't seem to be a way to control
    how many nodes are rebooted at once.
  * A job array - Job arrays can be easily configured to allow at most
    N jobs to be running simultaneously. However, I would need each
    array task to execute on a specific node, which does not appear to
    be possible.
  * Individual slurm jobs which reboot nodes - With a for loop, I
    could submit a reboot job for each node. But I'm not sure how to
    limit this so at most N jobs are running simultaneously. Perhaps a
    special partition is needed for this? (See the sketch after this
    list.)
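
For what it's worth, a rough sketch of how that third option could be throttled with a QOS (the QOS name, the limit of 3, and the assumption that a job may reboot its own host via sudo are all made up here; the submitting user's association would also need access to the QOS):

    # Hypothetical: cap the QOS at 3 running jobs in aggregate, then submit
    # one exclusive single-node job per node that reboots its own host.
    sacctmgr -i add qos reboot
    sacctmgr -i modify qos reboot set GrpJobs=3

    for node in $(sinfo -h -o '%n'); do
        sbatch --qos=reboot -w "$node" -N1 --exclusive \
               --wrap 'sudo /usr/sbin/reboot'
    done

The catch is that a job dies as soon as its node goes down, which frees a QOS slot before the node has finished booting, so this only approximates "at most N nodes down at once".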

Open to hearing any other ideas.

Thanks!
Phil
