Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

Christopher Samuel Tue, 24 Oct 2023 17:12:40 -0700

On 10/24/23 12:39, Tim Schneider wrote:

Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME <node>", the node goes in "mix@" state (not drain), but no new jobs get scheduled until the node reboots. Essentially I get draining behavior, even though the node's state is not "drain". Note that this behavior is caused by "nextstate=RESUME"; if I leave that away, jobs get scheduled as expected. Does anyone have an idea why that could be?

The intent of the "ASAP" flag for "scontrol reboot" is to not let any more jobs onto a node until it has rebooted.

IIRC that was from work we sponsored, the idea being that (for how our nodes are managed) we would build new images with the latest software stack, test them on a separate test system and then once happy bring them over to the production system and do an "scontrol reboot ASAP nextstate=resume reason=... $NODES" to ensure that from that point onwards no new jobs would start in the old software configuration, only the new one.
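
As a rough illustration of that invocation, here is a minimal Python sketch that only assembles the command line and does not run it. The helper name build_reboot_command, the node names and the reason text are all made up for the example; only the "scontrol reboot ASAP nextstate=... reason=... <nodes>" shape comes from the command quoted above.

```python
import shlex


def build_reboot_command(nodes, reason, asap=True, next_state="resume"):
    """Assemble (but do not execute) an `scontrol reboot` argv list.

    Hypothetical helper for illustration only; it is not part of Slurm.
    """
    if not nodes:
        raise ValueError("at least one node name is required")
    argv = ["scontrol", "reboot"]
    if asap:
        # ASAP: no more jobs are let onto the node until it has rebooted.
        argv.append("ASAP")
    if next_state:
        # State requested for the node once the reboot has completed.
        argv.append(f"nextstate={next_state}")
    if reason:
        argv.append(f"reason={reason}")
    # Node names are passed as one comma-separated argument.
    argv.append(",".join(nodes))
    return argv


# Placeholder node names and reason, chosen for the example.
cmd = build_reboot_command(["node001", "node002"], reason="new image")
print(shlex.join(cmd))
```

Building the argument list separately makes it easy to review exactly what would be sent before handing it to something like subprocess.run on a host that has scontrol available.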

Also slurmctld would know that these nodes are due to come back in "ResumeTimeout" seconds after the reboot is issued and so could plan for them as part of scheduling large jobs, rather than thinking there was no way it could do so and letting lots of smaller jobs get in the way.


Hope that helps!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
