On 06-08-2020 19:13, Jason Simms wrote:
Later this month, I will have to bring down, patch, and reboot all nodes in our cluster for maintenance. The two options available to set nodes into a maintenance mode seem to be either: 1) creating a system-wide reservation, or 2) setting all nodes into a DRAIN state.

I'm not sure it really matters either way, but is there any preference one way or the other? Any gotchas I should be aware of?

I'd recommend using a reservation because you can define a specific maintenance period well ahead of time. You should create the reservation at least as far in advance as the largest MaxTime of any partition in slurm.conf, so that no jobs will still be running when the reservation starts. Jobs that fit in the remaining time can then continue to run until the very last minute!
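
As a rough sketch of that timing rule (not part of the original message), the following Python computes the latest moment at which the reservation should be created, given the MaxTime values of the partitions, and assembles an scontrol command line for it. The reservation name, start date, duration, and MaxTime values are made-up examples, the parser does not handle UNLIMITED or INFINITE, and the choice of Users=root with Flags=MAINT,IGNORE_JOBS is an assumption that you should check against the scontrol man page for your Slurm version.

```python
from datetime import datetime, timedelta


def parse_slurm_time(value):
    """Parse a Slurm time limit such as '30', '12:00:00' or '7-00:00:00'.

    Covers the numeric formats only; UNLIMITED/INFINITE are not handled.
    """
    days = 0
    if "-" in value:
        # days-hours[:minutes[:seconds]]
        day_part, rest = value.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in value.split(":")]
        if len(parts) == 1:      # minutes
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:    # minutes:seconds
            hours, minutes, seconds = 0, parts[0], parts[1]
        else:                    # hours:minutes:seconds
            hours, minutes, seconds = parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def latest_creation_time(maintenance_start, partition_max_times):
    """Return the latest time at which the reservation should be created.

    A job started just before the reservation exists may run for up to
    MaxTime, so the reservation must exist at least the largest MaxTime
    before the maintenance window begins.
    """
    longest = max(parse_slurm_time(t) for t in partition_max_times)
    return maintenance_start - longest


def reservation_command(name, start, duration_minutes):
    """Build an scontrol command (as an argument list) for a maintenance
    reservation covering all nodes. Flags and user are assumptions."""
    return [
        "scontrol", "create", "reservation",
        f"ReservationName={name}",
        f"StartTime={start:%Y-%m-%dT%H:%M:%S}",
        f"Duration={duration_minutes}",
        "Nodes=ALL",
        "Users=root",
        "Flags=MAINT,IGNORE_JOBS",
    ]


if __name__ == "__main__":
    # Hypothetical maintenance window and partition limits.
    start = datetime(2020, 8, 24, 8, 0)
    max_times = ["2-00:00:00", "7-00:00:00"]
    print("Create the reservation no later than:",
          latest_creation_time(start, max_times))
    print(" ".join(reservation_command("maintenance", start, 480)))
```

The script only prints the command so that it can be reviewed before being run by hand as root.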

I have some notes on reservations in
https://wiki.fysik.dtu.dk/niflheim/SLURM#resource-reservation

Draining nodes is a bad idea, IMHO, because you'll have a lot of drained nodes sitting idle from now until your maintenance period, causing lost resources.

The way I prefer to do upgrades is actually neither 1) nor 2). I make rolling (minor) upgrades of the compute node OS and firmware while the cluster is in full production, in order to avoid lost resources. I will post my upgrade script to this list in a separate message.

/Ole
