On 3/8/22 10:20 pm, Gerhard Strangar wrote:

With a fake license called reboot?

It's a neat idea, but I think there is a catch:

* 3 jobs start, each taking 1 license
* Other reboot jobs are all blocked
* Running reboot jobs trigger node reboot
* Running reboot jobs end either when the script exits and slurmd cleans it up before the reboot kills it, or when the job is killed as NODE_FAIL after the node has been unresponsive for too long and is marked as down
* Licenses for those jobs are released
* 3 more reboot jobs start whilst the original 3 are rebooting
* 6 nodes are now rebooting
* Filesystem fall down go boom
* Also your rebooted nodes are now drained as "Node unexpectedly rebooted"
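
To make the timing problem concrete, here is a rough toy model of the sequence above. It is only a sketch and does not talk to Slurm at all: the function name, the tick-based clock, and the job_ticks / reboot_ticks parameters are all made up for illustration. The one assumption that matters is that a job gives its license back sooner than the node finishes rebooting.

```python
def peak_rebooting(nodes, licenses, job_ticks, reboot_ticks):
    """Return the peak number of nodes rebooting at once (toy model).

    job_ticks:    how long a reboot job holds its license before it ends
    reboot_ticks: how long the node stays down after the job starts
    """
    pending = nodes          # reboot jobs still queued
    license_release = []     # tick at which each running job frees its license
    node_back = []           # tick at which each rebooting node is back up
    peak = 0
    tick = 0
    while pending or node_back:
        # Jobs that have ended give their license back.
        license_release = [t for t in license_release if t > tick]
        # Nodes that have finished booting are no longer counted.
        node_back = [t for t in node_back if t > tick]
        # Start as many queued jobs as there are free licenses.
        while pending and len(license_release) < licenses:
            pending -= 1
            license_release.append(tick + job_ticks)
            node_back.append(tick + reboot_ticks)
        peak = max(peak, len(node_back))
        tick += 1
    return peak


if __name__ == "__main__":
    # Job ends (license freed) halfway through the reboot: 6 nodes down at once.
    print(peak_rebooting(nodes=12, licenses=3, job_ticks=2, reboot_ticks=4))
    # License held for the whole reboot: never more than 3 nodes down.
    print(peak_rebooting(nodes=12, licenses=3, job_ticks=5, reboot_ticks=5))
```

With 3 licenses, the first call prints 6 and the second prints 3. In this model the license only caps concurrent reboots if it is held until the node is back.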

I guess you could change your Slurm config to not mark nodes as down if they stop responding, and make sure the job that's launched doesn't exit before the reboot takes the node down, but that feels wrong to me.

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

