Dear all,

We use "scontrol reboot asap reason=<whatever reason it is> nextstate=resume" to reboot nodes, e.g. after a kernel update.
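For reference, the full invocation looks roughly like this (the node list here is just an example):

# issue an ASAP reboot: drain the nodes, reboot them once they are free of jobs,
# and resume them after they come back
scontrol reboot asap nextstate=resume reason="Kernel-Update" ncg[01-10]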

But I must say, this only works sometimes. Often SLURM forgets that a maintenance reboot is pending for a node and therefore never reboots it:


ncg01            DRAINING        Kernel-Update [root@2019-04-26T08:51:43]
ncg02            DRAINED         Kernel-Update [root@2019-04-26T08:51:43]
ncg04            DRAINED         Kernel-Update [root@2019-04-26T08:51:43]
ncg07            DRAINING@       Kernel-Update [root@2019-04-26T08:51:43]
ncg08            DRAINING        Kernel-Update [root@2019-04-26T08:51:43]
ncg10            DRAINED         Kernel-Update [root@2019-04-26T08:51:43]

As you can see, ncg07 is still draining; the "@" means the reboot is still pending. ncg01 and ncg08 are also still draining, but slurm has forgotten about the pending reboot (no "@" sign).
ncg02, ncg04 and ncg10 are already drained, but do not get rebooted.

ncg03 (not shown here) was drained and rebooted as expected.
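In case it helps, this is roughly how I check which nodes slurmctld still tracks a pending reboot for (ncg01 and ncg07 here stand in for whatever nodes are affected):

# node, state and reason; the "@" suffix on the state marks a pending reboot
sinfo -N -n ncg01,ncg07 -o "%N %12T %E"

# the controller's detailed view of a single node; while the request is still
# tracked I would expect a REBOOT flag in the State= line
scontrol show node ncg07 | grep -i 'State='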

The following is a small excerpt from the slurmctld log:

[2019-04-26T08:51:43.267] reboot request queued for nodes ncg04
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg02
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg03
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg10
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg07
[2019-04-26T08:51:43.271] reboot request queued for nodes ncg08
[2019-04-26T08:51:43.271] reboot request queued for nodes ncg01
[2019-04-26T09:21:31.053] node ncg03 returned to service
[2019-04-26T11:53:15.937] Node ncg08 now responding
[2019-04-26T11:53:15.937] Node ncg02 now responding
[2019-04-26T11:56:51.565] Node ncg01 now responding
[2019-04-26T11:56:51.565] Node ncg10 now responding
[2019-04-26T11:56:51.565] Node ncg08 now responding
[2019-04-26T11:56:51.565] Node ncg04 now responding
[2019-04-26T11:56:51.565] Node ncg03 now responding
[2019-04-29T09:44:17.839] node ncg10 returned to service
[2019-04-29T09:44:21.102] node ncg02 returned to service
[2019-04-29T09:44:32.394] node ncg04 returned to service
[2019-04-29T10:20:22.557] update_node: node ncg02 state set to IDLE
[2019-04-29T10:20:32.553] update_node: node ncg04 state set to IDLE
[2019-04-29T10:20:51.897] update_node: node ncg10 state set to IDLE



Today at about 09:40 I reissued the reboot for ncg02, ncg04 and ncg10. This time, since these nodes were already drained, slurmctld issued the reboot immediately and the nodes are now up again.
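For completeness, the reissued command was essentially the same as before (flags from memory), just restricted to the already drained nodes:

# reissue the reboot for the drained nodes only
scontrol reboot asap nextstate=resume reason="Kernel-Update" ncg02,ncg04,ncg10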


Does anyone have similar issues, or a clue where this behaviour might come from?


Best
Marcus


--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

