Dear all,

We use "scontrol reboot asap reason=<whatever reason it is> nextstate=resume" to reboot nodes, e.g. after a kernel update.
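For reference, the full invocation looks roughly like this (the node list here is just an example):

# issue an ASAP reboot: drain the nodes, reboot them once they are free of jobs,
# and resume them after they come back
scontrol reboot asap nextstate=resume reason="Kernel-Update" ncg[01-10]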

But I must say, this only works sometimes. Often SLURM forgets that a maintenance reboot is pending for a node and therefore never reboots it:


ncg01            DRAINING        Kernel-Update [root@2019-04-26T08:51:43]
ncg02            DRAINED         Kernel-Update [root@2019-04-26T08:51:43]
ncg04            DRAINED         Kernel-Update [root@2019-04-26T08:51:43]
ncg07            DRAINING@       Kernel-Update [root@2019-04-26T08:51:43]
ncg08            DRAINING        Kernel-Update [root@2019-04-26T08:51:43]
ncg10            DRAINED         Kernel-Update [root@2019-04-26T08:51:43]

As you can see, ncg07 is still draining; the "@" means the reboot is still pending. ncg01 and ncg08 are also still draining, but slurm has forgotten about the pending reboot (no "@" sign).
ncg02, ncg04 and ncg10 are already drained, but do not get rebooted.

ncg03 (not shown here) was drained and rebooted as expected.
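In case it helps, this is roughly how I check which nodes slurmctld still tracks a pending reboot for (ncg01 and ncg07 here stand in for whatever nodes are affected):

# node, state and reason; the "@" suffix on the state marks a pending reboot
sinfo -N -n ncg01,ncg07 -o "%N %12T %E"

# the controller's detailed view of a single node; while the request is still
# tracked I would expect a REBOOT flag in the State= line
scontrol show node ncg07 | grep -i 'State='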

The following is a small excerpt from the slurmctld log:

[2019-04-26T08:51:43.267] reboot request queued for nodes ncg04
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg02
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg03
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg10
[2019-04-26T08:51:43.267] reboot request queued for nodes ncg07
[2019-04-26T08:51:43.271] reboot request queued for nodes ncg08
[2019-04-26T08:51:43.271] reboot request queued for nodes ncg01
[2019-04-26T09:21:31.053] node ncg03 returned to service
[2019-04-26T11:53:15.937] Node ncg08 now responding
[2019-04-26T11:53:15.937] Node ncg02 now responding
[2019-04-26T11:56:51.565] Node ncg01 now responding
[2019-04-26T11:56:51.565] Node ncg10 now responding
[2019-04-26T11:56:51.565] Node ncg08 now responding
[2019-04-26T11:56:51.565] Node ncg04 now responding
[2019-04-26T11:56:51.565] Node ncg03 now responding
[2019-04-29T09:44:17.839] node ncg10 returned to service
[2019-04-29T09:44:21.102] node ncg02 returned to service
[2019-04-29T09:44:32.394] node ncg04 returned to service
[2019-04-29T10:20:22.557] update_node: node ncg02 state set to IDLE
[2019-04-29T10:20:32.553] update_node: node ncg04 state set to IDLE
[2019-04-29T10:20:51.897] update_node: node ncg10 state set to IDLE



Today at about 09:40 I reissued the reboot for ncg02, ncg04 and ncg10. This time, since these nodes were already drained, slurmctld issued the reboot immediately and the nodes are now up again.
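For completeness, the reissued command was essentially the same as before (flags from memory), just restricted to the already drained nodes:

# reissue the reboot for the drained nodes only
scontrol reboot asap nextstate=resume reason="Kernel-Update" ncg02,ncg04,ncg10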


Does anyone have similar issues, or a clue where this behaviour might come from?


Best
Marcus


--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

