Hello,

We are running Slurm v19.05.5 and I am experimenting with the scontrol reboot command. The compute nodes reboot, but they are not returned to service; instead they remain down following the reboot:
    navy55 1 debug* down 80 2:20:2 192000 0 2000 (null) Reboot ASAP : reboot

This is a diskfull node, so it doesn't take long to reboot. For the sake of argument I have set ResumeTimeout to 1000 seconds, which is well over what's needed:

    [root@navy51 slurm]# grep -i resume slurm.conf
    ResumeTimeout=1000
    [root@navy51 slurm]# grep -i return slurm.conf
    ReturnToService=0
    [root@navy51 slurm]# grep -i nhc slurm.conf
    # LBNL Node Health Check (NHC)
    #HealthCheckProgram=/usr/sbin/nhc

For this experiment I have disabled the health checker, and I don't think setting ReturnToService=1 helps.

Could anyone please help with this? We are about to update the node firmware, and ensuring that the nodes are returned to service following their reboot would be useful.

Best regards,
David