Hello,

We are running Slurm v19.05.5 and I am experimenting with the scontrol reboot 
command. I find that compute nodes reboot, but they are not returned to 
service. Instead, they remain down following the reboot:

navy55         1    debug*        down   80   2:20:2 192000        0   2000   (null) Reboot ASAP : reboot

This is a diskful node, so it doesn't take long to reboot. For the sake of 
argument I have set ResumeTimeout to 1000 seconds, which is well over what's 
needed:

[root@navy51 slurm]# grep -i resume slurm.conf
ResumeTimeout=1000
[root@navy51 slurm]# grep -i return slurm.conf
ReturnToService=0
[root@navy51 slurm]# grep -i nhc slurm.conf
# LBNL Node Health Check (NHC)
#HealthCheckProgram=/usr/sbin/nhc

For this experiment I have disabled the health checker, and I don't think 
setting ReturnToService=1 helps. Could anyone please help with this? We are 
about to update the node firmware and ensuring that the nodes are returned to 
service following their reboot would be useful.
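
One option that may be relevant here is the nextstate=RESUME argument to 
scontrol reboot. The sketch below assumes that this Slurm version supports 
it; I have not confirmed that it changes the behaviour described above. The 
function name reboot_and_resume, the DRY_RUN variable and the reason string 
are hypothetical and only there for illustration. By default the function 
just prints the command it would run.

```shell
#!/bin/sh
# Sketch only: reboot a node and ask Slurm to resume it afterwards.
# Assumes "scontrol reboot" accepts nextstate= in the installed version.
# reboot_and_resume and DRY_RUN are hypothetical names, not part of Slurm.
reboot_and_resume() {
    node="$1"
    # Build the command once so the dry run and the real run are identical.
    cmd="scontrol reboot ASAP nextstate=RESUME reason=firmware_update $node"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"    # default: only print what would be run
    else
        $cmd           # word splitting of $cmd is intentional here
    fi
}

# A node that is already stuck in "down" can be returned by hand with:
#   scontrol update NodeName=navy55 State=RESUME
```

For example, "reboot_and_resume navy55" prints the command for navy55, and 
"DRY_RUN=0 reboot_and_resume navy55" would actually issue it.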

Best regards,
David
