We make use of the node health check (HealthCheckProgram in slurm.conf) to automatically put nodes online or offline when, for example, mount points are not available. If a check fails, our script runs an scontrol command to drain the node and sets the reason to something like "$SCRATCH is not mounted". Nodes do not start out online: if the health check runs and everything looks OK, the node is brought online; otherwise it remains unavailable for job submission.
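The actual script isn't shown, but a health check of this shape might look like the sketch below. The script name, the CHECKS variable, the use of mountpoint(1), and the exact reason strings are all assumptions for illustration, not the real script; the ADMIN-keyword skip follows the behaviour described later in this message.

```shell
#!/bin/sh
# Hypothetical HealthCheckProgram, e.g. in slurm.conf:
#   HealthCheckProgram=/usr/local/sbin/node_healthcheck.sh
# Drains the node when a required mount is missing, resumes it when all
# checks pass, and takes no action if the drain reason contains "ADMIN".

SCONTROL=${SCONTROL:-scontrol}   # overridable for testing
NODE=${NODE:-$(hostname -s)}
CHECKS=${CHECKS:-/scratch}       # space-separated mount points to verify

check_mounted() {
    # succeed only if the given path is an active mount point
    mountpoint -q "$1"
}

current_reason() {
    # print the node's current drain reason, if any
    "$SCONTROL" show node "$NODE" | sed -n 's/.*Reason=\(.*\)/\1/p'
}

main() {
    # leave manually drained nodes alone
    case "$(current_reason)" in
        *ADMIN*) return 0 ;;
    esac

    for mnt in $CHECKS; do
        if ! check_mounted "$mnt"; then
            "$SCONTROL" update NodeName="$NODE" State=DRAIN \
                Reason="$mnt is not mounted"
            exit 1
        fi
    done

    # everything passed: bring the node back online
    "$SCONTROL" update NodeName="$NODE" State=RESUME
}

# slurmd would invoke this script directly; uncomment to run the checks:
# main
```

Setting scontrol's Reason on drain is what makes the "ADMIN" convention workable: the next health-check run reads the reason back and can tell an automated drain from a manual one.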
As an aside, we have logic in our check so that if the keyword "ADMIN" appears in the reason field, the node health check takes no action.

Kind regards,
George

________________________________________
From: Christopher Samuel [[email protected]]
Sent: 01 September 2014 01:45
To: slurm-dev
Subject: [slurm-dev] Re: starting slurmd only after GPUs are fully initialized

On 30/08/14 02:14, Lev Givon wrote:

> Is there a recommended way (on Ubuntu, at least) to ensure that slurmd isn't
> started before any GPU device files appear?

To be honest, my policy for many years has been never to start queuing system daemons on boot: it's too easy to have a node go bad, reboot, come back up, take a job, go bad, reboot, take a job, go bad, reboot, and repeat until no jobs are left.

DIMMs go bad, IB and accelerator cards go bad and cause NMIs; for us it's not worth the risk. We rarely reboot nodes other than for hardware failure or a software upgrade, so if one does go bad we want to find out why before we let it back into the cluster.

All the best,
Chris

--
Christopher Samuel
Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected]
Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/
http://twitter.com/vlsci
