We make use of the node health check (HealthCheckProgram in slurm.conf) to
automatically put nodes online or offline when things like mount points are
not available. If a check fails, our script executes an scontrol command to
drain the node and updates the reason with something like "$SCRATCH is not
mounted". Nodes do not start as online, so if the health check runs and
everything looks OK the node is put online; otherwise the node remains
unavailable for job submission.
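A minimal sketch of such a HealthCheckProgram script might look like the
following. The mount path, node naming, and state transitions are assumptions
for illustration, not the poster's actual script; the wrapper around scontrol
simply prints the command when Slurm isn't installed, so the sketch can be
tried off-cluster.

```shell
#!/bin/sh
# Hypothetical node health check sketch (assumed paths and names).

SCRATCH="${SCRATCH:-/scratch}"
NODE="$(hostname -s)"

# Wrapper so the sketch can run off-cluster: fall back to printing the
# command when scontrol is not available.
run_scontrol() {
    if command -v scontrol >/dev/null 2>&1; then
        scontrol "$@"
    else
        echo "scontrol $*"
    fi
}

# Returns 0 when all checks pass; otherwise sets REASON and returns 1.
node_healthy() {
    if ! grep -q " $SCRATCH " /proc/mounts; then
        REASON="$SCRATCH is not mounted"
        return 1
    fi
    return 0
}

if node_healthy; then
    # Nodes boot drained, so a clean bill of health puts the node online.
    run_scontrol update NodeName="$NODE" State=RESUME
else
    run_scontrol update NodeName="$NODE" State=DRAIN Reason="$REASON"
fi
```

Further checks (IB links, GPU device files, and so on) would just be more
cases inside node_healthy, each setting its own REASON.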

As an aside, we have logic in our check so that if the keyword "ADMIN"
appears in the reason field, the node health check takes no action.
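That escape hatch can be sketched as a simple substring test on the node's
Reason field. The function name and example reason strings below are invented
for illustration; in a real script the reason would be read back from Slurm
(for example via "scontrol show node -o") rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical sketch of the ADMIN keyword check (assumed names/values).

# Returns 0 (true) when an administrator has claimed the node by putting
# the keyword ADMIN anywhere in the Reason field.
admin_hold() {
    case "$1" in
        *ADMIN*) return 0 ;;
        *)       return 1 ;;
    esac
}

REASON_FIELD="ADMIN: waiting on DIMM replacement"   # example value
if admin_hold "$REASON_FIELD"; then
    echo "ADMIN hold set; health check takes no action"
fi
```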

Kind regards,
George
________________________________________
From: Christopher Samuel [[email protected]]
Sent: 01 September 2014 01:45
To: slurm-dev
Subject: [slurm-dev] Re: starting slurmd only after GPUs are fully initialized

On 30/08/14 02:14, Lev Givon wrote:

> Is there a recommended way (on Ubuntu, at least) to ensure that slurmd isn't
> started before any GPU device files appear?

To be honest, my policy for many years has been to never start queuing
system daemons on boot; it's too easy to have a node go bad, reboot,
come back up, take a job, go bad, reboot, take a job, go bad, reboot,
and repeat until no jobs are left.

DIMMs go bad, IB and accelerator cards go bad and cause NMIs; for us it's
not worth the risk.

We rarely reboot nodes other than for hardware failure or a software
upgrade, so if one does go bad we want to find out why before we let it
back into the cluster.

All the best,
Chris
--
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: [email protected] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
