In my most recent experience, I have some SSDs in compute nodes that
occasionally just drop off the bus, so the compute node loses its OS disk.
I haven't thought about it too hard, but the default NHC scripts do not
notice that. Similarly, Paul's proposed script might need to also check
that the
Since you can run an arbitrary script as a node health checker, I might
add a script that counts failures and then closes the node if it hits a
threshold. The script shouldn't need to talk to the slurmctld or
slurmdbd, as it should be able to watch the log on the node and see the
failures.
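
A rough sketch of the counting part, in case it helps. This is only an
illustration under assumptions: the log line format matched by JOB_END
is made up for the example and is NOT the real slurmd log format, so the
pattern would have to be adapted to whatever your node logs actually
contain. The function names and the default threshold of 10 (taken from
the question) are likewise just placeholders.

```python
import re

# ASSUMPTION: hypothetical log line such as "job 1234 finished rc=1".
# This is not the real slurmd format; adjust the pattern to your logs.
JOB_END = re.compile(r"job (?P<id>\d+) finished rc=(?P<rc>-?\d+)")


def trailing_failures(lines):
    """Count the consecutive failed jobs at the end of the log."""
    streak = 0
    for line in lines:
        match = JOB_END.search(line)
        if match is None:
            continue  # not a job-completion line, ignore it
        if int(match.group("rc")) == 0:
            streak = 0  # a successful job resets the streak
        else:
            streak += 1
    return streak


def should_drain(lines, threshold=10):
    """True once the last `threshold` finished jobs have all failed."""
    return trailing_failures(lines) >= threshold
```

A health check wrapper could feed the node's log lines to should_drain()
and, when it returns True, report the node as unhealthy in whatever way
your health checker expects.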
-Paul Edmon-
On
Hello,
how do you implement something like "drain host after 10 consecutive
failed jobs"? Unlike a host check script, which checks for known errors,
I'd like something that stops jobs from being killed just because one
node is faulty.
Gerhard