> I haven't thought about it too hard, but the default NHC scripts do not notice that.

That's the problem with NHC and any other problem-checking script: you have to tell it what errors to check for. As new errors turn up, those scripts inevitably grow longer.

--
Prentice

On 5/4/21 12:47 PM, Alex Chekholko wrote:
In my most recent experience, I have some SSDs in compute nodes that occasionally just drop off the bus, so the compute node loses its OS disk.  I haven't thought about it too hard, but the default NHC scripts do not notice that. Similarly, Paul's proposed script might need to also check that the slurm log file is readable. The way I detect it myself is that a random swath of jobs fails, and then when I SSH to the node I get an I/O error instead of a regular connection.
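
A minimal extra check along these lines might catch it (untested sketch;
the slurmd log path, /var/log/slurm/slurmd.log here, is an assumption to
adapt to whatever your slurm.conf points at):

    #!/usr/bin/env python3
    # Sketch: verify the OS disk still answers by reading the local
    # slurmd log; drain the node if the read throws an I/O error.
    import socket
    import subprocess

    SLURMD_LOG = "/var/log/slurm/slurmd.log"  # assumed path

    def disk_ok():
        try:
            with open(SLURMD_LOG, "rb") as f:
                f.read(4096)  # an OSError here means the disk is gone
            return True
        except OSError:
            return False

    if not disk_ok():
        node = socket.gethostname().split(".")[0]
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}",
             "State=DRAIN", "Reason=OS disk unreadable"],
            check=False,
        )

Of course that assumes the check itself can still execute once the disk
drops out, which is exactly the part I haven't thought through.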

On Tue, May 4, 2021 at 9:41 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:

    Since you can run an arbitrary script as a node health checker, I
    might add a script that counts failures and then closes the node if
    it hits a threshold.  The script shouldn't need to talk to the
    slurmctld or slurmdbd, as it should be able to watch the log on the
    node and see the failures.
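
    Rough sketch of the idea (untested; the slurmd log path and the
    regexes for failed vs. finished jobs are guesses you'd adapt to
    what actually shows up in your logs):

        #!/usr/bin/env python3
        # Sketch: scan the local slurmd log, count consecutive job
        # failures, and drain the node once a threshold is hit.
        import re
        import socket
        import subprocess

        SLURMD_LOG = "/var/log/slurm/slurmd.log"      # assumed path
        FAIL_RE = re.compile(r"error: .*[Jj]ob \d+")   # assumed pattern
        DONE_RE = re.compile(r"done with job")         # assumed pattern
        THRESHOLD = 10

        consecutive = 0
        with open(SLURMD_LOG) as log:
            for line in log:
                if FAIL_RE.search(line):
                    consecutive += 1
                elif DONE_RE.search(line):
                    consecutive = 0
                if consecutive >= THRESHOLD:
                    node = socket.gethostname().split(".")[0]
                    subprocess.run(
                        ["scontrol", "update", f"NodeName={node}",
                         "State=DRAIN",
                         f"Reason={THRESHOLD} consecutive job failures"],
                        check=False,
                    )
                    break

    A real version would remember its offset in the log between runs
    instead of rescanning from the top every time it fires.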

    -Paul Edmon-

    On 5/4/2021 12:09 PM, Gerhard Strangar wrote:
    > Hello,
    >
    > how do you implement something like "drain host after 10 consecutive
    > failed jobs"? Unlike a host check script, which checks for known
    > errors, I'd like to stop killing jobs just because one node is faulty.
    >
    > Gerhard
    >
