> I haven't thought about it too hard, but the default NHC scripts do not notice that.

That's the problem with NHC and any other problem-checking script: you have to tell it what errors to check for. As new errors turn up, those scripts inevitably grow longer.

--
Prentice

On 5/4/21 12:47 PM, Alex Chekholko wrote:
In my most recent experience, I have some SSDs in compute nodes that occasionally just drop off the bus, so the compute node loses its OS disk.  I haven't thought about it too hard, but the default NHC scripts do not notice that. Similarly, Paul's proposed script might need to also check that the slurm log file is readable. The way I detect it myself is that a random swath of jobs fails, and then when I SSH to the node I get an I/O error instead of a regular connection.
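
A minimal extra check along these lines might catch it (untested sketch;
the slurmd log path, /var/log/slurm/slurmd.log here, is an assumption to
adapt to whatever your slurm.conf points at):

    #!/usr/bin/env python3
    # Sketch: verify the OS disk still answers by reading the local
    # slurmd log; drain the node if the read throws an I/O error.
    import socket
    import subprocess

    SLURMD_LOG = "/var/log/slurm/slurmd.log"  # assumed path

    def disk_ok():
        try:
            with open(SLURMD_LOG, "rb") as f:
                f.read(4096)  # an OSError here means the disk is gone
            return True
        except OSError:
            return False

    if not disk_ok():
        node = socket.gethostname().split(".")[0]
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}",
             "State=DRAIN", "Reason=OS disk unreadable"],
            check=False,
        )

Of course that assumes the check itself can still execute once the disk
drops out, which is exactly the part I haven't thought through.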

On Tue, May 4, 2021 at 9:41 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:

    Since you can run an arbitrary script as a node health checker, I
    might add a script that counts failures and then closes the node if
    it hits a threshold.  The script shouldn't need to talk to the
    slurmctld or slurmdbd, as it should be able to watch the log on the
    node and see the failures.
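
    Rough sketch of the idea (untested; the slurmd log path and the
    regexes for failed vs. finished jobs are guesses you'd adapt to
    what actually shows up in your logs):

        #!/usr/bin/env python3
        # Sketch: scan the local slurmd log, count consecutive job
        # failures, and drain the node once a threshold is hit.
        import re
        import socket
        import subprocess

        SLURMD_LOG = "/var/log/slurm/slurmd.log"      # assumed path
        FAIL_RE = re.compile(r"error: .*[Jj]ob \d+")   # assumed pattern
        DONE_RE = re.compile(r"done with job")         # assumed pattern
        THRESHOLD = 10

        consecutive = 0
        with open(SLURMD_LOG) as log:
            for line in log:
                if FAIL_RE.search(line):
                    consecutive += 1
                elif DONE_RE.search(line):
                    consecutive = 0
                if consecutive >= THRESHOLD:
                    node = socket.gethostname().split(".")[0]
                    subprocess.run(
                        ["scontrol", "update", f"NodeName={node}",
                         "State=DRAIN",
                         f"Reason={THRESHOLD} consecutive job failures"],
                        check=False,
                    )
                    break

    A real version would remember its offset in the log between runs
    instead of rescanning from the top every time it fires.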

    -Paul Edmon-

    On 5/4/2021 12:09 PM, Gerhard Strangar wrote:
    > Hello,
    >
    > how do you implement something like "drain host after 10 consecutive
    > failed jobs"? Unlike a host check script, which checks for known
    > errors, I'd like to stop killing jobs just because one node is faulty.
    >
    > Gerhard
    >
