Re: [slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Alex Chekholko
In my most recent experience, I have some SSDs in compute nodes that
occasionally just drop off the bus, so the compute node loses its OS disk.
I haven't thought about it too hard, but the default NHC scripts do not
notice that.  Similarly, Paul's proposed script might also need to check
that the slurm log file is readable.
The way I detect it myself is that a random swath of jobs fails and then,
when I SSH to the node, I get an I/O error instead of a regular
connection.
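
A rough sketch of such a readability check, assuming a slurmd log at
/var/log/slurmd.log and a drain via scontrol (both are assumptions to
adapt); if the disk is fully gone the interpreter itself may not start,
so this only catches the partial failures:

#!/usr/bin/env python3
"""Sketch: verify the local OS disk / slurmd log is still readable and
drain the node if it is not.  Log path and drain reason are assumptions."""

import socket
import subprocess
import sys

SLURMD_LOG = "/var/log/slurmd.log"   # assumed location

def log_is_readable(path):
    """Try to actually read bytes; this fails with an I/O error once the
    SSD has dropped off the bus, which a bare existence check would miss."""
    try:
        with open(path, "rb") as fh:
            fh.read(4096)
        return True
    except OSError:
        return False

def drain(reason):
    """Drain this node via scontrol; needs to run as root or the slurm user."""
    node = socket.gethostname().split(".")[0]
    subprocess.run(["scontrol", "update", "NodeName=" + node,
                    "State=DRAIN", "Reason=" + reason], check=False)

if __name__ == "__main__":
    if not log_is_readable(SLURMD_LOG):
        drain("healthcheck: slurmd log unreadable, possible dead OS disk")
        sys.exit(1)
    sys.exit(0)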

On Tue, May 4, 2021 at 9:41 AM Paul Edmon  wrote:

> Since you can run an arbitrary script as a node health checker, I might
> add a script that counts failures and then closes the node if it hits a
> threshold.  The script shouldn't need to talk to the slurmctld or
> slurmdbd, as it should be able to watch the log on the node and see the
> failures.
>
> -Paul Edmon-
>
> On 5/4/2021 12:09 PM, Gerhard Strangar wrote:
> > Hello,
> >
> > how do you implement something like "drain host after 10 consecutive
> > failed jobs"? Unlike a host check script that checks for known errors,
> > I'd like to stop jobs being killed just because one node is faulty.
> >
> > Gerhard
> >
>
>


Re: [slurm-users] Draining hosts because of failing jobs

2021-05-04 Thread Paul Edmon
Since you can run an arbitrary script as a node health checker, I might
add a script that counts failures and then closes the node if it hits a
threshold.  The script shouldn't need to talk to the slurmctld or
slurmdbd, as it should be able to watch the log on the node and see the
failures.


-Paul Edmon-

On 5/4/2021 12:09 PM, Gerhard Strangar wrote:

Hello,

how do you implement something like "drain host after 10 consecutive
failed jobs"? Unlike a host check script that checks for known errors,
I'd like to stop jobs being killed just because one node is faulty.

Gerhard
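
A minimal sketch of the kind of counting script Paul describes, run
locally on the node (for example via HealthCheckProgram in slurm.conf or
from a custom NHC check); the log location, the failure/success patterns
and the threshold of 10 are all placeholders to adapt to what your
slurmd actually logs:

#!/usr/bin/env python3
"""Sketch: scan the local slurmd log, track consecutive job failures and
drain once a threshold is hit, without talking to slurmctld or slurmdbd.
The log path, both regexes and the threshold are assumptions."""

import re
import socket
import subprocess
import sys

SLURMD_LOG = "/var/log/slurmd.log"             # assumed location
FAIL_RE = re.compile(r"error: .*[Jj]ob \d+")   # placeholder failure pattern
OK_RE = re.compile(r"done with job")           # placeholder success pattern
THRESHOLD = 10                                 # consecutive failures to drain

def consecutive_failures(path):
    """Walk the log in order, resetting the streak whenever a job finishes
    cleanly, so only an unbroken run of failures triggers a drain."""
    streak = 0
    try:
        with open(path, errors="replace") as fh:
            for line in fh:
                if OK_RE.search(line):
                    streak = 0
                elif FAIL_RE.search(line):
                    streak += 1
    except OSError:
        # Log not readable at all: treat as over threshold (likely dead disk).
        return THRESHOLD
    return streak

def drain(reason):
    """Drain this node via scontrol; needs to run as root or the slurm user."""
    node = socket.gethostname().split(".")[0]
    subprocess.run(["scontrol", "update", "NodeName=" + node,
                    "State=DRAIN", "Reason=" + reason], check=False)

if __name__ == "__main__":
    n = consecutive_failures(SLURMD_LOG)
    if n >= THRESHOLD:
        drain("healthcheck: %d consecutive failed jobs on this node" % n)
        sys.exit(1)
    sys.exit(0)

Wiring it in would look something like HealthCheckProgram=/usr/local/sbin/drain_on_failures.py
(a hypothetical path) plus a HealthCheckInterval in slurm.conf, or a call
from a custom NHC check; the scontrol update has to run as root or the
SlurmUser, and re-reading the whole log each interval is crude but keeps
the check entirely local to the node.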