Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

Brian Andrus Mon, 24 May 2021 07:07:06 -0700

Not sure I understand how a failed node can only be detected from inside the job environment.

That description is more of "our application is behaving badly, but not so badly that the node quits responding." For that situation, your app or job should have something in place to catch that and report it to slurm in some fashion (up to and including killing the process).
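As a rough illustration of "catch that and report it to slurm", here is a minimal sketch that runs a health check and, on failure, builds an `scontrol update ... State=DRAIN` command for the node. The function names and the `check` callable are hypothetical, not from this thread. It assumes `SLURMD_NODENAME` is set in the task environment, and that whoever runs the command is allowed to change node state (typically root, SlurmUser or an operator, which a TaskProlog running as the job user usually is not).

```python
import os
import subprocess


def build_drain_command(node, reason):
    """Return the scontrol argument list that would drain `node`."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def drain_if_unhealthy(check, node=None, dry_run=True):
    """Run `check()`; if it returns False, build (and optionally run) a drain.

    `check` is a hypothetical callable standing in for the site's own
    detection test. Returns the command list when the node is judged
    unhealthy, or None when the check passes.
    """
    # Assumption: SLURMD_NODENAME is available in the task environment.
    node = node or os.environ.get("SLURMD_NODENAME", "unknown-node")
    if check():
        return None
    cmd = build_drain_command(node, "in-job health check failed")
    if not dry_run:
        # Fails with CalledProcessError if the caller lacks permission
        # to update node state.
        subprocess.run(cmd, check=True)
    return cmd
```

With `dry_run=True` nothing is executed, so the command can be logged or handed to a privileged helper instead of being run directly from the job.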

Slurm polls the nodes and if slurmd does not respond, it will mark the node as failed. So slurmd must be responding.
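To see what the controller currently thinks of a node, one option is to ask `sinfo` for its state. The sketch below is only an illustration: the helper names are made up, and it assumes that `sinfo -h -n <node> -o %T` prints the node state in long form and that a trailing `*` marks a node that is not responding.

```python
import subprocess


def build_state_query(node):
    """Return the sinfo argument list that prints only the state of `node`."""
    return ["sinfo", "-h", "-n", node, "-o", "%T"]


def is_unavailable(state):
    """Return True if a sinfo state string looks down, drained or failing."""
    s = state.strip().lower()
    # Assumption: a trailing '*' means slurmd is not responding.
    if s.endswith("*"):
        return True
    return s.startswith(("down", "drain", "fail"))


def query_state(node):
    """Run sinfo and return the raw state string for `node`."""
    out = subprocess.run(build_state_query(node), check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()
```

If `is_unavailable(query_state(node))` is False while the job still sees a broken node, that matches the situation described here: slurmd is answering, so the fault has to be detected some other way.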

If you can provide a better description of what symptoms you see that cause you to feel the node has failed, we can help a little more.


On 5/24/2021 3:02 AM, Mark Dixon wrote:

Hi all,
Sometimes our compute nodes get into a failed state which we can only detect from inside the job environment.
I can see that TaskProlog / TaskEpilog allows us to run our detection test; however, unlike Epilog and Prolog, they do not drain a node if they exit with a non-zero exit code.
Does anyone have advice on automatically draining a node in this situation, please?
Best wishes,

Mark
