Hi Chris

Thanks for the clarification.

Mike

-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Chris Samuel
Sent: Tuesday, 23 March 2021 5:30 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Slurm - UnkillableStepProgram

Hi Mike,

On 22/3/21 7:12 pm, Yap, Mike wrote:

> # I presume UnkillableStepTimeout is set in slurm.conf, and it acts as
> a timer to trigger UnkillableStepProgram

That is correct.

> # UnkillableStepProgram can be used to send email or reboot the compute
> node - the question is, how do we configure it?

It can also be used to automate collecting debug info (which is what we do);
we then manually intervene to reboot the node once we've determined there's
no more useful info to collect.

It's just configured in your slurm.conf.

UnkillableStepProgram=/path/to/the/unkillable/step/script.sh

Of course this script has to be present on every compute node.
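
For illustration only, here is a minimal sketch of what such a script might look like. It is not the script referred to above. It assumes slurmd passes SLURM_JOB_ID and SLURM_STEP_ID in the environment (check the slurm.conf man page for your version), and the UNKILLABLE_LOG_DIR variable and the /tmp/slurm-unkillable default are hypothetical names chosen for this example. It only records information and does not reboot or notify anyone.

```shell
#!/bin/bash
# Hypothetical UnkillableStepProgram sketch: records debug info only.

# Assumption: slurmd exports SLURM_JOB_ID and SLURM_STEP_ID to this program;
# fall back to "unknown" if they are not set.
job_id="${SLURM_JOB_ID:-unknown}"
step_id="${SLURM_STEP_ID:-unknown}"

# Hypothetical log location; override by setting UNKILLABLE_LOG_DIR.
log_dir="${UNKILLABLE_LOG_DIR:-/tmp/slurm-unkillable}"
mkdir -p "$log_dir"
report="$log_dir/job${job_id}_step${step_id}_$(date +%Y%m%dT%H%M%S).log"

{
    echo "Unkillable step on $(uname -n) at $(date)"
    echo "Job: $job_id  Step: $step_id"
    echo
    echo "Processes in uninterruptible sleep (state D):"
    # Keep the header row plus any process whose state starts with D.
    ps -eo pid,ppid,stat,wchan:32,user,comm 2>/dev/null |
        awk 'NR == 1 || $3 ~ /^D/'
} > "$report"

# Print the report path so it shows up in whatever captures the output.
echo "$report"
```

The file also needs to be executable by the user slurmd runs as. Sending mail or rebooting could be added after the report is written, but that is a site policy decision.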

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

