Hi!

On 8/8/21 3:19 AM, Chris Samuel wrote:
> On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:
>
>> I was wondering why a node is drained when the killing of a task fails, and
>> how can I disable that? (I use cgroups.) Moreover, how can the killing of a
>> task fail? (This is on Slurm 19.05.)
>
> Slurm has tried to kill processes, but they refuse to go away. Usually this
> means they're stuck in a device or I/O wait for some reason, so look for
> processes that are in a "D" state on the node.
Yes, the running jobs save part of a file when they are killed, and depending
on the target that save can get stuck ... I have to think of a way to take a
snapshot of the processes when this happens.
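
For reference, a minimal sketch of such a snapshot, assuming standard procps ps
and awk are available on the node (this one-liner is just an illustration, not
something Slurm provides):

  # list processes stuck in uninterruptible sleep ("D" state), together with
  # the kernel function they are blocked in (wchan)
  ps -eo pid,ppid,user,stat,wchan:32,cmd | awk 'NR==1 || $4 ~ /^D/'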

> As others have said they can be stuck writing out large files and waiting for
> the kernel to complete that before they exit.  This can also happen if you're
> using GPUs and something has gone wrong in the driver and the process is stuck
> in the kernel somewhere.
>
> You can try doing "echo w > /proc/sysrq-trigger" on the node to see if the
> kernel reports tasks stuck and where they are stuck.
>
> If there are tasks stuck in that state then often the only recourse is to
> reboot the node back into health.
Yeah, that would be bad, and so is the move to draining ... I use batch jobs
and I can have 128 different jobs on a single node ... I will see if I can
increase some timeouts.
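
(The relevant knob seems to be UnkillableStepTimeout in slurm.conf, i.e. how
long slurmd waits after SIGKILL before declaring the step unkillable and
draining the node; the value below is only an illustration, not a
recommendation:

  # slurm.conf -- illustrative value only
  UnkillableStepTimeout=180
)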

> You can tell Slurm to run a program on the node should it find itself in this
> state, see:
>
> https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram
Oh, thanks for the hint, I glossed over this when looking through the
documentation. As a first approximation I can make this a tool that reports
what is going on, and later add some actions.
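
A rough sketch of what a report-only UnkillableStepProgram could look like (the
script path, the log location, and the reliance on SLURM_* environment
variables are assumptions to verify and adapt for your site and Slurm version):

  #!/bin/bash
  # hypothetical report-only script for UnkillableStepProgram
  # (log file location is a placeholder)
  LOG=/var/log/slurm/unkillable-$(hostname -s)-$(date +%s).log
  {
      echo "=== unkillable step reported at $(date) ==="
      env | grep -i '^slurm'   # whatever job info slurmd passes (version-dependent)
      echo "--- D-state processes ---"
      ps -eo pid,ppid,user,stat,wchan:32,cmd | awk 'NR==1 || $4 ~ /^D/'
      echo "--- blocked tasks, kernel view ---"
      echo w > /proc/sysrq-trigger
      dmesg | tail -n 100
  } >> "$LOG" 2>&1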

Thanks a lot!
Adrian
