On 8/6/21 1:56 PM, Diego Zuccato wrote:
We had a similar problem some time ago (slow creation of big core files) and 
solved it by increasing the Slurm timeouts
Oh, I see... well, in principle I should not have core files, and I cannot find any...

to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled core files and restored the timeouts.
And by how much did you increase them? I have
SlurmctldTimeout=300
SlurmdTimeout=300
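
For reference, the timeout that is usually tied to the "Kill task failed" drain is UnkillableStepTimeout (how long slurmd waits for a job step's processes to die after SIGKILL) rather than the two heartbeat timeouts above. A minimal slurm.conf sketch, with a purely illustrative value:

# give slow-to-exit processes (e.g. one dumping a large core) more time
# before slurmd declares the step unkillable and drains the node
UnkillableStepTimeout=300

The value 300 is only an example, not something recommended in this thread.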

Thank you!
Adrian



On 06/08/2021 12:46, Adrian Sevcenco wrote:
On 8/6/21 1:27 PM, Diego Zuccato wrote:
Hi.
Hi!

Might it be due to a timeout (maybe the killed job is creating a core file, or causing heavy swap usage)?
I will have to search for the culprit...
The problem is: why would the node be put in drain just because the kill failed? And how can I control/disable this?
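
In the meantime, once the stuck processes are gone, the drain can be cleared by hand; a sketch using one of the node names from the sinfo output quoted below:

# check the recorded drain reason, then put the node back in service
scontrol show node alien-0-47 | grep -i reason
scontrol update NodeName=alien-0-47 State=RESUME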

Thank you!
Adrian



BYtE,
  Diego

On 06/08/2021 09:02, Adrian Sevcenco wrote:
Having just implemented some triggers, I noticed this:

NODELIST    NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47      1    alien*    draining   48   48:1:1 193324 214030      1 rack-0,4 Kill task failed
alien-0-56      1    alien*     drained   48   48:1:1 193324 214030      1 rack-0,4 Kill task failed

I was wondering: why is a node drained when the killing of a task fails, and how can I disable this? (I use cgroups.)
Moreover, how can the killing of a task fail at all? (This is on Slurm 19.05.)
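
One way to investigate: slurmd drains the node when it cannot kill all of a step's processes within its timeout, which with cgroups usually means something is stuck in uninterruptible (D) state, e.g. on NFS I/O or while dumping a core. A sketch, assuming the slurmd log sits at /var/log/slurmd.log (adjust to your SlurmdLogFile):

# look for unkillable-step / kill-failure messages around the drain time
grep -iE 'unkillable|kill task failed' /var/log/slurmd.log
# any leftover job processes still stuck in D state?
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'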

Thank you!
Adrian


