Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-09 Thread Bjørn-Helge Mevik
Adrian Sevcenco  writes:

> Having just implemented some triggers, I just noticed this:
>
> NODELIST    NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> alien-0-47  1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
> alien-0-56  1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
>
> I was wondering why a node is drained when the killing of a task fails

I guess the heuristic is that something is wrong with the node, so it
should not run more jobs.  For instance, processes stuck in disk wait or a
similar condition that might require a reboot.

> and how can I disable it? (I use cgroups)

I don't know how to disable it, but it can be tuned with:

   UnkillableStepTimeout
          The length of time, in seconds, that Slurm will wait before
          deciding that processes in a job step are unkillable (after
          they have been signaled with SIGKILL) and execute
          UnkillableStepProgram.  The default timeout value is 60
          seconds.  If exceeded, the compute node will be drained to
          prevent future jobs from being scheduled on the node.

(Note though, that according to
https://bugs.schedmd.com/show_bug.cgi?id=11103 it should not be set
higher than 127 s.)
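
For illustration only, a minimal slurm.conf sketch of the two settings together
(the values and the script path are just placeholders, not a recommendation):

    # slurm.conf (excerpt)
    UnkillableStepTimeout=120   # seconds after SIGKILL before the step counts as unkillable
    UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh   # hypothetical reporting script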

You might also want to look at this setting to find out what is going on
on the machine when Slurm cannot kill the job step:

   UnkillableStepProgram
          If the processes in a job step are determined to be unkillable
          for a period of time specified by the UnkillableStepTimeout
          variable, the program specified by UnkillableStepProgram will
          be executed.  By default no program is run.

          See section UNKILLABLE STEP PROGRAM SCRIPT for more
          information.
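
As a rough sketch of such a program (the script name and log path are made up,
and it assumes it runs as root on the node), something that simply records the
stuck processes could look like:

    #!/bin/bash
    # Hypothetical UnkillableStepProgram: record processes stuck in
    # uninterruptible sleep (D state) so they can be inspected later.
    LOG=/var/log/slurm/unkillable-$(hostname -s)-$(date +%Y%m%d-%H%M%S).log
    {
        echo "=== unkillable job step reported on $(hostname) at $(date) ==="
        # PID, owner, state and kernel wait channel of every D-state process
        ps -eo pid,user,stat,wchan:32,args | awk 'NR==1 || $3 ~ /^D/'
        # Ask the kernel to log stack traces of blocked tasks (see the
        # "echo w > /proc/sysrq-trigger" suggestion elsewhere in this thread)
        echo w > /proc/sysrq-trigger
    } >> "$LOG" 2>&1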

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo





Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-08 Thread Christopher Samuel

On 8/7/21 11:47 pm, Adrian Sevcenco wrote:

> Yes, the jobs that are running save some files when they are killed, and that
> saving, depending on the target, can get stuck...
> I have to think of a way to take a snapshot of the processes when this happens.


Slurm does let you request a signal a certain amount of time before the
job is due to end; you could make your job use that to do its checkpoint
in advance of the end of the job, so you don't hit this problem.


Look at the --signal option in "man sbatch".
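
As a minimal sketch (the signal choice, lead time and the checkpoint step are
only placeholders for whatever your jobs actually need):

    #!/bin/bash
    #SBATCH --time=01:00:00
    #SBATCH --signal=B:USR1@300   # deliver SIGUSR1 to the batch shell 300 s before the time limit

    save_state() {
        echo "Got SIGUSR1, saving state before the job ends"
        # hypothetical checkpoint step -- replace with your application's own save logic
        touch checkpoint.done
        exit 0
    }
    trap save_state USR1

    # run the payload in the background so the batch shell can handle the signal
    ./my_application &
    wait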

Best of luck!
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-08 Thread Adrian Sevcenco

Hi!

On 8/8/21 3:19 AM, Chris Samuel wrote:

> On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:
>
>> I was wondering why a node is drained when the killing of a task fails, and
>> how can I disable it? (I use cgroups) Moreover, how can the killing of a
>> task fail? (This is on Slurm 19.05)
>
> Slurm has tried to kill processes, but they refuse to go away. Usually this
> means they're stuck in a device or I/O wait for some reason, so look for
> processes that are in a "D" state on the node.

Yes, the jobs that are running save some files when they are killed, and that
saving, depending on the target, can get stuck...
I have to think of a way to take a snapshot of the processes when this happens.


> As others have said they can be stuck writing out large files and waiting for
> the kernel to complete that before they exit.  This can also happen if you're
> using GPUs and something has gone wrong in the driver and the process is stuck
> in the kernel somewhere.
>
> You can try doing "echo w > /proc/sysrq-trigger" on the node to see if the
> kernel reports tasks stuck and where they are stuck.
>
> If there are tasks stuck in that state then often the only recourse is to
> reboot the node back into health.

Yeah, that would be bad, as is the move to draining... I use batch jobs and I
can have 128 different jobs on a single node. I will see if I can increase
some timeouts.


> You can tell Slurm to run a program on the node should it find itself in this
> state, see:
>
> https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram

Oh, thanks for the hint, I glossed over this when looking over the documentation.
As a first approximation I can make this a tool for reporting what is going on,
and later add some actions.

Thanks a lot!
Adrian



Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-07 Thread Chris Samuel
On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:

> I was wondering why a node is drained when the killing of a task fails, and how
> can I disable it? (I use cgroups) Moreover, how can the killing of a task fail?
> (This is on Slurm 19.05)

Slurm has tried to kill processes, but they refuse to go away. Usually this 
means they're stuck in a device or I/O wait for some reason, so look for 
processes that are in a "D" state on the node.
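
For example (a generic procps one-liner, not something from the original mail):

    # processes in uninterruptible sleep, with the kernel wait channel they are stuck in
    ps -eo pid,user,stat,wchan:32,args | awk 'NR==1 || $3 ~ /^D/'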

As others have said they can be stuck writing out large files and waiting for 
the kernel to complete that before they exit.  This can also happen if you're 
using GPUs and something has gone wrong in the driver and the process is stuck 
in the kernel somewhere.

You can try doing "echo w > /proc/sysrq-trigger" on the node to see if the 
kernel reports tasks stuck and where they are stuck.
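
The resulting stack traces go to the kernel log, so roughly (assuming root on
the node and that sysrq is enabled):

    echo w > /proc/sysrq-trigger   # dump blocked (D state) task traces
    dmesg | tail -n 100            # or: journalctl -k, depending on the distro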

If there are tasks stuck in that state then often the only recourse is to 
reboot the node back into health.

You can tell Slurm to run a program on the node should it find itself in this 
state, see:

https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram

Best of luck,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-07 Thread Adrian Sevcenco

On 8/6/21 6:06 PM, Willy Markuske wrote:

Adrian and Diego,

Hi!

> Are you using AMD Epyc processors when viewing this issue? I've been having
> the same issue but only on dual AMD Epyc systems.

I do have some Epyc nodes, but the CPU proportion is 50%/50% with Broadwell
cores, and I do not see a correlation/preference of the problem for the Epyc
ones.

> I haven't tried changing the core file location from an NFS mount though, so
> perhaps there's an issue writing it out in time.


> How did you disable core files?

To tell the truth, I do not know at this moment :)) I have to search in the
conf files, but I see that:

[root@alien ~]# ulimit -a | grep core
core file size  (blocks, -c) 0

You can either add a file to /etc/security/limits.d/ with:
* hard core 0

and/or:
ulimit -S -c 0
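
A quick way to check what the jobs themselves end up with (just a generic srun
invocation, not from this thread; Slurm normally propagates the submitting
shell's limits to the job):

    srun -N1 -n1 bash -c 'ulimit -c'   # should print 0 once core files are disabled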

HTH,
Adrian




Regards,

Willy Markuske

HPC Systems Engineer



Research Data Services

P: (619) 519-4435

On 8/6/21 6:16 AM, Adrian Sevcenco wrote:

On 8/6/21 3:19 PM, Diego Zuccato wrote:

IIRC we increased SlurmdTimeout to 7200 .

Thanks a lot!

Adrian



On 06/08/2021 13:33, Adrian Sevcenco wrote:

On 8/6/21 1:56 PM, Diego Zuccato wrote:
We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm 
timeouts

Oh, I see... well, in principle I should not have core files, and I do not
find any...

to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I 
disabled core files and restored the timeouts.

And how much did you increase them? I have
SlurmctldTimeout=300
SlurmdTimeout=300

Thank you!
Adrian




On 06/08/2021 12:46, Adrian Sevcenco wrote:

On 8/6/21 1:27 PM, Diego Zuccato wrote:

Hi.

Hi!


Might it be due to a timeout (maybe the killed job is creating a core file, or 
caused heavy swap usage)?

I will have to search for the culprit...
The problem is: why would the node be put in drain because the killing failed,
and how can I control/disable this?

Thank you!
Adrian




BYtE,
  Diego

On 06/08/2021 09:02, Adrian Sevcenco wrote:

Having just implemented some triggers, I just noticed this:

NODELIST    NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47  1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
alien-0-56  1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed

I was wondering why a node is drained when the killing of a task fails, and how
can I disable it? (I use cgroups)
Moreover, how can the killing of a task fail? (This is on Slurm 19.05)

Thank you!
Adrian







--
Adrian Sevcenco, Ph.D.
Institute of Space Science - ISS, Romania
adrian.sevcenco at {cern.ch,spacescience.ro}
--




Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Willy Markuske

Adrian and Diego,

Are you using AMD Epyc processors when viewing this issue? I've been 
having the same issue but only on dual AMD Epyc systems. I haven't tried 
changing the core file location from an NFS mount though so perhaps 
there's an issue writing it out in time.


How did you disable core files?

Regards,

Willy Markuske

HPC Systems Engineer



Research Data Services

P: (619) 519-4435

On 8/6/21 6:16 AM, Adrian Sevcenco wrote:

On 8/6/21 3:19 PM, Diego Zuccato wrote:

IIRC we increased SlurmdTimeout to 7200 .

Thanks a lot!

Adrian



On 06/08/2021 13:33, Adrian Sevcenco wrote:

On 8/6/21 1:56 PM, Diego Zuccato wrote:
We had a similar problem some time ago (slow creation of big core 
files) and solved it by increasing the Slurm timeouts
Oh, I see... well, in principle I should not have core files, and I do not
find any...


to the point that even the slowest core wouldn't trigger it. Then, 
once the need for core files was over, I disabled core files and 
restored the timeouts.

And how much did you increase them? I have
SlurmctldTimeout=300
SlurmdTimeout=300

Thank you!
Adrian




On 06/08/2021 12:46, Adrian Sevcenco wrote:

On 8/6/21 1:27 PM, Diego Zuccato wrote:

Hi.

Hi!

Might it be due to a timeout (maybe the killed job is creating a 
core file, or caused heavy swap usage)?

I will have to search for the culprit...
The problem is: why would the node be put in drain because the killing failed,
and how can I control/disable this?

Thank you!
Adrian




BYtE,
  Diego

On 06/08/2021 09:02, Adrian Sevcenco wrote:

Having just implemented some triggers, I just noticed this:

NODELIST    NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47  1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
alien-0-56  1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed

I was wondering why a node is drained when the killing of a task fails, and how
can I disable it? (I use cgroups)
Moreover, how can the killing of a task fail? (This is on Slurm 19.05)


Thank you!
Adrian






Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Adrian Sevcenco

On 8/6/21 3:19 PM, Diego Zuccato wrote:

IIRC we increased SlurmdTimeout to 7200 .

Thanks a lot!

Adrian



On 06/08/2021 13:33, Adrian Sevcenco wrote:

On 8/6/21 1:56 PM, Diego Zuccato wrote:

We had a similar problem some time ago (slow creation of big core files) and 
solved it by increasing the Slurm timeouts

Oh, I see... well, in principle I should not have core files, and I do not
find any...

to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled 
core files and restored the timeouts.

And how much did you increase them? I have
SlurmctldTimeout=300
SlurmdTimeout=300

Thank you!
Adrian




On 06/08/2021 12:46, Adrian Sevcenco wrote:

On 8/6/21 1:27 PM, Diego Zuccato wrote:

Hi.

Hi!


Might it be due to a timeout (maybe the killed job is creating a core file, or 
caused heavy swap usage)?

I will have to search for the culprit...
The problem is: why would the node be put in drain because the killing failed,
and how can I control/disable this?

Thank you!
Adrian




BYtE,
  Diego

On 06/08/2021 09:02, Adrian Sevcenco wrote:

Having just implemented some triggers, I just noticed this:

NODELIST    NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47  1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
alien-0-56  1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed

I was wondering why a node is drained when the killing of a task fails, and how
can I disable it? (I use cgroups)
Moreover, how can the killing of a task fail? (This is on Slurm 19.05)

Thank you!
Adrian






Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Diego Zuccato

IIRC we increased SlurmdTimeout to 7200 .

On 06/08/2021 13:33, Adrian Sevcenco wrote:

On 8/6/21 1:56 PM, Diego Zuccato wrote:
We had a similar problem some time ago (slow creation of big core 
files) and solved it by increasing the Slurm timeouts
Oh, I see... well, in principle I should not have core files, and I do not
find any...


to the point that even the slowest core wouldn't trigger it. Then, 
once the need for core files was over, I disabled core files and 
restored the timeouts.

And how much did you increase them? I have
SlurmctldTimeout=300
SlurmdTimeout=300

Thank you!
Adrian




On 06/08/2021 12:46, Adrian Sevcenco wrote:

On 8/6/21 1:27 PM, Diego Zuccato wrote:

Hi.

Hi!

Might it be due to a timeout (maybe the killed job is creating a 
core file, or caused heavy swap usage)?

I will have to search for the culprit...
The problem is: why would the node be put in drain because the killing failed,
and how can I control/disable this?

Thank you!
Adrian




BYtE,
  Diego

On 06/08/2021 09:02, Adrian Sevcenco wrote:

Having just implemented some triggers, I just noticed this:

NODELIST    NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47  1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
alien-0-56  1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed

I was wondering why a node is drained when the killing of a task fails, and how
can I disable it? (I use cgroups)
Moreover, how can the killing of a task fail? (This is on Slurm 19.05)

Thank you!
Adrian





--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Adrian Sevcenco

On 8/6/21 1:56 PM, Diego Zuccato wrote:

We had a similar problem some time ago (slow creation of big core files) and 
solved it by increasing the Slurm timeouts

Oh, I see... well, in principle I should not have core files, and I do not
find any...

to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled 
core files and restored the timeouts.

And how much did you increase them? I have
SlurmctldTimeout=300
SlurmdTimeout=300

Thank you!
Adrian




On 06/08/2021 12:46, Adrian Sevcenco wrote:

On 8/6/21 1:27 PM, Diego Zuccato wrote:

Hi.

Hi!


Might it be due to a timeout (maybe the killed job is creating a core file, or 
caused heavy swap usage)?

I will have to search for the culprit...
The problem is: why would the node be put in drain because the killing failed,
and how can I control/disable this?

Thank you!
Adrian




BYtE,
  Diego

On 06/08/2021 09:02, Adrian Sevcenco wrote:

Having just implemented some triggers, I just noticed this:

NODELIST    NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47  1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
alien-0-56  1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed

I was wondering why a node is drained when the killing of a task fails, and how
can I disable it? (I use cgroups)
Moreover, how can the killing of a task fail? (This is on Slurm 19.05)

Thank you!
Adrian






Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Diego Zuccato
We had a similar problem some time ago (slow creation of big core files) 
and solved it by increasing the Slurm timeouts to the point that even 
the slowest core wouldn't trigger it. Then, once the need for core files 
was over, I disabled core files and restored the timeouts.
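
For reference, these are plain slurm.conf settings; a sketch with illustrative
values (the 300 and 7200 figures appear elsewhere in this thread, they are not
a recommendation):

    # slurm.conf (excerpt)
    SlurmctldTimeout=300
    SlurmdTimeout=7200    # raised while large core files were being written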


On 06/08/2021 12:46, Adrian Sevcenco wrote:

On 8/6/21 1:27 PM, Diego Zuccato wrote:

Hi.

Hi!

Might it be due to a timeout (maybe the killed job is creating a core 
file, or caused heavy swap usage)?

I will have to search for the culprit...
The problem is: why would the node be put in drain because the killing failed,
and how can I control/disable this?

Thank you!
Adrian




BYtE,
  Diego

On 06/08/2021 09:02, Adrian Sevcenco wrote:

Having just implemented some triggers, I just noticed this:

NODELIST    NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47  1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
alien-0-56  1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed

I was wondering why a node is drained when the killing of a task fails, and how
can I disable it? (I use cgroups)
Moreover, how can the killing of a task fail? (This is on Slurm 19.05)

Thank you!
Adrian









--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Adrian Sevcenco

On 8/6/21 1:27 PM, Diego Zuccato wrote:

Hi.

Hi!


Might it be due to a timeout (maybe the killed job is creating a core file, or 
caused heavy swap usage)?

I will have to search for the culprit...
The problem is: why would the node be put in drain because the killing failed,
and how can I control/disable this?

Thank you!
Adrian




BYtE,
  Diego

On 06/08/2021 09:02, Adrian Sevcenco wrote:

Having just implemented some triggers, I just noticed this:

NODELIST    NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47  1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
alien-0-56  1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed

I was wondering why a node is drained when the killing of a task fails, and how
can I disable it? (I use cgroups)
Moreover, how can the killing of a task fail? (This is on Slurm 19.05)

Thank you!
Adrian









Re: [slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Diego Zuccato

Hi.

Might it be due to a timeout (maybe the killed job is creating a core 
file, or caused heavy swap usage)?


BYtE,
 Diego

On 06/08/2021 09:02, Adrian Sevcenco wrote:

Having just implemented some triggers, I just noticed this:

NODELIST    NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47  1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
alien-0-56  1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed

I was wondering why a node is drained when the killing of a task fails, and how
can I disable it? (I use cgroups)
Moreover, how can the killing of a task fail? (This is on Slurm 19.05)

Thank you!
Adrian




--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



[slurm-users] draining nodes due to failed killing of task?

2021-08-06 Thread Adrian Sevcenco

Having just implemented some triggers, I just noticed this:

NODELIST    NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47  1     alien*    draining 48   48:1:1 193324 214030   1      rack-0,4 Kill task failed
alien-0-56  1     alien*    drained  48   48:1:1 193324 214030   1      rack-0,4 Kill task failed

I was wondering why a node is drained when the killing of a task fails, and how
can I disable it? (I use cgroups)
Moreover, how can the killing of a task fail? (This is on Slurm 19.05)

Thank you!
Adrian