Re: [slurm-users] draining nodes due to failed killing of task?
Adrian Sevcenco writes:

> Having just implemented some triggers i just noticed this:
>
> NODELIST    NODES PARTITION  STATE    CPUS  S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> alien-0-47      1 alien*     draining   48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
> alien-0-56      1 alien*     drained    48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>
> i was wondering why a node is drained when killing of task fails

I guess the heuristic is that something is wrong with the node, so it should not run more jobs -- for example processes stuck in disk wait or similar conditions that might require a reboot.

> and how can i disable it? (i use cgroups)

I don't know how to disable it, but it can be tuned with:

       UnkillableStepTimeout
              The length of time, in seconds, that Slurm will wait before
              deciding that processes in a job step are unkillable (after
              they have been signaled with SIGKILL) and execute
              UnkillableStepProgram.  The default timeout value is 60
              seconds.  If exceeded, the compute node will be drained to
              prevent future jobs from being scheduled on the node.

(Note, though, that according to https://bugs.schedmd.com/show_bug.cgi?id=11103 it should not be set higher than 127 seconds.)

You might also want to look at this setting to find out what is going on on the machine when Slurm cannot kill the job step:

       UnkillableStepProgram
              If the processes in a job step are determined to be
              unkillable for a period of time specified by the
              UnkillableStepTimeout variable, the program specified by
              UnkillableStepProgram will be executed.  By default no
              program is run.  See section UNKILLABLE STEP PROGRAM SCRIPT
              for more information.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
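For reference, both of these are set in slurm.conf; a minimal sketch (the timeout value and the script path below are made-up examples, not recommendations):

    # slurm.conf -- illustrative values only
    UnkillableStepTimeout=120
    UnkillableStepProgram=/usr/local/sbin/report_unkillable.sh

After changing them the configuration has to be distributed to the nodes and re-read (for instance with "scontrol reconfigure").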
Re: [slurm-users] draining nodes due to failed killing of task?
On 8/7/21 11:47 pm, Adrian Sevcenco wrote:

> yes, the jobs that are running save part of a file if they are killed, and that saving, depending on the target, can get stuck ... i have to think of a way to take a process snapshot when this happens ..

Slurm does let you request a signal a certain amount of time before the job is due to end; you could make your job use that to do the checkpoint in advance of the end of the job, so you don't hit this problem. Look at the --signal option in "man sbatch".

Best of luck!
Chris

-- 
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
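A minimal sketch of that approach (the 300-second lead time, the function name and "my_application" are placeholders to adapt):

    #!/bin/bash
    #SBATCH --time=01:00:00
    #SBATCH --signal=B:USR1@300   # deliver SIGUSR1 to the batch shell 300 s before the time limit

    save_partial_output() {
        # placeholder: do the file saving / checkpoint here, while walltime remains
        echo "caught SIGUSR1, saving partial output"
    }
    trap save_partial_output USR1

    ./my_application &   # hypothetical payload, run in the background so the trap can fire
    wait                 # returns when the signal arrives (or when the payload exits)

The "B:" prefix asks Slurm to signal only the batch shell rather than the job steps, which is what lets the trap in the script run the save logic.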
Re: [slurm-users] draining nodes due to failed killing of task?
Hi!

On 8/8/21 3:19 AM, Chris Samuel wrote:
> On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:
>> i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of a task fail? (this is on slurm 19.05)
>
> Slurm has tried to kill processes, but they refuse to go away. Usually this means they're stuck in a device or I/O wait for some reason, so look for processes that are in a "D" state on the node.

yes, the jobs that are running save part of a file if they are killed, and that saving, depending on the target, can get stuck ... i have to think of a way to take a process snapshot when this happens ..

> As others have said they can be stuck writing out large files and waiting for the kernel to complete that before they exit. This can also happen if you're using GPUs and something has gone wrong in the driver and the process is stuck in the kernel somewhere.
>
> You can try doing "echo w > /proc/sysrq-trigger" on the node to see if the kernel reports tasks stuck and where they are stuck. If there are tasks stuck in that state then often the only recourse is to reboot the node back into health.

yeah, that would be bad, as is also the move to draining .. i use batch jobs and i can have 128 different jobs on a single node .. i will see if i can increase some timeouts

> You can tell Slurm to run a program on the node should it find itself in this state, see:
> https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram

oh, thanks for the hint, i glossed over this when looking through the documentation. As a first approximation i can make this a tool for reporting what is going on, and later add some actions..

Thanks a lot!
Adrian
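A rough starting point for such a report-only tool might look like the sketch below (the log location and the exact commands are assumptions to adapt, not a tested recipe):

    #!/bin/bash
    # Hypothetical UnkillableStepProgram: only record the state of the node
    # when Slurm declares a step unkillable; decide on actions later.
    OUT=/var/log/slurm/unkillable-$(hostname -s)-$(date +%s).log
    {
        date
        echo "== full process snapshot =="
        ps auxww
        echo "== blocked (D-state) tasks via sysrq 'w' =="
        echo w > /proc/sysrq-trigger    # kernel logs stuck tasks and their stacks
        dmesg | tail -n 100
    } > "$OUT" 2>&1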
Re: [slurm-users] draining nodes due to failed killing of task?
On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:

> i was wondering why a node is drained when killing of task fails and how can
> i disable it? (i use cgroups) moreover, how can the killing of a task fail?
> (this is on slurm 19.05)

Slurm has tried to kill processes, but they refuse to go away. Usually this means they're stuck in a device or I/O wait for some reason, so look for processes that are in a "D" state on the node.

As others have said they can be stuck writing out large files and waiting for the kernel to complete that before they exit. This can also happen if you're using GPUs and something has gone wrong in the driver and the process is stuck in the kernel somewhere.

You can try doing "echo w > /proc/sysrq-trigger" on the node to see if the kernel reports tasks stuck and where they are stuck. If there are tasks stuck in that state then often the only recourse is to reboot the node back into health.

You can tell Slurm to run a program on the node should it find itself in this state, see:
https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram

Best of luck,
Chris

-- 
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
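One rough way to look for those on the affected node (run as root; the exact ps columns are just one possible selection):

    # processes in uninterruptible sleep show "D" in the STAT column;
    # WCHAN shows the kernel function they are waiting in
    ps -eo pid,user,stat,wchan:32,cmd | awk 'NR==1 || $3 ~ /^D/'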
Re: [slurm-users] draining nodes due to failed killing of task?
On 8/6/21 6:06 PM, Willy Markuske wrote:
> Adrian and Diego,

Hi!

> Are you using AMD Epyc processors when seeing this issue? I've been having the same issue but only on dual AMD Epyc systems.

i do have some Epyc nodes, but the cpu proportion is 50%/50% with Broadwell cores .. and i do not see a correlation/preference of the problem for the Epyc ones

> I haven't tried changing the core file location from an NFS mount though, so perhaps there's an issue writing it out in time. How did you disable core files?

to tell the truth i do not know at this moment :)) i have to search in the conf files, but i see that:

[root@alien ~]# ulimit -a | grep core
core file size          (blocks, -c) 0

you can either add to /etc/security/limits.d/ a file with:

* hard core 0

and/or:

ulimit -S -c 0

HTH,
Adrian

> Regards,
>
> Willy Markuske
>
> HPC Systems Engineer
> Research Data Services
> P: (619) 519-4435
>
> On 8/6/21 6:16 AM, Adrian Sevcenco wrote:
>> On 8/6/21 3:19 PM, Diego Zuccato wrote:
>>> IIRC we increased SlurmdTimeout to 7200.
>> Thanks a lot!
>> Adrian
>>
>>> On 06/08/2021 13:33, Adrian Sevcenco wrote:
>>>> On 8/6/21 1:56 PM, Diego Zuccato wrote:
>>>>> We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm timeouts
>>>> oh, i see.. well, in principle i should not have core files, and i do not find any...
>>>>> to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled core files and restored the timeouts.
>>>> and how much did you increase them? i have
>>>> SlurmctldTimeout=300
>>>> SlurmdTimeout=300
>>>> Thank you!
>>>> Adrian
>>>>
>>>>> On 06/08/2021 12:46, Adrian Sevcenco wrote:
>>>>>> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>>>>>>> Hi.
>>>>>> Hi!
>>>>>>> Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?
>>>>>> i will have to search for the culprit .. the problem is why would the node be put in drain for the reason of failed killing? and how can i control/disable this?
>>>>>> Thank you!
>>>>>> Adrian
>>>>>>> BYtE,
>>>>>>>  Diego
>>>>>>>
>>>>>>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>>>>>>> Having just implemented some triggers i just noticed this:
>>>>>>>>
>>>>>>>> NODELIST    NODES PARTITION  STATE    CPUS  S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>>>>>>> alien-0-47      1 alien*     draining   48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>>>>>>> alien-0-56      1 alien*     drained    48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>>>>>>>
>>>>>>>> i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of a task fail? (this is on slurm 19.05)
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>> Adrian

-- 
Adrian Sevcenco, Ph.D. | Institute of Space Science - ISS, Romania | adrian.sevcenco at {cern.ch,spacescience.ro}
Re: [slurm-users] draining nodes due to failed killing of task?
Adrian and Diego,

Are you using AMD Epyc processors when seeing this issue? I've been having the same issue but only on dual AMD Epyc systems. I haven't tried changing the core file location from an NFS mount though, so perhaps there's an issue writing it out in time. How did you disable core files?

Regards,

Willy Markuske

HPC Systems Engineer
Research Data Services
P: (619) 519-4435

On 8/6/21 6:16 AM, Adrian Sevcenco wrote:
> On 8/6/21 3:19 PM, Diego Zuccato wrote:
>> IIRC we increased SlurmdTimeout to 7200.
> Thanks a lot!
> Adrian
>
>> On 06/08/2021 13:33, Adrian Sevcenco wrote:
>>> On 8/6/21 1:56 PM, Diego Zuccato wrote:
>>>> We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm timeouts
>>> oh, i see.. well, in principle i should not have core files, and i do not find any...
>>>> to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled core files and restored the timeouts.
>>> and how much did you increase them? i have
>>> SlurmctldTimeout=300
>>> SlurmdTimeout=300
>>> Thank you!
>>> Adrian
>>>
>>>> On 06/08/2021 12:46, Adrian Sevcenco wrote:
>>>>> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>>>>>> Hi.
>>>>> Hi!
>>>>>> Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?
>>>>> i will have to search for the culprit .. the problem is why would the node be put in drain for the reason of failed killing? and how can i control/disable this?
>>>>> Thank you!
>>>>> Adrian
>>>>>> BYtE,
>>>>>>  Diego
>>>>>>
>>>>>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>>>>>> Having just implemented some triggers i just noticed this:
>>>>>>>
>>>>>>> NODELIST    NODES PARTITION  STATE    CPUS  S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>>>>>> alien-0-47      1 alien*     draining   48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>>>>>> alien-0-56      1 alien*     drained    48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>>>>>>
>>>>>>> i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of a task fail? (this is on slurm 19.05)
>>>>>>>
>>>>>>> Thank you!
>>>>>>> Adrian
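On the core-file location point: if cores do have to be written, pointing them at a local filesystem instead of the NFS working directory should at least keep the dump from stalling on the network. A minimal sketch (the target path is hypothetical):

    # /etc/sysctl.d/90-coredump.conf -- write cores to local disk, not the NFS cwd
    kernel.core_pattern = /var/tmp/core.%e.%p

    # apply without a reboot
    sysctl --system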
Re: [slurm-users] draining nodes due to failed killing of task?
On 8/6/21 3:19 PM, Diego Zuccato wrote:
> IIRC we increased SlurmdTimeout to 7200.

Thanks a lot!
Adrian

> On 06/08/2021 13:33, Adrian Sevcenco wrote:
>> On 8/6/21 1:56 PM, Diego Zuccato wrote:
>>> We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm timeouts
>> oh, i see.. well, in principle i should not have core files, and i do not find any...
>>> to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled core files and restored the timeouts.
>> and how much did you increase them? i have
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> Thank you!
>> Adrian
>>
>>> On 06/08/2021 12:46, Adrian Sevcenco wrote:
>>>> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>>>>> Hi.
>>>> Hi!
>>>>> Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?
>>>> i will have to search for the culprit .. the problem is why would the node be put in drain for the reason of failed killing? and how can i control/disable this?
>>>> Thank you!
>>>> Adrian
>>>>> BYtE,
>>>>>  Diego
>>>>>
>>>>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>>>>> Having just implemented some triggers i just noticed this:
>>>>>>
>>>>>> NODELIST    NODES PARTITION  STATE    CPUS  S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>>>>> alien-0-47      1 alien*     draining   48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>>>>> alien-0-56      1 alien*     drained    48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>>>>>
>>>>>> i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of a task fail? (this is on slurm 19.05)
>>>>>>
>>>>>> Thank you!
>>>>>> Adrian
Re: [slurm-users] draining nodes due to failed killing of task?
IIRC we increased SlurmdTimeout to 7200.

On 06/08/2021 13:33, Adrian Sevcenco wrote:
> On 8/6/21 1:56 PM, Diego Zuccato wrote:
>> We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm timeouts
> oh, i see.. well, in principle i should not have core files, and i do not find any...
>> to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled core files and restored the timeouts.
> and how much did you increase them? i have
> SlurmctldTimeout=300
> SlurmdTimeout=300
> Thank you!
> Adrian
>
>> On 06/08/2021 12:46, Adrian Sevcenco wrote:
>>> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>>>> Hi.
>>> Hi!
>>>> Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?
>>> i will have to search for the culprit .. the problem is why would the node be put in drain for the reason of failed killing? and how can i control/disable this?
>>> Thank you!
>>> Adrian
>>>> BYtE,
>>>>  Diego
>>>>
>>>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>>>> Having just implemented some triggers i just noticed this:
>>>>>
>>>>> NODELIST    NODES PARTITION  STATE    CPUS  S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>>>> alien-0-47      1 alien*     draining   48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>>>> alien-0-56      1 alien*     drained    48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>>>>
>>>>> i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of a task fail? (this is on slurm 19.05)
>>>>>
>>>>> Thank you!
>>>>> Adrian

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
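For context, those are plain slurm.conf settings; a sketch along the lines Diego describes (values are illustrative only, and elsewhere in the thread UnkillableStepTimeout is pointed out as the more targeted knob for the kill-task case):

    # slurm.conf -- illustrative values only
    SlurmctldTimeout=300
    SlurmdTimeout=7200   # generous enough that a slow core dump doesn't get the node marked down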
Re: [slurm-users] draining nodes due to failed killing of task?
On 8/6/21 1:56 PM, Diego Zuccato wrote:
> We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm timeouts

oh, i see.. well, in principle i should not have core files, and i do not find any...

> to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled core files and restored the timeouts.

and how much did you increase them? i have

SlurmctldTimeout=300
SlurmdTimeout=300

Thank you!
Adrian

> On 06/08/2021 12:46, Adrian Sevcenco wrote:
>> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>>> Hi.
>> Hi!
>>> Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?
>> i will have to search for the culprit .. the problem is why would the node be put in drain for the reason of failed killing? and how can i control/disable this?
>> Thank you!
>> Adrian
>>> BYtE,
>>>  Diego
>>>
>>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>>> Having just implemented some triggers i just noticed this:
>>>>
>>>> NODELIST    NODES PARTITION  STATE    CPUS  S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>>> alien-0-47      1 alien*     draining   48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>>> alien-0-56      1 alien*     drained    48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>>>
>>>> i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of a task fail? (this is on slurm 19.05)
>>>>
>>>> Thank you!
>>>> Adrian
Re: [slurm-users] draining nodes due to failed killing of task?
We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm timeouts to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled core files and restored the timeouts.

On 06/08/2021 12:46, Adrian Sevcenco wrote:
> On 8/6/21 1:27 PM, Diego Zuccato wrote:
>> Hi.
> Hi!
>> Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?
> i will have to search for the culprit .. the problem is why would the node be put in drain for the reason of failed killing? and how can i control/disable this?
> Thank you!
> Adrian
>> BYtE,
>>  Diego
>>
>> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>>> Having just implemented some triggers i just noticed this:
>>>
>>> NODELIST    NODES PARTITION  STATE    CPUS  S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>>> alien-0-47      1 alien*     draining   48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>> alien-0-56      1 alien*     drained    48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>>
>>> i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of a task fail? (this is on slurm 19.05)
>>>
>>> Thank you!
>>> Adrian

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
Re: [slurm-users] draining nodes due to failed killing of task?
On 8/6/21 1:27 PM, Diego Zuccato wrote:
> Hi.

Hi!

> Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?

i will have to search for the culprit .. the problem is why would the node be put in drain for the reason of failed killing? and how can i control/disable this?

Thank you!
Adrian

> BYtE,
>  Diego
>
> On 06/08/2021 09:02, Adrian Sevcenco wrote:
>> Having just implemented some triggers i just noticed this:
>>
>> NODELIST    NODES PARTITION  STATE    CPUS  S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
>> alien-0-47      1 alien*     draining   48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>> alien-0-56      1 alien*     drained    48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>>
>> i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of a task fail? (this is on slurm 19.05)
>>
>> Thank you!
>> Adrian
Re: [slurm-users] draining nodes due to failed killing of task?
Hi.

Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?

BYtE,
 Diego

On 06/08/2021 09:02, Adrian Sevcenco wrote:
> Having just implemented some triggers i just noticed this:
>
> NODELIST    NODES PARTITION  STATE    CPUS  S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> alien-0-47      1 alien*     draining   48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
> alien-0-56      1 alien*     drained    48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
>
> i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of a task fail? (this is on slurm 19.05)
>
> Thank you!
> Adrian

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
[slurm-users] draining nodes due to failed killing of task?
Having just implemented some triggers i just noticed this:

NODELIST    NODES PARTITION  STATE    CPUS  S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47      1 alien*     draining   48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
alien-0-56      1 alien*     drained    48 48:1:1  193324   214030      1 rack-0,4 Kill task failed

i was wondering why a node is drained when killing of task fails and how can i disable it? (i use cgroups) moreover, how can the killing of a task fail? (this is on slurm 19.05)

Thank you!
Adrian
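For completeness: once the stuck processes are gone or the node has been rebooted, a node drained with "Kill task failed" can be returned to service by hand, for example:

    scontrol update NodeName=alien-0-47 State=RESUME

(node name taken from the listing above; adjust as appropriate).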