As far as I know, Slurm sends SIGTERM on timeout. We use cgroups for memory
enforcement, so a process that insists on exceeding its memory limit is
OOM-killed by the kernel, outside Slurm's control, which shows up as an abort
(ERR). When a task times out, you should only get a SIGKILL at timeout +
KillWait seconds.
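If you want a catchable warning before that sequence, sbatch can deliver one
ahead of the time limit - e.g. something like this (the signal choice and
lead time here are only illustrative):

#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@120  # send SIGUSR1 to the batch shell 120s before the limit

trap 'echo "Nearly out of time - checkpoint now" >&2' USR1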
The scancel documentation claims:
To cancel a job, invoke scancel without --signal option. This will
send first a SIGCONT to all steps to eventually wake them up followed
by a SIGTERM, then wait the KillWait duration defined in the slurm.conf
file and finally if they have not terminated send a SIGKILL. This
gives time for the running job/step(s) to clean up.
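The KillWait grace period is a site-wide setting in slurm.conf - e.g. (30
seconds is the default, if I remember correctly):

KillWait=30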
Another interesting pointer to the way things are working is in the --full
option to scancel:
-f, --full
Signal all steps associated with the job including any batch
step (the shell script plus all of its child processes). By
default, signals other than SIGKILL are not sent to the batch
step. Also see the -b, --batch option
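So if you want the batch shell (and hence your trap) to actually see a signal
from scancel, you have to ask for it explicitly - for example (the job ID is
just a placeholder):

scancel --signal=USR1 --batch 12345  # signal the batch shell only
scancel --signal=TERM --full 12345   # signal the batch step and all its children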
To get your traps to work - basically, try to avoid trapping SIGTERM, even
though it is the exact signal you’re looking for.
If you trap SIGTERM in the batch script, and you’re not launching children via
srun, you won’t end up killing the child processes - they’ll continue to run,
then get SIGKILL-ed later.
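Here's that failure mode in miniature - just a sketch, with my_model standing
in for whatever you actually run:

#!/bin/bash
trap 'echo "Caught SIGTERM" >&2; exit 1' TERM
./my_model &  # plain child process, not launched via srun
wait $!
# On timeout/scancel the shell runs the trap and exits, but nothing here
# forwards the signal to ./my_model, so it runs on until the final SIGKILL.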
We encountered this problem within cylc (https://github.com/cylc/cylc), which
likes to have tasks communicate back before they finish - we always want to
trap failures successfully if possible.
https://github.com/cylc/cylc/pull/1287 has a bit of explanation about the
problem and how we fixed it, but it boils down to - don’t trap SIGTERM!
We ended up with traps like this in our cylc batch scripts (skip the
cylc/CYLC lines, obviously; XCPU is unnecessary, though it's the best signal
for timeouts, and VACATION_SIGNALS is unset for Slurm):
set -u  # Fail when using an undefined variable
FAIL_SIGNALS='EXIT ERR XCPU'

TRAP_FAIL_SIGNAL() {
    typeset SIGNAL=$1
    echo "Received signal $SIGNAL" >&2
    # Disarm all the traps so this handler can't be re-entered.
    typeset S=
    for S in ${VACATION_SIGNALS:-} $FAIL_SIGNALS; do
        trap "" $S
    done
    # Let any in-flight "started" message finish before reporting failure.
    if [[ -n "${CYLC_TASK_MESSAGE_STARTED_PID:-}" ]]; then
        wait "${CYLC_TASK_MESSAGE_STARTED_PID}" 2>/dev/null || true
    fi
    cylc task message -p 'CRITICAL' \
        "Task job script received signal $SIGNAL" 'failed'
    exit 1
}

for S in $FAIL_SIGNALS; do
    trap "TRAP_FAIL_SIGNAL $S" $S
done
unset S
We then run the process, and after it finishes have:
trap '' EXIT
so that we can actually quit the script successfully at the end!
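So the tail of the script ends up something like this (my_model again being a
placeholder for the real work):

./my_model    # if this dies, the ERR trap above fires and reports failure

trap '' EXIT  # disarm the EXIT trap so a normal finish isn't treated as one
exit 0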
Uncaught signals filter down to the running child processes, causing them to
abort; the resulting non-zero exit fires the ERR trap above - so it works
whatever you do.
I think the sleep command has a peculiar interaction with signals, so it may
not be the best command to try.
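If you do want to test with sleep, background it and wait on it - bash won't
run a trap while a foreground command is still running, but wait is
interrupted by the signal straight away:

sleep 3600 &
wait $!  # the trap fires on signal delivery, not after an hour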
Cheers,
Ben
From: Mike Dacre [mailto:[email protected]]
Sent: 26 February 2016 01:36
To: slurm-dev
Subject: [slurm-dev] Kill Signals Sent By SLURM
Hi All,
I am trying to incorporate checkpointing using DMTCP into my SLURM jobs -
specifically, to checkpoint a job when it is killed by SLURM for timeout or
memory overuse (or anything else), so that it can be resubmitted from the
checkpoint later. I have been talking with the DMTCP devs about this here:
https://github.com/dmtcp/dmtcp/issues/324 but I have run into some trouble.
Even using the --signal option to sbatch, I cannot capture the kill signal
sent to the job by SLURM. The script I am using is here:
https://gist.github.com/MikeDacre/10ae23dcd3986793c3fd. Irrespective of whether
I specify --signal with or without the B:, if I allow the job to time out or
kill it with scancel, my trap command is unable to trap the signal.
Do any of you know a better way of trapping exit signals with a slurm script?
Do you by any chance know what signal SLURM sends to jobs when they are killed
by scancel or for time or memory use reasons?
Thanks so much,
Mike