As far as I know, Slurm sends SIGTERM on timeout, and we use cgroups for 
memory enforcement, which will cause an OOM abort (ERR) of the process, 
outside Slurm, when the process insists on exceeding the memory limit. When a 
task times out, you should only get a SIGKILL at timeout + KillWait seconds.
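
If you want a catchable warning before the hard kill, sbatch's --signal
option asks Slurm to deliver a signal of your choosing ahead of the time
limit. A minimal sketch of a job script (the times and program name are
hypothetical placeholders; the B: prefix sends the signal to the batch shell
itself, so a bash trap can see it):

```shell
#!/bin/bash
#SBATCH --time=01:00:00
# Ask Slurm to send USR1 to the batch shell (B:) 120 s before the time limit
#SBATCH --signal=B:USR1@120

checkpoint_and_exit() {
    echo "USR1 received: time limit approaching" >&2
    # checkpointing/cleanup would go here (hypothetical)
    exit 1
}
trap checkpoint_and_exit USR1

# Launch the workload as a step via srun, backgrounded, and use 'wait'
# so the trap can run as soon as USR1 arrives.
srun ./my_program &     # ./my_program is a placeholder
wait $!
```

This is a job-script fragment, so it only does anything useful when submitted
to a Slurm cluster with sbatch.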

scancel documentation claims:

       To cancel a job, invoke scancel without --signal option.  This will
       send first a SIGCONT to all steps to eventually wake them up followed
       by a SIGTERM, then wait the KillWait duration defined in the slurm.conf
       file and finally if they have not terminated send a SIGKILL.  This
       gives time for the running job/step(s) to clean up.

Another interesting pointer to the way things are working is in the --full 
option to scancel:

       -f, --full
              Signal all steps associated with the  job  including  any  batch
              step  (the  shell  script  plus all of its child processes).  By
              default, signals other than SIGKILL are not sent  to  the  batch
              step.  Also see the -b, --batch option


To get your traps to work: basically, try to avoid trapping SIGTERM, even 
though it is the exact signal you're looking for.

If you trap SIGTERM in the batch script, and you’re not launching children via 
srun, you won’t end up killing the child processes - they’ll continue to run, 
then get SIGKILL-ed later.
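
This is easy to reproduce in plain bash, no Slurm required. In this sketch 
(all names hypothetical) the "batch script" traps SIGTERM and exits, but the 
child it spawned just keeps running, orphaned, exactly like non-srun children 
that later get SIGKILL-ed:

```shell
#!/bin/bash
# A stand-in "batch script" that traps TERM: the trap fires, the parent
# exits cleanly, but its child is merely orphaned, not terminated.
parent='
    trap "exit 0" TERM
    sleep 30 >/dev/null 2>&1 &   # stand-in workload (redirected so $() returns)
    echo $!                      # report the child PID to the caller
    kill -TERM $$ &              # simulate Slurm signalling the parent
    wait                         # interrupted by the trap
'
child_pid=$(bash -c "$parent")

survived=no
if kill -0 "$child_pid" 2>/dev/null; then
    survived=yes
    kill "$child_pid" 2>/dev/null   # clean up the orphan
fi
echo "child survived parent's TERM trap: $survived"
```

Under Slurm the orphan would linger until the scheduler sweeps the job's 
processes with SIGKILL.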

We encountered this problem within cylc (https://github.com/cylc/cylc) which 
likes to have tasks communicate back before they finish - we always want to 
successfully trap failures if possible.
https://github.com/cylc/cylc/pull/1287 has a bit of explanation about the 
problem and how we fixed it, but it boils down to - don’t trap SIGTERM!

We ended up having traps like this in our cylc batch scripts (skip the 
cylc/CYLC lines, obviously; XCPU is unnecessary, though it is the best signal 
for timeouts, and VACATION_SIGNALS is unset for Slurm):

set -u  # Fail when using an undefined variable
FAIL_SIGNALS='EXIT ERR XCPU'
TRAP_FAIL_SIGNAL() {
    typeset SIGNAL=$1
    echo "Received signal $SIGNAL" >&2
    typeset S=
    # Disarm all traps so the handler is not re-entered while cleaning up
    for S in ${VACATION_SIGNALS:-} $FAIL_SIGNALS; do
        trap "" $S
    done
    if [[ -n "${CYLC_TASK_MESSAGE_STARTED_PID:-}" ]]; then
        wait "${CYLC_TASK_MESSAGE_STARTED_PID}" 2>/dev/null || true
    fi
    cylc task message -p 'CRITICAL' \
        "Task job script received signal $SIGNAL" 'failed'
    exit 1
}
for S in $FAIL_SIGNALS; do
    trap "TRAP_FAIL_SIGNAL $S" $S
done
unset S

We then run the process, and at the very end of the script have:

trap '' EXIT

so that we can actually quit the script successfully at the end!
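
Why that final disarm matters is easy to demonstrate in plain bash: left 
armed, an EXIT trap fires even on a clean exit 0, so a successful script 
would report failure. A small comparison (nothing Slurm-specific here):

```shell
#!/bin/bash
# EXIT trap left armed: fires even on a successful 'exit 0'
without=$(bash -c 'trap "echo EXIT-trap-fired" EXIT; exit 0')

# EXIT trap disarmed first, as in the batch script pattern: stays silent
with=$(bash -c 'trap "echo EXIT-trap-fired" EXIT; trap "" EXIT; exit 0')

echo "armed:    ${without}"
echo "disarmed: ${with:-<no output>}"
```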

Uncaught signals filter down to the running child processes, causing them to 
abort and trigger the above trap - so it works whatever you do.

I think the sleep command has a peculiar interaction with signals, so it may 
not be the best command to try.

Cheers,

Ben


From: Mike Dacre [mailto:[email protected]]
Sent: 26 February 2016 01:36
To: slurm-dev
Subject: [slurm-dev] Kill Signals Sent By SLURM

Hi All,

I am trying to incorporate checkpointing using DMTCP into my SLURM jobs, 
specifically, to allow the checkpointing of a job when it is killed by SLURM on 
timeout or memory overuse (or anything else), to allow resubmission from the 
checkpoint later. I have been talking with the DMTCP devs about this here: 
https://github.com/dmtcp/dmtcp/issues/324 but I have run into some trouble.

Even using the --signal option to sbatch, I cannot capture the kill signal 
sent to the job by SLURM. The script I am using is here: 
https://gist.github.com/MikeDacre/10ae23dcd3986793c3fd. Irrespective of whether 
I specify --signal with or without the B:, if I allow the job to time out or 
kill it with scancel, my trap command is unable to trap the signal.

Do any of you know a better way of trapping exit signals with a slurm script? 
Do you by any chance know what signal SLURM sends to jobs when they are killed 
by scancel or for time or memory use reasons?

Thanks so much,

Mike
