I have used --signal successfully, though I haven't tested it thoroughly
since our upgrade from 14.03.10 to 15.08.6.  The batch script uses '#SBATCH
--signal=USR1@300' so that the job gets a warning 300 seconds before its
time limit and can exit cleanly.  In my case the program catching the signal
was Python, and the only way I could make it work was to launch the Python
script with srun.  As I understand the docs, that is because --signal
without 'B:' is delivered to the job steps rather than to the batch shell.
The docs also indicate that --signal only fires when a job approaches its
time limit, so I'm not sure there is a way to catch a signal before a job is
killed for excessive memory use.
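
A minimal sketch of one way to use the warning signal, assuming the work can
be broken into steps: rather than exiting inside the handler, the handler
only sets a flag, and the main loop stops at the next step boundary, where
state could be saved.  GracefulStop and run are hypothetical names for
illustration, not anything provided by Slurm.

```python
import os
import signal
import time


class GracefulStop:
    """Remembers that a warning signal arrived (hypothetical helper)."""

    def __init__(self, signum=signal.SIGUSR1):
        self.requested = False
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: only record the request.
        self.requested = True


def run(stop, max_steps, step_seconds=1.0):
    """Do up to max_steps units of work; return how many were completed."""
    done = 0
    while done < max_steps and not stop.requested:
        time.sleep(step_seconds)  # stand-in for one unit of real work
        done += 1
    # A real job would write its checkpoint or partial results here.
    return done


if __name__ == "__main__":
    stop = GracefulStop()
    # Simulate Slurm's warning by signalling ourselves before the loop.
    os.kill(os.getpid(), signal.SIGUSR1)
    time.sleep(0.1)  # give Python a moment to run the handler
    print("steps completed:", run(stop, max_steps=5, step_seconds=0.1))
```

Inside a real job the os.kill line would go away and the signal would come
from Slurm, with the script still started through srun.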

I believe scancel by default terminates the job with SIGTERM, followed by
SIGKILL if it is still running after KillWait, but the signal can be changed
with --signal/-s.
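
Since SIGKILL itself can never be caught, a handler only helps for the
catchable signals.  Here is a rough sketch, assuming Python 3 on Linux, that
registers one handler for several signals and skips any the OS refuses.
install_handlers and note_signal are made-up names for this example.

```python
import os
import signal
import time

caught = []  # names of the signals seen so far


def note_signal(signum, frame):
    # Record the symbolic name, e.g. "SIGTERM".
    caught.append(signal.Signals(signum).name)


def install_handlers(names=("SIGUSR1", "SIGTERM")):
    """Register note_signal for each named signal; return those accepted."""
    accepted = []
    for name in names:
        try:
            signal.signal(getattr(signal, name), note_signal)
        except (OSError, RuntimeError, ValueError):
            # The OS rejects handlers for uncatchable signals such as SIGKILL.
            continue
        accepted.append(name)
    return accepted


if __name__ == "__main__":
    names = ("SIGUSR1", "SIGTERM", "SIGKILL")
    print("handlers installed for:", install_handlers(names))
    os.kill(os.getpid(), signal.SIGTERM)  # stand-in for a signal from Slurm
    time.sleep(0.1)  # let the handler run
    print("caught:", caught)
```

Registering the same handler for SIGTERM as well as SIGUSR1 means the script
gets a chance to react however the catchable signal arrives.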

Python example below.  I sent 'scancel -s 10' to my job (signal 10 is
SIGUSR1 on x86 Linux) and it printed "Signal caught".  The same message is
printed if I let the job run up to its time limit.

$ cat test_signal.py
#!/usr/bin/env python

import signal
import sys
import time

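def signal_exit(signum, frame):
    # Called when SIGUSR1 arrives (from --signal or 'scancel -s').
    print("Signal caught")
    sys.exit(0)


# Register the handler before starting the long-running work.
signal.signal(signal.SIGUSR1, signal_exit)

# Stand-in for the real workload.
time.sleep(300)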

$ cat test_signal.slrm
#!/bin/bash
#SBATCH --signal=USR1@10
#SBATCH --time=00:02:00

# Run under srun; without it the Python process never saw the signal.
srun python "$HOME/test/test_signal.py"

- Trey

On Thu, Feb 25, 2016 at 7:37 PM, Mike Dacre <[email protected]> wrote:

> Hi All,
>
> I am trying to incorporate checkpointing using DMTCP into my SLURM jobs,
> specifically, to allow the checkpointing of a job when it is killed by
> SLURM on timeout or memory overuse (or anything else), to allow
> resubmission from the checkpoint later. I have been talking with the DMTCP
> devs about this here: https://github.com/dmtcp/dmtcp/issues/324 but I
> have run into some trouble.
>
> Even using the --signal command to sbatch, I cannot capture the kill
> signal sent to the job by SLURM. The script I am using is here:
> https://gist.github.com/MikeDacre/10ae23dcd3986793c3fd. Irrespective of
> whether I specify --signal with or without the B:, if I allow the job to
> timeout or kill it with scancel, my trap command is unable to trap the
> signal.
>
> Do any of you know a better way of trapping exit signals with a slurm
> script? Do you by any chance know what signal SLURM sends to jobs when they
> are killed by scancel or for time or memory use reasons?
>
> Thanks so much,
>
> Mike
>
