Hi All,

I am trying to incorporate DMTCP checkpointing into my SLURM jobs.
Specifically, I want a job to be checkpointed when SLURM kills it (on
timeout, memory overuse, or anything else) so that it can be resubmitted
from the checkpoint later. I have been discussing this with the DMTCP
developers at https://github.com/dmtcp/dmtcp/issues/324 but I have run
into some trouble.

Even with the --signal option to sbatch, I cannot capture the kill signal
that SLURM sends to the job. The script I am using is here:
https://gist.github.com/MikeDacre/10ae23dcd3986793c3fd
Whether I specify --signal with or without the B: prefix, if I let the job
time out or kill it with scancel, my trap command never catches the
signal.
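
To make the question concrete, here is a minimal sketch of the kind of trap
pattern I mean. It is not the script from the gist: the handler name
on_signal, the sleep workload, and the self-sent USR1 are hypothetical
stand-ins so that the sketch can be run outside SLURM, and the #SBATCH
values are only examples. It assumes SLURM delivers the requested signal to
the batch shell when --signal=B: is given.

```shell
#!/bin/bash
# Example values only: request SIGUSR1 for the batch shell 60 s before
# the time limit is reached.
#SBATCH --time=00:10:00
#SBATCH --signal=B:USR1@60

caught=""

# Hypothetical handler: a real job would trigger a DMTCP checkpoint here.
on_signal() {
    caught="$1"
    echo "caught SIG$1"
}
trap 'on_signal USR1' USR1
trap 'on_signal TERM' TERM

# Stand-in workload. It runs in the background because the shell only
# runs a trap handler once the current foreground command has finished.
sleep 5 &
work_pid=$!

# Simulate the warning signal so the sketch also runs outside SLURM.
# $$ still expands to the main shell's PID inside the subshell.
( sleep 1; kill -USR1 $$ ) &

# wait returns early (status > 128) when a trapped signal arrives,
# and the handler runs immediately afterwards.
wait "$work_pid" || true

# Clean up the stand-in workload if it is still running.
kill "$work_pid" 2>/dev/null || true
```

Run directly with bash, this prints "caught SIGUSR1" after about a second.
The point of the background job plus wait is that a handler is not run
while the shell is blocked on a foreground command.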

Does anyone know a better way of trapping exit signals in a SLURM batch
script? Also, do you know which signal SLURM sends to a job when it is
killed by scancel or for exceeding its time or memory limits?

Thanks so much,

Mike
