Hi all,

I am trying to add DMTCP checkpointing to my SLURM jobs. Specifically, I want a job to be checkpointed when SLURM kills it (on timeout, memory overuse, or for any other reason) so that I can resubmit it from the checkpoint later. I have been discussing this with the DMTCP developers at https://github.com/dmtcp/dmtcp/issues/324, but I have run into some trouble.
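
To make the intent concrete, here is a minimal sketch of the pattern I am aiming for; it is not my actual script. It assumes sbatch's --signal=B:USR1@60 form (signal the batch shell 60 seconds before the time limit). The handler name on_signal, the sleep workload, and the self-sent signal that stands in for SLURM are all hypothetical. The workload runs in the background because the shell runs a trap handler only after the current foreground command returns, whereas wait is interrupted as soon as a trapped signal arrives.

```shell
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --signal=B:USR1@60   # assumption: SIGUSR1 to the batch shell, 60 s before the limit

caught=0

# Hypothetical handler: in a real job, the DMTCP checkpoint request would go here.
on_signal() {
    caught=1
    echo "signal caught; checkpoint would be requested here"
}
trap on_signal USR1 TERM

# Placeholder workload, run in the background so the shell stays free to handle signals.
sleep 10 &
work_pid=$!

# Demo only: stand in for SLURM by signalling this shell after one second.
( sleep 1; kill -USR1 $$ ) &

# wait returns early (status > 128) when a trapped signal arrives,
# and the handler runs immediately afterwards.
status=0
wait "$work_pid" || status=$?

# Clean up the placeholder workload if it is still running.
kill "$work_pid" 2>/dev/null || true
echo "wait returned $status, caught=$caught"
```

Run on its own with bash, the handler should fire about a second in, without waiting for the sleep to finish. In a real job, the echo in on_signal would be replaced by the checkpoint request.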
Even with the --signal option to sbatch, I cannot capture the kill signal that SLURM sends to the job. The script I am using is here: https://gist.github.com/MikeDacre/10ae23dcd3986793c3fd. Whether I specify --signal with or without the B: prefix, if I let the job time out or kill it with scancel, my trap command never catches the signal.

Does anyone know a better way to trap exit signals in a SLURM script? And do you know which signal SLURM sends to a job when it is killed by scancel, or for exceeding its time or memory limits?

Thanks so much,
Mike
