Hi Mike,

We're just starting to look at preemption, so I can't help you there (just
yet!).


However, I was testing signals sent by Slurm a while ago. I'm pretty sure
it's SIGTERM. I've attached a silly do-nothing program that I was using.


Tim


________________________________
From: Mike Dacre <[email protected]>
Sent: Thursday, February 25, 2016 6:37 PM
To: slurm-dev
Subject: [slurm-dev] Kill Signals Sent By SLURM

Hi All,

I am trying to incorporate checkpointing with DMTCP into my SLURM jobs,
specifically to checkpoint a job when SLURM kills it for exceeding its time
or memory limit (or for any other reason), so that it can be resubmitted from
the checkpoint later. I have been discussing this with the DMTCP devs here:
https://github.com/dmtcp/dmtcp/issues/324, but I have run into some trouble.

Even using the --signal option to sbatch, I cannot catch the kill signal that
SLURM sends to the job. The script I am using is here:
https://gist.github.com/MikeDacre/10ae23dcd3986793c3fd. Whether or not I
prefix the --signal value with B:, my trap command does not catch the signal
when the job times out or when I kill it with scancel.

Do any of you know a better way of trapping exit signals in a SLURM batch
script? And do you by any chance know which signal SLURM sends to jobs when
they are killed by scancel or for exceeding their time or memory limits?

Thanks so much,

Mike
/*
 * Copyright (C) 2014  Timothy Brown
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program.  If not, see <http://www.gnu.org/licenses/>.
 *
 */

/*
 * Slurm sends two SIGTERMs to tell a process to shut down.
 * They are at least 30 s apart, giving your process about a minute
 * before Slurm sends SIGKILL.
 *
 * This is a typical example of how to catch SIGTERM and start a
 * safe shutdown.
 */

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <err.h>
#include <sysexits.h>

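/*
 * The handler only records the signal: exit(), warnx() and stdio are
 * not async-signal-safe, so the actual shutdown happens in main().
 * (It is not named shutdown() to avoid clashing with shutdown(2).)
 */
static volatile sig_atomic_t got_signal = 0;

static void handle_sigterm(int, siginfo_t *, void *);

int
main(void)
{
        struct sigaction action = {0};

        action.sa_sigaction = &handle_sigterm;
        action.sa_flags = SA_SIGINFO;
        sigemptyset(&action.sa_mask);
        if (sigaction(SIGTERM, &action, NULL) == -1) {
                err(EX_SOFTWARE, "Unable to install SIGTERM action");
        }

        /* sleep() returns early when a signal is caught. */
        while (!got_signal) {
                sleep(5);
        }

        warnx("WARNING: Received SIGTERM (%d)", (int)got_signal);
        fflush(NULL);
        /* Do anything else that is needed before exiting */

        return(EXIT_SUCCESS);
}

static void
handle_sigterm(int sig, siginfo_t *siginfo, void *context)
{
        (void)sig;
        (void)context;
        got_signal = siginfo->si_signo;
}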
