Any non-zero exit code from the prolog or epilog triggers the behavior described below. Add "exit 0" to the end of the script if you want to be certain of preventing this.
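
As a rough illustration, an epilog along these lines would log a failed file move without returning a non-zero status to slurmd. This is only a sketch under assumptions: the /tmp/work and /tmp/archive paths, the EPILOG_SRC and EPILOG_DST overrides, and the function names are hypothetical and not taken from your setup. SLURM_JOB_ID is the job ID variable Slurm provides to the epilog.

```shell
#!/bin/sh
# Sketch of an epilog that moves job files but never reports failure to slurmd.
# The paths and the EPILOG_SRC/EPILOG_DST overrides are hypothetical.
SRC="${EPILOG_SRC:-/tmp/work/${SLURM_JOB_ID:-unknown}}"
DST="${EPILOG_DST:-/tmp/archive/${SLURM_JOB_ID:-unknown}}"

move_job_files() {
    # $1 = source directory, $2 = destination directory
    [ -d "$1" ] || { echo "epilog: no such directory: $1" >&2; return 1; }
    mkdir -p "$2" || return 1
    for f in "$1"/*; do
        [ -e "$f" ] || continue    # empty directory: the glob matched nothing
        mv "$f" "$2"/ || return 1  # stop at the first failed move
    done
    return 0
}

epilog_main() {
    # Log the failure for later inspection, but swallow the error status.
    if ! move_job_files "$SRC" "$DST"; then
        echo "epilog: moving files for job ${SLURM_JOB_ID:-unknown} failed" >&2
    fi
    return 0
}

# The script's exit status is that of its last command, and epilog_main
# always returns 0, so this has the same effect as ending with "exit 0".
epilog_main
```

The trade-off is that a genuinely broken move no longer drains the node, so the message written to stderr is the only trace of the failure. Ending the existing script with a literal "exit 0" is the simplest variant of the same idea.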

Quoting Barbara Krasovec <[email protected]>:

>
> Hello!
>
> I have a problem on some worker nodes. When epilog script fails with 255
> error for one job, the node goes to state Drain and all the jobs on the
> node are cancelled. Epilog script only contains moving files from
> working directory to another location. Apparently slurmctld assumes that
> the node is awol and cleans it.
>
> Nothing really useful in the logs..
>
> Example:
>
> Slurmd logs:
> [2012-12-03T17:27:46] error: /etc/slurm/epilog: exited with status 0x0100
> [2012-12-03T17:27:46] error: [job 2470167] epilog failed status=1:0
> [2012-12-03T17:28:13] [2466747] *** JOB 2466747 CANCELLED AT
> 2012-12-03T17:28:13 DUE TO NODE FAILURE ***
> [2012-12-03T17:28:19] [2466747] sending REQUEST_COMPLETE_BATCH_SCRIPT,
> error:0
> [2012-12-03T17:29:05] [2466747] done with job
> [2012-12-03T17:29:05] epilog for job 2466747 ran for 33 seconds
>
> Logs from slurmctld:
> [2012-12-03T17:27:46] Killing job_id 2466747 on failed node hostname
> [2012-12-03T17:29:05] completing job 2466747
> [2012-12-03T17:29:05] _slurm_rpc_complete_batch_script JobId=2466747:
> Job/step already completing or completed
>
> Any help would be appreciated.
> Thanks,
> Barbara
>
