Any non-zero exit code from the prolog or epilog triggers the behavior
described below. Add "exit 0" to the end of the script if you want to be
certain of preventing this.
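
A minimal sketch of what such an epilog could look like, assuming it only
moves files as described in the quoted message. The directory paths, the
SRC_DIR/DEST_DIR variables and the run_epilog name are hypothetical
placeholders, not taken from the original script. The body is wrapped in a
subshell function only so it can be tried outside Slurm; in a real epilog
the "exit 0" would simply be the last line of the script.

```shell
#!/bin/sh
# Hypothetical epilog sketch; all paths and names here are placeholders.

run_epilog() (
    # Assumed locations, overridable through the environment for testing.
    src="${SRC_DIR:-/tmp/example_job_workdir}"
    dest="${DEST_DIR:-/tmp/example_job_archive}"

    # Try to move the files; on any failure, log to stderr instead of
    # letting the non-zero status become the exit code of the epilog.
    { [ -d "$src" ] && mkdir -p "$dest" && mv "$src"/* "$dest"/; } \
        || echo "epilog: could not move files from $src" >&2

    # Always report success, whatever happened above.
    exit 0
)

run_epilog
```

The trade-off is that a failed move no longer drains the node, so the
stderr message is the only trace of the problem.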

Quoting Barbara Krasovec <[email protected]>:

>
> Hello!
>
> I have a problem on some worker nodes. When the epilog script fails with
> a 255 error for one job, the node goes to state Drain and all the jobs on
> the node are cancelled. The epilog script only moves files from the
> working directory to another location. Apparently slurmctld assumes that
> the node is AWOL and cleans it.
>
> Nothing really useful in the logs.
>
> Example:
>
> Slurmd logs:
> [2012-12-03T17:27:46] error: /etc/slurm/epilog: exited with status 0x0100
> [2012-12-03T17:27:46] error: [job 2470167] epilog failed status=1:0
> [2012-12-03T17:28:13] [2466747] *** JOB 2466747 CANCELLED AT
> 2012-12-03T17:28:13 DUE TO NODE FAILURE ***
> [2012-12-03T17:28:19] [2466747] sending REQUEST_COMPLETE_BATCH_SCRIPT,
> error:0
> [2012-12-03T17:29:05] [2466747] done with job
> [2012-12-03T17:29:05] epilog for job 2466747 ran for 33 seconds
>
> Logs from slurmctld:
> [2012-12-03T17:27:46] Killing job_id 2466747 on failed node hostname
> [2012-12-03T17:29:05] completing job 2466747
> [2012-12-03T17:29:05] _slurm_rpc_complete_batch_script JobId=2466747:
> Job/step already completing or completed
>
> Any help would be appreciated.
> Thanks,
> Barbara
>
