On 8/7/21 11:47 pm, Adrian Sevcenco wrote:

Yes, the running jobs save some files when they are killed, and
depending on the target that save can get stuck ...
I have to think of a way to take a snapshot of the processes when this happens ...

Slurm does let you request a signal a certain amount of time before the job is due to end. You could have your job use that signal to write its checkpoint ahead of the end of the job, so you don't hit this problem.

Look at the --signal option in "man sbatch".
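
As a rough sketch of what that could look like (not tested against your setup), here is a batch script written in Python that asks for SIGUSR1 about 300 seconds before the time limit and saves its state when the signal arrives. The time limit, the 300 second margin, the checkpoint.json file name and the run() work loop are all made up for illustration. The "B:" prefix is there because the script itself is the batch shell, which by default is not signalled (only job steps are); if your real work runs under srun you would probably drop it.

```python
#!/usr/bin/env python3
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@300

# Assumptions: the time limit and the 300 s margin above are illustrative.
# "B:" asks Slurm to signal the batch script itself rather than the job steps.
import json
import os
import signal
import time

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical file name
stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag, the main loop does the saving."""
    global stop_requested
    stop_requested = True


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write to a temporary file and rename it, so that a kill in the
    middle of the write cannot leave a truncated checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)


def run(total_steps, path=CHECKPOINT_PATH, step_seconds=0.0):
    """Placeholder work loop: stops early once SIGUSR1 has been received,
    saves a checkpoint, and returns the number of completed steps."""
    global stop_requested
    stop_requested = False
    signal.signal(signal.SIGUSR1, request_stop)
    step = 0
    while step < total_steps and not stop_requested:
        time.sleep(step_seconds)  # stand-in for one unit of real work
        step += 1
    save_checkpoint({"step": step}, path)
    return step


if __name__ == "__main__":
    run(total_steps=1000, step_seconds=1.0)
```

The handler only sets a flag, so the save happens at a point where the job's state is consistent. Pick the margin generously compared with how long a save normally takes, since the man page notes the signal timing is only approximate.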

Best of luck!
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
