Hello, Happy Victoria Day from Canada!
I am part of a team working on a few clusters with Compute Canada, and I am trying to write a template for SLURM jobs with DMTCP checkpointing. Everything seems to be going well except for two issues that I can't find covered in the FAQ section or the Compute Canada documentation.

1. ./dmtcp_restart_script.sh not generated: As the FAQ section mentions, DMTCP should generate a dmtcp_restart_script.sh, which is better for safety and housekeeping. Unfortunately, I can't find the right options to make it generate that file. I have tried adding --new-coordinator as an option, but it still doesn't work. Additionally, DMTCP only generates a ckpt_...._.dmtcp.temp file instead of a .dmtcp file.

2. Segmentation fault (core dumped): I am not sure whether this is related to the first issue, but when my Python script times out (on purpose), the program raises a segmentation fault.

I have attached my shell script to this email, along with some output from the SLURM jobs I have submitted. Please let me know if there is anything that I am missing.

Best,
David
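To make the first issue easier to report, the state of the checkpoint output can be summarized with a small helper. This is only a sketch: the function name report_checkpoint_state is hypothetical, and it assumes that finished images are named ckpt_*.dmtcp, that unfinished ones are named ckpt_*.dmtcp.temp (the pattern observed above), and that all of them sit in one directory next to dmtcp_restart_script.sh.

```shell
#!/bin/bash
# Hypothetical helper (not part of DMTCP): summarize checkpoint output in a directory.
# Assumption: finished images match ckpt_*.dmtcp, unfinished ones ckpt_*.dmtcp.temp.
report_checkpoint_state() {
    local dir="${1:-.}"
    local complete=0 partial=0 script_state="missing" f

    # Count finished checkpoint images.
    for f in "$dir"/ckpt_*.dmtcp; do
        if [ -e "$f" ]; then complete=$((complete + 1)); fi
    done

    # Count images that were still being written.
    for f in "$dir"/ckpt_*.dmtcp.temp; do
        if [ -e "$f" ]; then partial=$((partial + 1)); fi
    done

    # Check whether the restart script exists.
    if [ -e "$dir/dmtcp_restart_script.sh" ]; then
        script_state="present"
    fi

    echo "restart_script=$script_state complete=$complete partial=$partial"
}
```

Calling report_checkpoint_state with the job's working directory prints one summary line. A result with only partial images and no restart script would match the behaviour described in issue 1.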
#!/bin/bash
#SBATCH --account=def-aghuang
#SBATCH --cpus-per-task=1
#SBATCH --time=00:04:00
#SBATCH --mem-per-cpu=2056M
#SBATCH --job-name=lets_try_resub
#SBATCH --output=%x-%j.out
### Script Control------------------------------------------------------------------------------------------------------
# Specifies the maximum number of times this job can be resubmitted. Avoid using a huge number.
job_resubmission_limit=10
# Specifies how often DMTCP writes a checkpoint file. The number is in seconds.
check_point_interval=60
# Specifies the path to the Python script, relative to the folder containing this shell script.
python_file_location="./python_script_template.py"
### End of Script Control-----------------------------------------------------------------------------------------------
### Create a local virtual environment and load all necessary packages
module load python/3.7
module load scipy-stack
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
# pip install --upgrade pip
# pip install --no-index -r requirements.txt
echo "Current working directory: $(pwd)"
echo "Starting run at: $(date)"
### If this job didn't finish, then restart the job using DMTCP script, otherwise start the job with DMTCP.
if test -e "num_times_script_has_resumed.tmp"; then
    echo "Resuming a previous run"
    ./dmtcp_restart_script.sh -h "$(hostname)"
else
    echo "Starting a new run"
    echo 0 > num_times_script_has_resumed.tmp
    dmtcp_launch --rm -i "${check_point_interval}" python "${python_file_location}"
fi
# Save the exit status of the DMTCP command now; every later command overwrites $?.
job_exit_code=$?
### Resubmit if not all work has been completed yet and we haven't hit the job resubmission limit
script_resumed_times=$(($(< num_times_script_has_resumed.tmp)+1))
if test -e "unique_flag_script_running.tmp"; then
    if ((script_resumed_times <= job_resubmission_limit)); then
        echo "Resubmitting Job Attempt #$script_resumed_times"
        echo "${script_resumed_times}" > num_times_script_has_resumed.tmp
        sbatch "${BASH_SOURCE[0]}"
    else
        echo "FAILED: Job Resubmission Limit Reached! Work Incomplete"
    fi
else
    # The counter was incremented above for the next attempt, so subtract one here.
    echo "Work Completed after $((script_resumed_times - 1)) Resubmissions"
    rm num_times_script_has_resumed.tmp
fi
echo "Job finished with exit code $job_exit_code at: $(date)"
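The resubmission check at the end of the script can be tried out without SLURM by isolating it in a function. The following is a sketch under stated assumptions: decide_next_step is a hypothetical name, it prints the decision instead of calling sbatch, and it assumes that the flag file (unique_flag_script_running.tmp in the script) exists only while work remains, which the Python script, not shown here, would have to maintain.

```shell
#!/bin/bash
# Hypothetical helper mirroring the resubmission check in the job script.
# Prints "resubmit", "limit_reached", or "done"; it never calls sbatch itself.
decide_next_step() {
    local counter_file="$1"   # file holding the number of resubmissions so far
    local running_flag="$2"   # assumed to exist only while work remains
    local limit="$3"          # maximum number of resubmissions
    local attempts

    attempts=$(( $(cat "$counter_file") + 1 ))

    if [ -e "$running_flag" ]; then
        if [ "$attempts" -le "$limit" ]; then
            # Record the new attempt before the caller resubmits.
            echo "$attempts" > "$counter_file"
            echo "resubmit"
        else
            echo "limit_reached"
        fi
    else
        # Work finished: remove the counter so the next run starts fresh.
        rm -f "$counter_file"
        echo "done"
    fi
}
```

In the job script, the printed value could then drive a case statement that calls sbatch only when the result is "resubmit".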
Attachments: lets_try_resub-14980219.out, lets_try_resub-14980349.out
