I also tested this feature with a Perl script and a binary (also a simple counter), but I get a similar error message when restarting the job: 'No DMTCP environment or bad ID values: ID=0, IDS=. Exit.' The job in slurm then aborts. Restarting the job outside of slurm with these examples works fine.
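One way to narrow this down could be to compare the environment seen inside the slurm restart job with the environment of the shell where the restart works. The following is only a minimal sketch: the function name `dump_ckpt_env` is hypothetical, and filtering on the `DMTCP_` and `SLURM_` prefixes is an assumption, since the error message does not say which variables are actually checked.

```shell
#!/bin/bash
# Hypothetical helper (not part of DMTCP or slurm): print every DMTCP_* and
# SLURM_* environment variable visible to the current shell, sorted by name.
# The prefix filter is an assumption about which variables are relevant.
dump_ckpt_env() {
    local vars
    vars=$(env | grep -E '^(DMTCP|SLURM)_' | sort)
    if [ -z "$vars" ]; then
        echo "no DMTCP_* or SLURM_* variables set"
    else
        echo "$vars"
    fi
}
```

Calling `dump_ckpt_env` in the restart job script just before the `dmtcp_restart_script.sh` line, and once more in the interactive shell where the restart succeeds, would give two lists that can be compared with `diff`.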
I also set up and tried dmtcp with the older slurm v17.11.2, but I get the same results. What is the right environment for using dmtcp with slurm?

Best regards
Werner

On 7/11/19 11:28 AM, Werner Hack wrote:
Hi all,

I want to use dmtcp with slurm jobs. I have set up a small cluster for testing with dmtcp v2.5.2 and slurm v19.05.0. For starting and restarting jobs I use the sample scripts 'slurm_launch.job' and 'slurm_rstr.job' in ./dmtcp-2.5.2/plugin/batch-queue/job_examples.

Command in script to launch:

    dmtcp_launch ./counter

Command in script to restart:

    /bin/bash ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT

For testing I use this simple shell script (./counter):

    #!/bin/bash
    i=0
    while true; do
        echo $i
        sleep 1
        let i++
    done

When the job is running in slurm, I create a checkpoint manually with the script dmtcp_command.jobid. Restarting the job without slurm works fine. When I restart the job as a batch job in slurm, I get the following error:

    Restart error: Cannot map initial resources into the restart allocation

Any ideas? Is there anything I am doing wrong?

Best regards
Werner

_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
