I also tested this feature with a Perl script and a binary (each also a simple
counter).
But I get a similar error message when restarting the job:
'No DMTCP environment or bad ID values: ID=0, IDS=. Exit.'
The job in Slurm then aborts.
Restarting the job outside of Slurm works fine with these examples.
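
Since the restart works outside Slurm but fails inside it, one thing worth
comparing is the environment each run actually sees. The following is only a
sketch of such a check, not something taken from DMTCP or Slurm:
dump_job_env is a hypothetical helper name, and the SLURM_ and DMTCP_ prefixes
are an assumption about which variables matter here.

```shell
#!/bin/bash
# Hypothetical debugging helper (not part of DMTCP or Slurm):
# print every exported variable whose name starts with SLURM_ or DMTCP_,
# sorted, so the output of two runs can be compared with diff.
dump_job_env() {
  env | grep -E '^(SLURM|DMTCP)_' | sort
}
```

Calling dump_job_env once in the batch restart script and once in the
interactive shell, each redirected to its own file, would show which
variables differ between the two cases.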

I also set up and tried DMTCP with the older Slurm v17.11.2, but I get the same
results.
What is the right environment for using DMTCP with Slurm?

Best regards
Werner


On 7/11/19 11:28 AM, Werner Hack wrote:
Hi all,

I want to use DMTCP with Slurm jobs.
I have set up a small test cluster with DMTCP v2.5.2 and Slurm v19.05.0.
For starting and restarting jobs I use the sample scripts 'slurm_launch.job'
and 'slurm_rstr.job' in ./dmtcp-2.5.2/plugin/batch-queue/job_examples.

Command in script to launch:
dmtcp_launch ./counter

Command in script to restart:
/bin/bash ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
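
A minimal sketch of a guard that could precede this restart command, assuming
the two coordinator variables are expected to be set by the job script before
the restart runs. check_coord_env is a hypothetical helper, not part of DMTCP,
and it only tests that the variables are non-empty.

```shell
#!/bin/bash
# Hypothetical helper (not part of DMTCP): verify that the coordinator
# variables passed to the restart command are set and non-empty.
check_coord_env() {
  rc=0
  if [ -z "${DMTCP_COORD_HOST:-}" ]; then
    echo "DMTCP_COORD_HOST is not set" >&2
    rc=1
  fi
  if [ -z "${DMTCP_COORD_PORT:-}" ]; then
    echo "DMTCP_COORD_PORT is not set" >&2
    rc=1
  fi
  return "$rc"
}

# Intended use in the restart job script (commented out so that this
# sketch has no side effects on its own):
# check_coord_env || exit 1
# /bin/bash ./dmtcp_restart_script.sh -h "$DMTCP_COORD_HOST" -p "$DMTCP_COORD_PORT"
```

If either variable is empty, the job stops with a message naming it instead
of handing an empty host or port to the restart script.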

For testing I use this simple shell script (./counter):

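#!/bin/bash
# Simple counter: print an increasing integer once per second, forever.
i=0
while true; do
  echo "$i"
  sleep 1
  i=$((i + 1))  # arithmetic expansion instead of 'let i++'
done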

While the job is running in Slurm, I create a checkpoint manually with the
script dmtcp_command.jobid.
Restarting the job without Slurm works fine.
When I restart the job as a batch job in Slurm, I get the following error:

Restart error: Cannot map initial resources into the restart allocation

Any ideas?
Is there anything I am doing wrong?

Best regards
Werner
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
