Hi,
I have been testing DMTCP for checkpoint/restart (c/r) on a cluster. When I
test manually, it runs fine: jobs restart as expected, and it is generally
impressive.
When I try to integrate it with Slurm, I run into issues when restarting a
process. It seems to be a cgroup issue. We have Slurm create a cgroup for each
job, so the process opens files within that cgroup. When a job restarts, it
looks for those files in the previous job's cgroup. What is the best way to
map them to the new cgroup created after resubmitting the job?
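
To check which open files of a process point into a cgroup before
checkpointing, something like the following minimal Python sketch could be
used. It only reads /proc and is not part of DMTCP or Slurm; the helper names
and the /sys/fs/cgroup mount point are assumptions for illustration.

```python
import os

# Assumed cgroup mount point; adjust if the hierarchy is mounted elsewhere.
CGROUP_ROOT = "/sys/fs/cgroup"


def is_cgroup_path(path, root=CGROUP_ROOT):
    """Return True if path is the cgroup root or lies underneath it."""
    return path == root or path.startswith(root + "/")


def cgroup_fds(pid):
    """List (fd, target) pairs for open files of pid that live in a cgroup.

    Returns an empty list if /proc/<pid>/fd cannot be read.
    """
    fd_dir = f"/proc/{pid}/fd"
    found = []
    try:
        names = os.listdir(fd_dir)
    except OSError:
        return found
    for name in names:
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were scanning
        if is_cgroup_path(target):
            found.append((int(name), target))
    return sorted(found)


if __name__ == "__main__":
    # Inspect the current process; pass the job's PID in real use.
    for fd, target in cgroup_fds(os.getpid()):
        print(fd, target)
```

Any descriptor this reports is a candidate for the restart failure, because
its path is tied to the job that took the checkpoint.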
Here is the error when trying to restart from a checkpoint file:
[naveed@hpc-90-21 cp]$ dmtcp_restart --interval 120 --new-coordinator ckpt_openssl_2ad5fc20c8bb8d9-40000-7ffcf636c0aab.dmtcp
[40000] ERROR at fileconnection.cpp:737 in refill;
REASON='JASSERT(jalib::Filesystem::FileExists(_path)) failed'
_path =
/sys/fs/cgroup/blkio,cpuacct,memory,freezer/slurm/uid_8688/job_509451/step_0/memory.oom_control
Message: File not found.
openssl (40000): Terminating...
(this is just a test with openssl speed)
The original job was 509451 and it was restarted with job ID 509452, so it has
a different cgroup.
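
The mapping in question amounts to rewriting the job component of the saved
path. A minimal Python sketch of that string rewrite is below. It is purely
illustrative: the function name is hypothetical, it assumes the
.../job_<id>/... layout seen in the error, and it does not show how (or
whether) DMTCP can be told to apply such a rewrite at restart.

```python
import re

# Matches the "/job_<digits>" path component used in the Slurm cgroup layout
# shown in the error message (assumed layout).
JOB_COMPONENT = re.compile(r"/job_\d+(?=/|$)")


def remap_cgroup_path(path, new_job_id):
    """Replace the first job_<id> component of path with job_<new_job_id>.

    Paths without a job_<id> component are returned unchanged.
    """
    return JOB_COMPONENT.sub(f"/job_{new_job_id}", path, count=1)


if __name__ == "__main__":
    old = ("/sys/fs/cgroup/blkio,cpuacct,memory,freezer/slurm/"
           "uid_8688/job_509451/step_0/memory.oom_control")
    print(remap_cgroup_path(old, 509452))
```

The step component (step_0 here) may also differ after resubmission, so a
real mapping would likely need to handle that as well.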
I would imagine this is a solved problem, given the Slurm integration work,
and I am just missing something. What do others do in this situation?
Naveed
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum