Hi Naveed,

I'm not sure if our current slurm integration provides support for cgroups.
I've cc'd Artem to confirm.

Having said that, I think we'd need to create a new DMTCP plugin to handle
cgroup changes. If the application is just trying to access some file in
the cgroup hierarchy, we can potentially use the pathvirt plugin to do path
translations for us during restart.
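
To illustrate the kind of path translation meant here, below is a minimal sketch in Python. It is not the pathvirt plugin's code or interface. The helper name `translate_path` is hypothetical, and the new cgroup path is an assumption: it supposes the restarted job (509452) gets a cgroup with the same layout as the original one (509451).

```python
def translate_path(path, old_prefix, new_prefix):
    """Rewrite `path` if it lies under `old_prefix`; otherwise return it unchanged."""
    old = old_prefix.rstrip("/")
    new = new_prefix.rstrip("/")
    # Match only on whole path components, so job_509451 does not match job_5094510.
    if path == old or path.startswith(old + "/"):
        return new + path[len(old):]
    return path


# Old prefix taken from the error message; new prefix is an assumed layout.
OLD = "/sys/fs/cgroup/blkio,cpuacct,memory,freezer/slurm/uid_8688/job_509451"
NEW = "/sys/fs/cgroup/blkio,cpuacct,memory,freezer/slurm/uid_8688/job_509452"

if __name__ == "__main__":
    print(translate_path(OLD + "/step_0/memory.oom_control", OLD, NEW))
```

A translation like this only helps if the file exists under the new prefix at restart time, and it does not restore any cgroup state (limits, accounting) that the application may have relied on.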

Rohan, do you have any suggestions?

Kapil

On Tue, Jun 19, 2018 at 12:40 PM Near-Ansari, Naveed <[email protected]>
wrote:

> Hi,
>
> I have been testing dmtcp for c/r for a cluster.  When I test manually, it
> runs fine.  Jobs restart as expected, and it is generally impressive.
>
> When I try to integrate it for use within slurm I run into issues upon
> restarting a process.  It seems to be a cgroup issue.  We have slurm create
> cgroups for each job, so it accesses resources within the cgroup.  When a
> job restarts, it is looking for those resources in the previous cgroup.
> What is the best way to map that out to the new cgroup created after
> resubmitting the job?
>
> Here is the error when trying to restart from a checkpoint file:
>
> [naveed@hpc-90-21 cp]$  dmtcp_restart  --interval 120 --new-coordinator
> ckpt_openssl_2ad5fc20c8bb8d9-40000-7ffcf636c0aab.dmtcp
> [40000] ERROR at fileconnection.cpp:737 in refill;
> REASON='JASSERT(jalib::Filesystem::FileExists(_path)) failed'
>      _path =
> /sys/fs/cgroup/blkio,cpuacct,memory,freezer/slurm/uid_8688/job_509451/step_0/memory.oom_control
> Message: File not found.
> openssl (40000): Terminating...
>
> (this is just a test with openssl speed)
>
> The original job was 509451 and it was restarted with job ID 509452, so
> it has a different cgroup.
>
> I would imagine this is a solved problem due to the slurm integration work
> and I am just missing something.  What do others do in this situation?
>
> Naveed
>
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
