Colleagues, I am late to this thread. (It brings me back to my days running checkpoint restart on an IBM 370, which was very useful for very, very long jobs.) A search for "linux checkpoint restore" retrieved information about CRIU (Checkpoint/Restore In Userspace), which sounds a lot like the facility I used on the IBM 370. It appears to allow a user's process to be stopped, have its state backed up, and then be restarted. Perhaps this would satisfy (at least for Linux users of R or RStudio) the request to have a checkpoint/restart ability in an R program.
Please let me know if you agree.
John

________________________________________
From: R-help <r-help-boun...@r-project.org> on behalf of Andy Jacobson via R-help <r-help@r-project.org>
Sent: Tuesday, December 14, 2021 8:59 PM
To: Henrik Bengtsson
Cc: Greg Minshall; Andy Jacobson via R-help; Andy Jacobson
Subject: Re: [R] checkpointing

I have been using DMTCP successfully for a long-running optim() task. This is a single-core process running on a large linux cluster with slurm as the job manager. This cluster places an 8-hour limit on individual jobs, and since my cost function takes 11 minutes to compute, I need many such jobs run sequentially. To make DMTCP work, I have had to rework file I/O to avoid references to temporary files written to /tmp, but other than that...optim() is checkpointed just before 8 hours is up, and then resumed successfully in a subsequent batch job running on a different core of the cluster.

While I have an answer for my particular task, it would still be useful to checkpoint using the scheme Henrik suggests.

Thanks all for the interesting conversation!

-Andy

On 12/14/21 5:39 PM, Henrik Bengtsson wrote:
> On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson <a...@yovo.org> wrote:
>>
>> Those are good points, Duncan. I am experimenting with a nice checkpointing
>> tool called DMTCP. It operates on the system level but is quite
>> OS-dependent. It can be found at
>> http://dmtcp.sourceforge.net/index.html.
>>
>> Still, it would be nice to be able to checkpoint calls within R to
>> potentially long-running processes like optim().
>
> Teasing idea.
> Imagine if we could come up with some de-facto standard
> API for this and that such a framework could be called automatically
> by R. Something similar to how user interrupts are checked (e.g.
> R_CheckUserInterrupt()) on a regular basis by the R engine and
> throughout the R code. That could help troubleshooting and debugging,
> e.g. sending the checkpoint to someone else or going backwards in
> time.
>
> Pasting in the below since I failed to hit Reply *All* the other day,
> and it was only Richard who got it:
>
> A few weeks ago, I played around with DMTCP (Distributed MultiThreaded
> CheckPointing) for Linux (https://github.com/dmtcp/dmtcp). I'm
> sharing in case someone is interested in investigating this further.
> Also, somewhere on the DMTCP wiki, they asked for testing with R by
> more experienced users.
>
> "DMTCP is a tool to transparently checkpoint the state of multiple
> simultaneous applications, including multi-threaded and distributed
> applications. It operates directly on the user binary executable,
> without any Linux kernel modules or other kernel modifications."
>
> They seem to be able to run this with HPC jobs, open files, Linux
> containers, and even MPI, and so on. I've only tested it very quickly
> with interactive R and it seems to work. Obviously more testing needs
> to be done to identify when it doesn't work. For example, I'd have a
> hard time believing it would work out of the box with local parallel
> PSOCK workers. They mention "plug-ins", so maybe there's a way to add
> support for specific use cases one by one.
>
> Different academic HPC environments appear to use it, e.g.
>
> * https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
> * http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/
> * https://wiki.york.ac.uk/display/RCS/VK21%29+Checkpointing+with+DMTCP
>
> That's all I have time for now,
>
> Henrik
>
>>
>> -Andy
>>
>> On 12/13/21 11:51 AM, Duncan Murdoch wrote:
>>> On 13/12/2021 12:58 p.m., Greg Minshall wrote:
>>>> Jeff,
>>>>
>>>>> This sounds like an OS feature, not an R feature... certainly not a
>>>>> portable R feature.
>>>>
>>>> i'm not arguing for it, but this seems to me like something that could
>>>> be a language feature.
>>>>
>>>
>>> R functions can call libraries written in other languages, and can start
>>> processes, etc. R doesn't know everything going on in every function call,
>>> and would have a lot of trouble saving it.
>>>
>>> If you added some limitations, e.g. a process that periodically has its
>>> entire state stored in R variables, then it would be a lot easier.
>>>
>>> Duncan Murdoch
>>
>> --
>> Andy Jacobson
>> a...@yovo.org
>>
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.r-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
--
Andy Jacobson
andy.jacob...@noaa.gov

NOAA Global Monitoring Lab
325 Broadway
Boulder, Colorado 80305

303/497-4916
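
For readers who cannot use a system-level tool such as DMTCP, Duncan's restricted case in the thread above (a process whose entire state is periodically stored in R variables) can be sketched directly in R. The following is only a rough illustration under stated assumptions, not anything proposed in the thread: the helper name checkpointed_optim(), its arguments, and the checkpoint file name are all hypothetical. It runs optim() in short chunks and saves the current parameters with saveRDS() after each chunk. Note that restarting optim() from saved parameters discards the optimizer's internal state (for BFGS, the Hessian approximation), so the path taken may differ from an uninterrupted run.

```r
# Hypothetical helper (not from the thread): run optim() in short chunks and
# save the state after each chunk so a later job can resume from the file.
checkpointed_optim <- function(par, fn, file, chunk_maxit = 50,
                               max_chunks = 100, method = "BFGS") {
  # Resume from an earlier checkpoint if one exists; otherwise start fresh.
  if (file.exists(file)) {
    state <- readRDS(file)
  } else {
    state <- list(par = par, value = NA_real_, chunk = 0L, done = FALSE)
  }

  while (!state$done && state$chunk < max_chunks) {
    # Each chunk restarts optim() from the last saved parameters.
    res <- optim(state$par, fn, method = method,
                 control = list(maxit = chunk_maxit))
    state$par   <- res$par
    state$value <- res$value
    state$chunk <- state$chunk + 1L
    state$done  <- res$convergence == 0  # 0 means optim() reports convergence

    # Write under a temporary name, then rename, so an interrupted write does
    # not leave a half-written checkpoint. Assumes a POSIX file system (e.g.
    # Linux), where renaming replaces an existing target.
    tmp <- paste0(file, ".tmp")
    saveRDS(state, tmp)
    file.rename(tmp, file)
  }
  state
}

# Usage sketch: a toy cost function standing in for an expensive one.
cost <- function(p) sum((p - c(1, 2))^2)
ckpt <- file.path(tempdir(), "optim_checkpoint.rds")
fit  <- checkpointed_optim(c(0, 0), cost, file = ckpt)
```

In a batch setting, each job would call the helper with the same checkpoint file and with max_chunks chosen to fit the time limit. A real job would also keep the file somewhere that persists across jobs rather than in tempdir(). This covers only state held in R variables; open connections, child processes, and compiled-code state are not captured, which is the limitation Duncan describes.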