I stumbled across CRIU (Checkpoint/Restore In Userspace) 
https://criu.org/Main_Page a couple of weeks ago.  
I have not used it yet; it's on my to-do list. They claim that it’s 
packaged with most distros; I checked RHEL/CentOS and it was there. Be 
careful about package/kernel version compatibility, which is a good reason 
to go with the version included in your distro. BLCR was last updated in 
January 2013; back in the day, it worked well enough for simpler apps, but 
less so for complicated MPI apps.

   - geo



> On Oct 4, 2019, at 11:17 PM, Renfro, Michael <ren...@tntech.edu> wrote:
> 
> DMTCP might be an option? Pretty sure there are RPMs for it in RHEL/CentOS 7. 
> Don’t recall it being any trouble to install.
> 
> http://dmtcp.sourceforge.net/
> 
> On Oct 4, 2019, at 9:47 PM, Eliot Moss <m...@cs.umass.edu> wrote:
> 
>> Dear slurm users --
>> 
>> I'm new to slurm (somewhat experienced with Grid Engine, though that's
>> not relevant to this post).  I have access to two slurm based clusters,
>> and have an application that (a) can be _very_ long running (more than
>> 8 weeks for one execution, though the compute and I/O demands of one
>> such job are not huge by modern standards) and that (b) is not at all
>> practical to convert to do its own checkpoints.  (I am running traces
>> from the valgrind program of every memory reference and branch made
>> when running individual SPEC benchmarks; this is then piped to 8
>> downstream analyzers, mostly Java programs.)
>> 
>> From what I have read, BLCR would meet my needs for checkpointing,
>> but the admins of both clusters are reluctant to pursue BLCR support.
>> I myself am wondering whether it is still working, etc., and what it
>> means that built-in support has been removed, etc.  Can someone offer
>> a brief explanation of the status and recent history of BLCR w.r.t.
>> slurm?
>> 
>> Many thanks!   Eliot Moss, UMass Amherst Computer Science
>> 
