Re: [slurm-users] Status of BLCR?

2019-10-06 Thread Chris Samuel

On 4/10/19 7:46 pm, Eliot Moss wrote:


> From what I have read, BLCR would meet my needs for checkpointing,
> but the admins of both clusters are reluctant to pursue BLCR support.
> I myself am wondering whether it is still working, etc., and what it
> means that built-in support has been removed, etc.


BLCR is no longer maintained, and SchedMD have removed the support that
Slurm used to have for it. As Michael mentioned, DMTCP is worth checking
out (I've not used it personally, but it does seem to be actively
developed). From my discussions with SchedMD, they'd rather not build
support into Slurm for any particular solution.


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] [External] Re: Status of BLCR?

2019-10-06 Thread Eliot Moss

On 10/6/2019 9:23 AM, George Wm Turner wrote:
> I stumbled across CRIU (Checkpoint/Restore In Userspace) https://criu.org/Main_Page a couple of
> weeks ago.  I have not utilized it yet; it's on my ToDo list. They claim that it’s packaged with
> most distros;  I checked RHEL/CentOS and it was there. Be careful of package/kernel versions; that is
> a good reason to go with the version included in your distro.  BLCR was last updated January 2013;
> back in the day, it worked well enough for simpler apps;  for complicated MPI apps, less so.


Thanks, George.  I've installed it and started looking at it.  At present
I am applying it to a Grid Engine job, and have not figured out how to make
it restore successfully.  (Checkpointing goes all right, but gives a minor
warning.)  It does seem to require running as root, and of course my file
systems are NFS mounted, which leads to issues.  (Since I am just running
some scratch things for testing, using 777 permissions (ouch!) seems to
allow checkpointing to proceed.)

I do need to understand a bit more of how it works and what flags I need :-) ...

It seems it needs root privilege to work, though maybe doing suid to root
is enough (I've not tried setting that on the executable).

Regards - EM
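
On the question of which flags are needed: a minimal sketch, assuming the
stock criu command-line tool is on the PATH and is run as root, of how the
dump and restore invocations might be assembled.  The helper names
(criu_dump_cmd, criu_restore_cmd), the PID, and the image directory are
hypothetical and only for illustration; -t, -D, --shell-job and
--leave-running are criu options, but whether --shell-job is needed depends
on whether the job is attached to a terminal session, which may not apply to
a batch job.

```python
import shlex


def criu_dump_cmd(pid, image_dir, shell_job=False, leave_running=False):
    """Build the argv for checkpointing the process tree rooted at pid."""
    cmd = ["criu", "dump", "-t", str(pid), "-D", str(image_dir)]
    if shell_job:
        # Needed when the target was started from an interactive shell.
        cmd.append("--shell-job")
    if leave_running:
        # Keep the process alive after the dump instead of killing it.
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(image_dir, shell_job=False):
    """Build the argv for restoring from the images stored in image_dir."""
    cmd = ["criu", "restore", "-D", str(image_dir)]
    if shell_job:
        cmd.append("--shell-job")
    return cmd


if __name__ == "__main__":
    # Hypothetical PID and directory; print the commands rather than run them.
    print(shlex.join(criu_dump_cmd(12345, "/tmp/ckpt", leave_running=True)))
    print(shlex.join(criu_restore_cmd("/tmp/ckpt")))
```

The sketch only prints the commands, so they can be checked before being run
as root (for example via subprocess.run).  The image directory should be one
that root can actually write to, which is where the NFS issue above comes in.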



Re: [slurm-users] [External] Re: Status of BLCR?

2019-10-06 Thread George Wm Turner
I stumbled across CRIU (Checkpoint/Restore In Userspace) 
https://criu.org/Main_Page  a couple of weeks ago.  
I have not utilized it yet; it's on my ToDo list. They claim that it’s 
packaged with most distros;  I checked RHEL/CentOS and it was there. Be 
careful of package/kernel versions; that is a good reason to go with the version 
included in your distro.  BLCR was last updated January 2013; back in the day, 
it worked well enough for simpler apps;  for complicated MPI apps, less so.

   - geo



> On Oct 4, 2019, at 11:17 PM, Renfro, Michael wrote:
> 
> DMTCP might be an option? Pretty sure there are RPMs for it in RHEL/CentOS 7. 
> Don’t recall it being any trouble to install.
> 
> http://dmtcp.sourceforge.net/ 
> 
> On Oct 4, 2019, at 9:47 PM, Eliot Moss wrote:
> 
>> Dear slurm users --
>> 
>> I'm new to slurm (somewhat experienced with Grid Engine, though that's
>> not relevant to this post).  I have access to two slurm based clusters,
>> and have an application that (a) can be _very_ long running (more than
>> 8 weeks for one execution, though the compute and I/O demands of one
>> such job are not huge by modern standards) and that (b) is not at all
>> practical to convert to do its own checkpoints.  (I am running traces
>> from the valgrind program of every memory reference and branch made
>> when running individual SPEC benchmarks; this is then piped to 8
>> downstream analyzers, mostly Java programs.)
>> 
>> From what I have read, BLCR would meet my needs for checkpointing,
>> but the admins of both clusters are reluctant to pursue BLCR support.
>> I myself am wondering whether it is still working, etc., and what it
>> means that built-in support has been removed, etc.  Can someone offer
>> a brief explanation of the status and recent history of BLCR w.r.t.
>> slurm?
>> 
>> Many thanks!   Eliot Moss, UMass Amherst Computer Science
>> 


