Re: [slurm-users] Status of BLCR?

2019-10-06 Thread Chris Samuel

On 4/10/19 7:46 pm, Eliot Moss wrote:


> From what I have read, BLCR would meet my needs for checkpointing,
> but the admins of both clusters are reluctant to pursue BLCR support.
> I myself am wondering whether it is still working, etc., and what it
> means that built-in support has been removed, etc.


BLCR is no longer maintained, and SchedMD have removed the support that
Slurm used to have for it. As Michael mentioned, DMTCP is worth checking
out (I've not used it personally, but it does seem to be actively
developed). From my discussions with SchedMD, they'd rather not build
support into Slurm for any particular checkpointing solution.


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Status of BLCR?

2019-10-04 Thread Renfro, Michael
DMTCP might be an option? Pretty sure there are RPMs for it in RHEL/CentOS 7. 
Don’t recall it being any trouble to install.

http://dmtcp.sourceforge.net/
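
For illustration only, here is a minimal sketch of how a DMTCP-wrapped job is commonly submitted to Slurm: the batch script restarts from checkpoint images if any exist and otherwise launches the program fresh under `dmtcp_launch`. The helper name `build_batch_script`, the script `./run_trace.sh`, the job name, and the checkpoint directory are all hypothetical. The `--interval` and `--ckptdir` options should be checked against `dmtcp_launch --help` and `dmtcp_restart --help` on the installed version, and whether DMTCP copes with a valgrind pipeline feeding Java analyzers is untested here.

```python
import shlex


def build_batch_script(command, ckpt_dir, interval_s=3600, time_limit="24:00:00"):
    """Return the text of a Slurm batch script that runs `command` under DMTCP.

    command    -- list of program arguments (hypothetical example workload)
    ckpt_dir   -- directory where DMTCP writes its ckpt_*.dmtcp images
    interval_s -- seconds between automatic checkpoints
    time_limit -- Slurm wall-clock limit for each submission
    """
    ckpt = shlex.quote(ckpt_dir)
    run = " ".join(shlex.quote(arg) for arg in command)
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=dmtcp-sketch",  # assumed job name
        f"#SBATCH --time={time_limit}",
        "",
        f"mkdir -p {ckpt}",
        "",
        "# Resume from existing checkpoint images if there are any;",
        "# otherwise start the program fresh under DMTCP.",
        f"if ls {ckpt}/ckpt_*.dmtcp >/dev/null 2>&1; then",
        f"    dmtcp_restart --interval {interval_s} --ckptdir {ckpt} {ckpt}/ckpt_*.dmtcp",
        "else",
        f"    dmtcp_launch --interval {interval_s} --ckptdir {ckpt} {run}",
        "fi",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a script that could be saved to a file and passed to sbatch.
    print(build_batch_script(["./run_trace.sh"], "/scratch/ckpt"))
```

Resubmitting the same script after the time limit expires would pick up from the most recent checkpoint, which is the usual way to fit a multi-week run into shorter queue limits.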




[slurm-users] Status of BLCR?

2019-10-04 Thread Eliot Moss

Dear slurm users --

I'm new to Slurm (somewhat experienced with Grid Engine, though that's
not relevant to this post).  I have access to two Slurm-based clusters,
and have an application that (a) can be _very_ long-running (more than
8 weeks for one execution, though the compute and I/O demands of one
such job are not huge by modern standards) and that (b) is not at all
practical to convert to do its own checkpoints.  (I am using valgrind
to trace every memory reference and branch made when running
individual SPEC benchmarks; the trace is then piped to 8 downstream
analyzers, mostly Java programs.)

From what I have read, BLCR would meet my needs for checkpointing,
but the admins of both clusters are reluctant to pursue BLCR support.
I myself am wondering whether it is still working, etc., and what it
means that built-in support has been removed, etc.  Can someone offer
a brief explanation of the status and recent history of BLCR w.r.t.
Slurm?

Many thanks!   Eliot Moss, UMass Amherst Computer Science