Re: [slurm-users] Status of BLCR?
On 4/10/19 7:46 pm, Eliot Moss wrote:

> From what I have read, BLCR would meet my needs for checkpointing, but the admins of both clusters are reluctant to pursue BLCR support. I myself am wondering whether it is still working, etc., and what it means that built-in support has been removed, etc.

BLCR is no longer maintained, and SchedMD have removed the support that Slurm used to have for it. As Michael mentioned, DMTCP is worth checking out (I've not used it personally, but it does seem to be actively developed). From my discussions with SchedMD, they'd rather not build support into Slurm for any particular solution.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Status of BLCR?
DMTCP might be an option? Pretty sure there are RPMs for it in RHEL/CentOS 7. Don't recall it being any trouble to install.

http://dmtcp.sourceforge.net/

On Oct 4, 2019, at 9:47 PM, Eliot Moss <m...@cs.umass.edu> wrote:

> Dear slurm users --
>
> I'm new to slurm (somewhat experienced with Grid Engine, though that's not relevant to this post). I have access to two slurm-based clusters, and have an application that (a) can be _very_ long running (more than 8 weeks for one execution, though the compute and I/O demands of one such job are not huge by modern standards) and that (b) is not at all practical to convert to do its own checkpoints. (I am running traces from the valgrind program of every memory reference and branch made when running individual SPEC benchmarks; this is then piped to 8 downstream analyzers, mostly Java programs.)
>
> From what I have read, BLCR would meet my needs for checkpointing, but the admins of both clusters are reluctant to pursue BLCR support. I myself am wondering whether it is still working, etc., and what it means that built-in support has been removed, etc.
>
> Can someone offer a brief explanation of the status and recent history of BLCR w.r.t. slurm?
>
> Many thanks!
>
> Eliot Moss, UMass Amherst Computer Science
[slurm-users] Status of BLCR?
Dear slurm users --

I'm new to slurm (somewhat experienced with Grid Engine, though that's not relevant to this post). I have access to two slurm-based clusters, and have an application that (a) can be _very_ long running (more than 8 weeks for one execution, though the compute and I/O demands of one such job are not huge by modern standards) and that (b) is not at all practical to convert to do its own checkpoints. (I am running traces from the valgrind program of every memory reference and branch made when running individual SPEC benchmarks; this is then piped to 8 downstream analyzers, mostly Java programs.)

From what I have read, BLCR would meet my needs for checkpointing, but the admins of both clusters are reluctant to pursue BLCR support. I myself am wondering whether it is still working, etc., and what it means that built-in support has been removed, etc.

Can someone offer a brief explanation of the status and recent history of BLCR w.r.t. slurm?

Many thanks!

Eliot Moss, UMass Amherst Computer Science