WHAT: Bring in the mtcp CRS component
WHY: Add support for the MTCP checkpoint/restart service
WHERE: opal/mca/crs/mtcp
TIMEOUT: Tuesday teleconf, 2011-10-18 (about 1 week from now)
-------------------------------------------
What is MTCP?
MTCP (MultiThreaded CheckPointing; http://dmtcp.sourceforge.net) is an
LGPL-licensed single-process checkpointing package that has been under
development for seven years. It operates entirely in user space,
requiring no special kernel modules or superuser access to a system.
Using it is as simple as linking with a library and adding a call to the
mtcp_init function to your code.
MTCP is distributed as a part of the DMTCP package, and is currently
available as a Debian package.
-------------------------------------------
More details:
The Open MPI MTCP integration is available at:
https://bitbucket.org/jsquyres/ompi-dmtcp2
The website of the parent DMTCP project is:
http://dmtcp.sourceforge.net/
This RFC introduces a new CRS component for Open MPI that uses MTCP to
provide transparent checkpointing. The primary advantage of MTCP over
the existing BLCR CRS component is that it operates entirely in user
space: any user can run it without special kernel modules or superuser
access. Like the BLCR component, the MTCP CRS component is entirely
transparent to the user process and requires no modification to the
user program.
Josh Hursey and Jeff Squyres have been working with the DMTCP authors
(based at Northeastern University in Boston, MA, USA) for quite a
while and feel that this component is ready to be brought into the
Open MPI main line for inclusion in the 1.7.x series (and possibly the
1.5.x series?). The authors have submitted OMPI 3rd party contribution
agreements.