Thanks Alex.  Can you answer George's other question about "hand waving"?  



On Oct 7, 2011, at 3:59 PM, Alex Brick wrote:

> Yes, we were trying to give some background on the project and use consistent 
> branding.  Our package is called DMTCP,  which includes two components: DMTCP 
> (a distributed checkpointer), and MTCP (a single process checkpointer, which 
> can be used both standalone and internally by DMTCP).
> 
> This RFC is for a CRS module that uses only the MTCP component.
> 
> 
> -- Alex
> 
> Josh Hursey <jjhur...@open-mpi.org> wrote:
> 
>> From what I have seen during development, this RFC integrates the MTCP
>> single process checkpointer into the C/R infrastructure of Open MPI.
>> The MTCP component of the DMTCP project can be used in isolation,
>> which is what they are integrating. So they can use DMTCP to
>> checkpoint/restart an unmodified Open MPI, but only over certain
>> networks. By integrating the MTCP checkpointer as a CRS component they
>> use Open MPI to coordinate across processes, and gain support for a
>> larger number of networks (e.g., IB, MX).
>> 
>> Alex, does that sound about right?
>> 
>> -- Josh
>> 
>> 
>> On Thu, Oct 6, 2011 at 4:33 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>> Alex,
>>> 
>>> It looks like there is a mismatch between what you propose to achieve and 
>>> the text in your RFC. You propose to add a new single-process 
>>> checkpoint-restart mechanism (MTCP), to the ones already provided in Open 
>>> MPI. However, most of the text in your RFC is about DMTCP, which is another 
>>> layer on top of MTCP capable of checkpointing and restarting distributed 
>>> applications.
>>> 
>>> I would like to understand what this RFC is really about: MTCP or DMTCP?
>>> 
>>>  george.
>>> 
>>> On Oct 6, 2011, at 02:58 , Alex Brick wrote:
>>> 
>>>> WHAT: Bring in the mtcp CRS component
>>>> 
>>>> WHY: Add support for the MTCP checkpoint/restart service
>>>> 
>>>> WHERE: opal/mca/crs/mtcp
>>>> 
>>>> TIMEOUT: Tuesday teleconf, 2011-10-18 (about 2 weeks from now)
>>>> 
>>>> -------------------------------------------
>>>> What is MTCP?
>>>> 
>>>> DMTCP (Distributed MultiThreaded CheckPointing, 
>>>> http://dmtcp.sourceforge.net) is a mature open source (LGPL) checkpointing 
>>>> package that has been under development for seven years. It operates 
>>>> entirely in user space, with no kernel modules and no modifications to the 
>>>> target application.  In the simplest case, it is used as follows:
>>>> 
>>>> dmtcp_checkpoint ./a.out
>>>> dmtcp_command --checkpoint
>>>> dmtcp_restart ckpt_a.out_*.dmtcp
>>>> 
>>>> DMTCP is contagious.  Any calls to fork(), pthread_create(), or "ssh"
>>>> are recognized by DMTCP, which keeps the resulting threads and the local
>>>> and remote processes under checkpoint control.  At checkpoint time, it also
>>>> generates a script, dmtcp_restart_script.sh, that can restart a 
>>>> distributed computation.  As a sign of its maturity, it can also 
>>>> checkpoint Open MPI "from on top":  dmtcp_checkpoint mpirun hello_mpi
>>>> 
>>>> The MTCP component of DMTCP is the single-process component.  It is used
>>>> both internally by DMTCP as well as directly by users only interested in
>>>> checkpointing a single process.  This standalone mode was used to develop 
>>>> a module for the Open MPI checkpoint-restart service that is similar to 
>>>> the BLCR module, except that no kernel modules are required.
>>>> 
>>>> DMTCP is currently a Debian package (Debian testing), and packages are also 
>>>> planned for Fedora and openSUSE.  These packages also provide the MTCP 
>>>> component for Open MPI.
>>>> 
>>>> -------------------------------------------
>>>> More details:
>>>> 
>>>> Open MPI MTCP integration implementation available at:
>>>> 
>>>>  https://bitbucket.org/jsquyres/ompi-dmtcp2
>>>> 
>>>> The DMTCP parent project website is below:
>>>> 
>>>>  http://dmtcp.sourceforge.net/
>>>> 
>>>> The Distributed MultiThreaded CheckPointing (DMTCP) Project supports 
>>>> user-level, transparent checkpoint/restart of a variety of sequential and 
>>>> parallel programs.  In Open MPI terms, this contribution is an alternative 
>>>> to the BLCR CRS module, meaning that users can use DMTCP to checkpoint 
>>>> their applications instead of BLCR.
>>>> 
>>>> The MTCP component is currently restricted to supporting communication 
>>>> over sockets and shared memory.  In an effort to support a wider range of 
>>>> networks (e.g., InfiniBand, Myrinet), they have created a CRS component to 
>>>> hook into Open MPI's checkpoint/restart infrastructure. The MTCP 
>>>> user-level checkpoint/restart service is the single process checkpoint 
>>>> kernel of the DMTCP project.  The MTCP kernel is what is used in the mtcp 
>>>> CRS component.
>>>> 
>>>> Jeff Squyres and Josh Hursey have been working with the DMTCP authors 
>>>> (based at Northeastern University in Boston, MA, USA) for quite 
>>>> a while and feel that this component is ready to be brought into the Open 
>>>> MPI main line for inclusion in the 1.7.x series (and possibly the 1.5.x 
>>>> series?).  The authors have submitted OMPI 3rd party contribution 
>>>> agreements.
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

