I think he's asking how MTCP does this without involvement by the MPI 
implementation.
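
To make the question concrete, here is a minimal sketch of the division of labor described in the quoted messages below: a coordination layer (the role Open MPI plays) quiesces communication, and a single-process checkpointer (the role MTCP plays) snapshots each process without knowing anything about the network. Every name in it (Process, crs_checkpoint, crs_restart, coordinated_checkpoint) is hypothetical and chosen for illustration; none of this is the actual Open MPI CRS interface or the MTCP API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Process:
    """Toy stand-in for one MPI rank: private state plus in-flight messages."""
    rank: int
    state: Dict[str, int]
    in_flight: List[str] = field(default_factory=list)


def crs_checkpoint(proc: Process) -> Dict[str, int]:
    """Hypothetical single-process checkpointer (the role MTCP plays).

    It copies local state only and never looks at the network.
    """
    return dict(proc.state)


def crs_restart(rank: int, image: Dict[str, int]) -> Process:
    """Hypothetical single-process restart: rebuild a process from its image."""
    return Process(rank=rank, state=dict(image))


def coordinated_checkpoint(procs: List[Process]) -> Dict[int, Dict[str, int]]:
    """Hypothetical coordination layer (the role Open MPI plays).

    Step 1: quiesce the network so that no message is in flight.
    Step 2: have the single-process checkpointer snapshot each rank.
    """
    for p in procs:
        # Deliver pending messages into local state before the snapshot.
        p.state["received"] = p.state.get("received", 0) + len(p.in_flight)
        p.in_flight.clear()
    return {p.rank: crs_checkpoint(p) for p in procs}


if __name__ == "__main__":
    ranks = [Process(0, {"step": 3}, ["msg-a"]), Process(1, {"step": 3})]
    images = coordinated_checkpoint(ranks)
    restored = [crs_restart(r, img) for r, img in images.items()]
    print([(p.rank, p.state) for p in restored])
```

In this toy model the checkpointer stays network-agnostic only because the coordinator drains traffic first. Whether MTCP alone can reach that state without help from the MPI layer is the open question in this thread.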


On Oct 7, 2011, at 9:44 PM, Alex Brick wrote:

> I'm a little unclear on this comment.
> 
> DMTCP currently supports checkpointing and restoring sockets over TCP, and we 
> are actively working on Infiniband support.  However, we feel that value is 
> added by also working as an Open MPI module, where Open MPI handles all of 
> the network communication, and our module simply handles checkpointing the 
> individual processes.  This enables people to use our user-level 
> checkpointing tools with other networks by using Open MPI.
> 
> What exactly is your question?
> 
> 
> -- Alex
> 
> George Bosilca <bosi...@eecs.utk.edu> wrote:
> 
>> Way too much hand-waving here.
>> 
>> When you say certain networks you mean TCP and potentially SM. However, I 
>> doubt even TCP can be fully supported. Not without the preconnect option … 
>> or a means to update the modes information.
>> 
>> george.
>> 
>> On Oct 7, 2011, at 14:56 , Josh Hursey wrote:
>> 
>>> From what I have seen during development, this RFC integrates the MTCP
>>> single-process checkpointer into the C/R infrastructure of Open MPI.
>>> The MTCP component of the DMTCP project can be used in isolation,
>>> which is what they are integrating. So they can use DMTCP to
>>> checkpoint/restart an unmodified Open MPI, but only over certain
>>> networks. By integrating the MTCP checkpointer as a CRS component they
>>> use Open MPI to coordinate across processes, and gain support for a
>>> larger number of networks (e.g., IB, MX).
>>> 
>>> Alex, does that sound about right?
>>> 
>>> -- Josh
>>> 
>>> 
>>> On Thu, Oct 6, 2011 at 4:33 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>> Alex,
>>>> 
>>>> It looks like there is a mismatch between what you propose to achieve and 
>>>> the text in your RFC. You propose to add a new single-process 
>>>> checkpoint-restart mechanism (MTCP), to the ones already provided in Open 
>>>> MPI. However, most of the text in your RFC is about DMTCP, which is 
>>>> another layer on top of MTCP capable of checkpointing and restarting 
>>>> distributed applications.
>>>> 
>>>> I would like to understand what this RFC is really about: MTCP or DMTCP?
>>>> 
>>>> george.
>>>> 
>>>> On Oct 6, 2011, at 02:58 , Alex Brick wrote:
>>>> 
>>>>> WHAT: Bring in the mtcp CRS component
>>>>> 
>>>>> WHY: Add support for the MTCP checkpoint/restart service
>>>>> 
>>>>> WHERE: opal/mca/crs/mtcp
>>>>> 
>>>>> TIMEOUT: Tuesday teleconf, 2011-10-18 (about 2 weeks from now)
>>>>> 
>>>>> -------------------------------------------
>>>>> What is MTCP?
>>>>> 
>>>>> DMTCP (Distributed MultiThreaded CheckPointing, 
>>>>> http://dmtcp.sourceforge.net) is a mature open source (LGPL) 
>>>>> checkpointing package that has been under development for seven years. It 
>>>>> operates entirely in user space, with no kernel modules or modifications 
>>>>> to the target application.  If used in the simplest possible way, it 
>>>>> works as:
>>>>> 
>>>>> dmtcp_checkpoint ./a.out
>>>>> dmtcp_command --checkpoint
>>>>> dmtcp_restart ckpt_a.out_*.dmtcp
>>>>> 
>>>>> DMTCP is contagious.  Any calls to fork(), pthread_create(), or "ssh"
>>>>> are recognized by DMTCP, which keeps those threads and the local and
>>>>> remote processes under checkpoint control.  At checkpoint time, it also
>>>>> generates a script, dmtcp_restart_script.sh, that can restart a 
>>>>> distributed computation.  As a sign of its maturity, it can also 
>>>>> checkpoint Open MPI "from on top":  dmtcp_checkpoint mpirun hello_mpi
>>>>> 
>>>>> The MTCP component of DMTCP is the single-process component.  It is used
>>>>> both internally by DMTCP as well as directly by users only interested in
>>>>> checkpointing a single process.  This second use is the basis for an 
>>>>> Open MPI module for the Open MPI checkpoint-restart service, 
>>>>> similar to BLCR except that no kernel modules are required.
>>>>> 
>>>>> DMTCP is currently a Debian package (Debian testing), and is also planned 
>>>>> for Fedora and openSUSE.  These packages also provide the MTCP component 
>>>>> for Open MPI.
>>>>> 
>>>>> -------------------------------------------
>>>>> More details:
>>>>> 
>>>>> Open MPI MTCP integration implementation available at:
>>>>> 
>>>>> https://bitbucket.org/jsquyres/ompi-dmtcp2
>>>>> 
>>>>> The DMTCP parent project website is below:
>>>>> 
>>>>> http://dmtcp.sourceforge.net/
>>>>> 
>>>>> The Distributed MultiThreaded CheckPointing (DMTCP) Project supports 
>>>>> user-level, transparent checkpoint/restart of a variety of sequential and 
>>>>> parallel programs.  In Open MPI terms, this contribution is an 
>>>>> alternative to the BLCR CRS module, meaning that users can use DMTCP to 
>>>>> checkpoint their applications instead of BLCR.
>>>>> 
>>>>> The MTCP component is currently restricted to supporting communication 
>>>>> over sockets and shared memory.  In an effort to support a wider range of 
>>>>> networks (e.g., InfiniBand, Myrinet), they have created a CRS component 
>>>>> to hook into Open MPI's checkpoint/restart infrastructure. The MTCP 
>>>>> user-level checkpoint/restart service is the single-process checkpoint 
>>>>> kernel of the DMTCP project.  The MTCP kernel is what is used in the mtcp 
>>>>> CRS component.
>>>>> 
>>>>> Jeff Squyres and Josh Hursey have been working with the DMTCP authors 
>>>>> (based at Northeastern University in Boston, MA, USA) for 
>>>>> quite a while and feel that this component is ready to be brought into 
>>>>> the Open MPI main line for inclusion in the 1.7.x series (and possibly 
>>>>> the 1.5.x series?).  The authors have submitted OMPI 3rd party 
>>>>> contribution agreements.
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey
>>> 
>> 
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

