I think he's asking how MTCP does this without involvement by the MPI implementation.
On Oct 7, 2011, at 9:44 PM, Alex Brick wrote:
> I'm a little unclear on this comment.
>
> DMTCP currently supports checkpointing and restoring sockets over TCP, and we
> are actively working on InfiniBand support. However, we feel that value is
> added by also working as an Open MPI module, where Open MPI handles all of
> the network communication, and our module simply handles checkpointing the
> individual processes. This enables people to use our user-level
> checkpointing tools with other networks by using Open MPI.
>
> What exactly is your question?
>
>
> -- Alex
>
> George Bosilca <bosi...@eecs.utk.edu> wrote:
>
>> Way too much hand-waving here.
>>
>> When you say certain networks you mean TCP and potentially SM. However, I
>> doubt even TCP can be fully supported. Not without the preconnect option …
>> or a means to update the modex information.
>>
>> george.
>>
>> On Oct 7, 2011, at 14:56 , Josh Hursey wrote:
>>
>>> From what I have seen during development, this RFC integrates the MTCP
>>> single process checkpointer into the C/R infrastructure of Open MPI.
>>> The MTCP component of the DMTCP project can be used in isolation,
>>> which is what they are integrating. So they can use DMTCP to
>>> checkpoint/restart an unmodified Open MPI, but only over certain
>>> networks. By integrating the MTCP checkpointer as a CRS component they
>>> use Open MPI to coordinate across processes, and gain support for a
>>> larger number of networks (e.g., IB, MX).
>>>
>>> Alex, does that sound about right?
>>>
>>> -- Josh
>>>
>>>
>>> On Thu, Oct 6, 2011 at 4:33 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>> Alex,
>>>>
>>>> It looks like there is a mismatch between what you propose to achieve and
>>>> the text in your RFC. You propose to add a new single-process
>>>> checkpoint-restart mechanism (MTCP) to the ones already provided in Open
>>>> MPI.
>>>> However, most of the text in your RFC is about DMTCP, which is
>>>> another layer on top of MTCP capable of checkpoint/restarting distributed
>>>> applications.
>>>>
>>>> I would like to understand what this RFC is really about: MTCP or DMTCP?
>>>>
>>>> george.
>>>>
>>>> On Oct 6, 2011, at 02:58 , Alex Brick wrote:
>>>>
>>>>> WHAT: Bring in the mtcp CRS component
>>>>>
>>>>> WHY: Add support for the MTCP checkpoint/restart service
>>>>>
>>>>> WHERE: opal/mca/crs/mtcp
>>>>>
>>>>> TIMEOUT: Tuesday teleconf, 2011-10-18 (about 2 weeks from now)
>>>>>
>>>>> -------------------------------------------
>>>>> What is MTCP?
>>>>>
>>>>> DMTCP (Distributed MultiThreaded CheckPointing,
>>>>> http://dmtcp.sourceforge.net) is a mature open source (LGPL)
>>>>> checkpointing package that has been under development for seven years. It
>>>>> operates entirely in user space, with no kernel modules or modifications
>>>>> to the target application. If used in the simplest possible way, it
>>>>> works as:
>>>>>
>>>>>   dmtcp_checkpoint ./a.out
>>>>>   dmtcp_command --checkpoint
>>>>>   dmtcp_restart ckpt_a.out_*.dmtcp
>>>>>
>>>>> DMTCP is contagious. Any calls to fork(), pthread_create(), or "ssh"
>>>>> are recognized by DMTCP, and it maintains those threads, and local and
>>>>> remote processes, under checkpoint control. At checkpoint time, it also
>>>>> generates a script, dmtcp_restart_script.sh, that can restart a
>>>>> distributed computation. As a sign of its maturity, it can also
>>>>> checkpoint Open MPI "from on top": dmtcp_checkpoint mpirun hello_mpi
>>>>>
>>>>> The MTCP component of DMTCP is the single-process component. It is used
>>>>> both internally by DMTCP as well as directly by users only interested in
>>>>> checkpointing a single process. This second feature was used in order to
>>>>> develop an Open MPI module for the Open MPI checkpoint-restart service
>>>>> similar to BLCR, except that no kernel modules are required.
>>>>>
>>>>> DMTCP is currently a Debian package (Debian testing), and is planned also
>>>>> for Fedora and openSUSE. These packages also provide the MTCP component
>>>>> for Open MPI.
>>>>>
>>>>> -------------------------------------------
>>>>> More details:
>>>>>
>>>>> Open MPI MTCP integration implementation available at:
>>>>>
>>>>> https://bitbucket.org/jsquyres/ompi-dmtcp2
>>>>>
>>>>> The DMTCP parent project website is below:
>>>>>
>>>>> http://dmtcp.sourceforge.net/
>>>>>
>>>>> The Distributed MultiThreaded CheckPointing (DMTCP) Project supports
>>>>> user-level, transparent checkpoint/restart of a variety of sequential and
>>>>> parallel programs. In Open MPI terms, this contribution is an
>>>>> alternative to the BLCR CRS module, meaning that users can use DMTCP to
>>>>> checkpoint their applications instead of BLCR.
>>>>>
>>>>> The MTCP component is currently restricted to supporting communication
>>>>> over sockets and shared memory. In an effort to support a wider range of
>>>>> networks (e.g., InfiniBand, Myrinet), they have created a CRS component
>>>>> to hook into Open MPI's checkpoint/restart infrastructure. The MTCP
>>>>> user-level checkpoint/restart service is the single process checkpoint
>>>>> kernel of the DMTCP project. The MTCP kernel is what is used in the mtcp
>>>>> CRS component.
>>>>>
>>>>> Jeff Squyres and Josh Hursey have been working with the DMTCP authors
>>>>> (based out of the US Northeastern University in Boston, MA, USA) for
>>>>> quite a while and feel that this component is ready to be brought into
>>>>> the Open MPI main line for inclusion in the 1.7.x series (and possibly
>>>>> the 1.5.x series?). The authors have submitted OMPI 3rd party
>>>>> contribution agreements.
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>> --
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/