Hi Adrian and Gilles, first of all thank you for your responses. I'm working with Gianmario on this ambitious project.

2015-10-22 13:17 GMT+02:00 Gilles Gouaillardet <[email protected]>:

> Gianmario,
>
> there was c/r support in the v1.6 series but it has been removed.
> the current trend is to do application level checkpointing
> (much more efficient and much smaller checkpoint file size)
>
> iirc, ompi took care of closing/restoring all communication, and a third
> party checkpoint was required to checkpoint/restart *standalone* processes.
>
> generally speaking, mpirun and orted communicate via tcp
> orted and MPI (intra node comms) currently use tcp but we are moving to
> unix sockets
> MPI tasks communicate via btl (infiniband, tcp, shared memory, ...)
>

We have also seen that orted opens two pipes to each child. Is that correct?
Does orted use them to communicate with its children?

> imho, moving only one MPI task to an other node is much harder, not to say
> impossible, than moving orted and its children MPI tasks to an other node
>

May I ask why? If we migrate the entire orted, we need to close/reopen the
*mpirun-orted* and *task-task* (btl) sockets; if we migrate a single task, we
need to close/reopen the *orted-task* and *task-task* sockets. In both cases
we have to broadcast the new location of the task or orted.

> Cheers,
>
> Gilles
>
>
> On Thursday, October 22, 2015, Gianmario Pozzi <[email protected]>
> wrote:
>
>> Hi everyone!
>>
>> My team and I are working on the possibility to checkpoint a process and
>> restarting it on another node. We are using CRIU framework for the
>> checkpoint/restart part, but we are facing some issues related to migration.
>>
>> First of all: we found out that some attempts to C/R an OMPI process have
>> been already made in the past. Is anything related to that still
>> supported/available/working?
>>
>> Then, we need to know which network communications are used at any time,
>> in order to "pause" them during migrations (at least the ones involving the
>> migrating node).
>> Our code analysis makes us think that:
>> -OpenMPI runtime (HNP<->orteds) uses orte/OOB
>> -Running applications exchange data via ompi/BTL
>>
>> Is that correct? If not, can someone give us a hint?
>>
>> Questions on how to update topology info may be yet to come.
>>
>> Thank you guys!
>>
>> Gianmario
>>
>
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18242.php
>

Cheers,
Federico

__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering
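
P.S. A minimal sketch, assuming Linux and a readable /proc filesystem, of how
the descriptors held by a process such as orted could be listed to check for
the pipes and sockets discussed above. The function names are illustrative
only and are not part of Open MPI; the PID is passed on the command line.

```python
import os
import sys


def classify_fd_target(target):
    """Classify a /proc/<pid>/fd symlink target by its Linux naming scheme."""
    if target.startswith("pipe:["):
        return "pipe"
    if target.startswith("socket:["):
        return "socket"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def list_fds(pid):
    """Return {fd_number: kind} for every descriptor open in process `pid`."""
    fd_dir = "/proc/%d/fd" % pid
    kinds = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        kinds[int(name)] = classify_fd_target(target)
    return kinds


if __name__ == "__main__":
    # Default to the current process when no PID is given.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for fd, kind in sorted(list_fds(pid).items()):
        print(fd, kind)
```

Running it against the PID of an orted and of one of its children would show
which descriptors are pipes and which are sockets; it does not tell what the
runtime uses them for.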
