Each module has the opportunity to provide an ft_event function, which is supposed to be called when a change in the module's behavior is necessary. Thus, it is relatively easy to let the BTL know that a particular destination process will migrate to a new location.
George.

On Fri, Oct 23, 2015 at 5:45 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> Gianmario,
>
> Iirc, there is one pipe between orted and each child's stderr.
> stdout is a pty, and stdin is /dev/null, but it might be a pipe on task 0.
> This is how stdout/stderr from the tasks end up being printed by mpirun:
> orted does i/o forwarding (aka IOF).
>
> Are you trying to migrate only one task (while the other tasks keep running),
> or are you trying to checkpoint and restart on a different set of nodes?
>
> Typically, a task uses shared memory for intra-node communications, and
> infiniband or tcp for inter-node communications.
> So if you migrate only one task, and I assume you have no virtual shared
> memory, then you need to notify its neighbors that they have to switch from shm
> to ib/tcp.
> At first glance, that is much harder than moving orted and its children:
> you would "only" have to re-establish all connections and migrate the shm.
> Also, orted assumes/needs its children to be running on the same node (they
> use a session dir in /tmp, orted waits for SIGCHLD when a child dies, ...), so
> if you migrate everything, you do not have to worry about that part.
>
> You might also want to consider some virtualization:
> if a node is running in its own vm, or its own container with a virtual
> ip, you could reuse existing infrastructure, at least to migrate orted and
> its tcp/ip connections.
>
> Cheers,
>
> Gilles
>
> Federico Reghenzani <federico1.reghenz...@mail.polimi.it> wrote:
> Hi Adrian and Gilles,
>
> first of all, thank you for your responses. I'm working with Gianmario on
> this ambitious project.
>
> 2015-10-22 13:17 GMT+02:00 Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com>:
>
>> Gianmario,
>>
>> there was c/r support in the v1.6 series, but it has been removed.
>> The current trend is to do application-level checkpointing
>> (much more efficient, with much smaller checkpoint files).
>>
>> iirc, ompi took care of closing/restoring all communications, and a third-party
>> checkpointer was required to checkpoint/restart *standalone* processes.
>>
>> Generally speaking, mpirun and orted communicate via tcp;
>> orted and MPI tasks (intra-node comms) currently use tcp, but we are moving to
>> unix sockets;
>> MPI tasks communicate via btl (infiniband, tcp, shared memory, ...).
>>
>
> We have also seen that orted opens 2 pipes to each child, is that correct?
> Does orted use them to communicate with its children?
>
>
>
>> imho, moving only one MPI task to another node is much harder, not to
>> say impossible, than moving orted and its child MPI tasks to another
>> node.
>>
>
> Mmm, can I ask you why? I mean, if we migrate the entire orted we need to
> close/reopen *mpirun-orted* and *task-task* (btl) sockets, and if we
> migrate a single task we need to close/reopen *orte-task* and
> *task-task* sockets. In both cases we have to broadcast the information
> about the "changing location" of the task or orted.
>
>
>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Thursday, October 22, 2015, Gianmario Pozzi <pozzigma...@gmail.com>
>> wrote:
>>
>>> Hi everyone!
>>>
>>> My team and I are working on the possibility of checkpointing a process and
>>> restarting it on another node. We are using the CRIU framework for the
>>> checkpoint/restart part, but we are facing some issues related to migration.
>>>
>>> First of all: we found out that some attempts to C/R an OMPI process
>>> have already been made in the past. Is anything related to that still
>>> supported/available/working?
>>>
>>> Then, we need to know which network communications are used at any time,
>>> in order to "pause" them during migrations (at least the ones involving the
>>> migrating node).
>>> Our code analysis makes us think that:
>>> - OpenMPI runtime (HNP<->orteds) uses orte/OOB
>>> - Running applications exchange data via ompi/BTL
>>>
>>> Is that correct? If not, can someone give us a hint?
>>>
>>> Questions on how to update topology info may be yet to come.
>>>
>>> Thank you guys!
>>>
>>> Gianmario
>>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/10/18242.php
>>
>
>
> Cheers,
> Federico
> __
> Federico Reghenzani
> M.Eng. Student @ Politecnico di Milano
> Computer Science and Engineering
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18253.php
>