Hi Gilles,
Thank you again for your great answer. Our idea is to migrate tasks
between nodes, possibly individually, while the other tasks keep running
(obviously, if they want to communicate with the "migrating" node, we
should pause them).
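
To make "pause them" concrete, here is a minimal sketch of the idea,
assuming a hypothetical MigrationGate helper. Nothing here is Open MPI
code; the class and method names are invented for illustration. A sender
checks the gate before talking to a node and blocks while that node is
being migrated.

```python
import threading


class MigrationGate:
    """Hypothetical helper: holds back senders while a node is migrating."""

    def __init__(self):
        self._lock = threading.Lock()
        self._migrating = {}  # node name -> Event that is set when migration ends

    def begin_migration(self, node):
        # Mark the node as migrating; later senders will block on its Event.
        with self._lock:
            self._migrating[node] = threading.Event()

    def end_migration(self, node):
        # Unmark the node and wake every sender that was waiting on it.
        with self._lock:
            done = self._migrating.pop(node, None)
        if done is not None:
            done.set()

    def wait_until_reachable(self, node, timeout=None):
        # Returns True when it is safe to send, False if the timeout expired.
        with self._lock:
            done = self._migrating.get(node)
        if done is None:
            return True  # node is not migrating, send immediately
        return done.wait(timeout)
```

Tasks talking to other nodes are never blocked, which matches the goal of
letting the rest of the job keep running.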


Just to be sure we have understood correctly: is the attached image
accurate?
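
In code terms, our understanding of the stderr pipe described in the quoted
message below is roughly the following sketch. It is only an illustration
of a parent forwarding a child's stderr through a pipe, not the actual
orted IOF code; the function name run_and_forward_stderr is hypothetical,
and the pty used for stdout is left out.

```python
import os
import subprocess
import sys


def run_and_forward_stderr(argv):
    """Spawn a child, relay its stderr to our own stderr, return (rc, bytes)."""
    child = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,  # stdin is /dev/null, as in the description
        stderr=subprocess.PIPE,    # one pipe between parent and child stderr
    )
    forwarded = []
    for line in child.stderr:      # read until the child closes its end
        forwarded.append(line)
        os.write(2, line)          # forward to the parent's stderr (fd 2)
    child.wait()
    return child.returncode, b"".join(forwarded)
```

This also shows why the parent and its children are tied to the same node:
the pipe is a local kernel object, so it has to be re-created if only one
side moves.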

Cheers,
Federico
__
Federico Reghenzani
M.Eng. Student @ Politecnico di Milano
Computer Science and Engineering



2015-10-23 11:45 GMT+02:00 Gilles Gouaillardet <
gilles.gouaillar...@gmail.com>:

> Gianmario,
>
> Iirc, there is one pipe between orted and each child's stderr.
> stdout is a pty, and stdin is /dev/null, but it might be a pipe on task 0.
> This is the way stdout/stderr from tasks end up being printed by mpirun:
> orted does i/o forwarding (aka IOF).
>
> are you trying to migrate only one task (and other tasks still run) or are
> you trying to checkpoint and restart on a different set of nodes ?
>
> Typically, a task uses shared memory for intra node communications, and
> infiniband or tcp for inter node communications.
> So if you migrate only one task, and i assume you have no virtual shared
> memory, then you need to notify its neighbors they have to switch from shm
> to ib/tcp.
> At first glance, that is much harder than moving orted and its children :
> You would "only" have to re-establish all connections and migrate the shm.
> Also, orted assumes/needs its children to be running on the same node
> (they use a session dir in /tmp, orted waits for SIGCHLD when its child
> dies, ...), so if you migrate everything, you do not have to worry about
> that part.
>
> You might also want to consider some virtualization :
> If a node is running in its own vm, or its own container with a virtual
> ip, you could reuse existing infrastructure at least to migrate orted and
> its tcp/ip connections
>
> Cheers,
>
> Gilles
>
> Federico Reghenzani <federico1.reghenz...@mail.polimi.it> wrote:
> Hi Adrian and Gilles,
>
> first of all thank you for your responses. I'm working with Gianmario on
> this ambitious project.
>
> 2015-10-22 13:17 GMT+02:00 Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com>:
>
>> Gianmario,
>>
>> there was c/r support in the v1.6 series but it has been removed.
>> the current trend is to do application level checkpointing
>> (much more efficient and much smaller checkpoint file size)
>>
>> iirc, ompi took care of closing/restoring all communication, and a third
>> party checkpoint was required to checkpoint/restart *standalone* processes.
>>
>> generally speaking, mpirun and orted communicate via tcp
>> orted and MPI (intra node comms) currently use tcp but we are moving to
>> unix sockets
>> MPI tasks communicate via btl (infiniband, tcp, shared memory, ...)
>>
>>
> We have also seen that orted opens 2 pipes to each child, is that correct?
> Does orted use them to communicate with its children?
>
>
>
>> imho, moving only one MPI task to another node is much harder, not to
>> say impossible, than moving orted and its children MPI tasks to another
>> node
>>
>>
> Mmm, may I ask why? I mean, if we migrate the entire orted we need to
> close/reopen the *mpirun-orted* and *task-task* (btl) sockets, and if we
> migrate a single task we need to close/reopen the *orted-task* and
> *task-task* sockets. In both cases we have to broadcast the information
> about the "changed location" of the task or orted.
>
>
>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Thursday, October 22, 2015, Gianmario Pozzi <pozzigma...@gmail.com>
>> wrote:
>>
>>> Hi everyone!
>>>
>>> My team and I are working on the possibility to checkpoint a process and
>>> restart it on another node. We are using the CRIU framework for the
>>> checkpoint/restart part, but we are facing some issues related to migration.
>>>
>>> First of all: we found out that some attempts to C/R an OMPI process
>>> have been already made in the past. Is anything related to that still
>>> supported/available/working?
>>>
>>> Then, we need to know which network communications are used at any time,
>>> in order to "pause" them during migrations (at least the ones involving the
>>> migrating node). Our code analysis makes us think that:
>>> -OpenMPI runtime (HNP<->orteds) uses orte/OOB
>>> -Running applications exchange data via ompi/BTL
>>>
>>> Is that correct? If not, can someone give us a hint?
>>>
>>> Questions on how to update topology info may be yet to come.
>>>
>>> Thank you guys!
>>>
>>> Gianmario
>>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/10/18242.php
>>
>
>
> Cheers,
> Federico
> __
> Federico Reghenzani
> M.Eng. Student @ Politecnico di Milano
> Computer Science and Engineering
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/10/18253.php
>
