Re: [OMPI devel] OMPI devel] Checkpoint/restart + migration

2015-11-13 Thread Federico Reghenzani
2015-10-26 8:04 GMT+01:00 Gilles Gouaillardet : > Federico, > > that looks good to me. > the image does not show the channel between orted and its children. > this is currently a TCP socket (v1.10) and we are moving to a Unix socket > (already in master) > > Which is the framework involved in this

Re: [OMPI devel] Checkpoint/restart + migration

2015-10-27 Thread Gianmario Pozzi
Thank you guys, your help is really appreciated! We'll keep in touch for further information. Gianmario On 23 Oct 2015 12:44, "Jeff Squyres (jsquyres)" wrote: > On Oct 22, 2015, at 7:17 AM, Gilles Gouaillardet < > gilles.gouaillar...@gmail.com> wrote: > > > > Gianmario, > > > > there was c/

Re: [OMPI devel] OMPI devel] Checkpoint/restart + migration

2015-10-26 Thread Gilles Gouaillardet
Federico, that looks good to me. The image does not show the channel between orted and its children. This is currently a TCP socket (v1.10) and we are moving to a Unix socket (already in master). Cheers, Gilles On 10/26/2015 3:28 PM, Federico Reghenzani wrote: Hi Gilles, thank you again fo

Re: [OMPI devel] OMPI devel] Checkpoint/restart + migration

2015-10-26 Thread Federico Reghenzani
Hi Gilles, thank you again for your great answer. Our idea is to migrate tasks between nodes, possibly individually, while the other tasks keep running (obviously, if they want to communicate with the "migrating" node, we should pause them). Just to be sure we have understood correctly, is the attached i

Re: [OMPI devel] OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread George Bosilca
Each module has the opportunity to provide an ft_event function, which is supposed to be called when a change in the module's behavior is necessary. Thus, it is relatively easy to let the BTL know that a particular destination process will migrate to a new location. George. On Fri, Oc

Re: [OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread Jeff Squyres (jsquyres)
On Oct 22, 2015, at 7:17 AM, Gilles Gouaillardet wrote: > > Gianmario, > > there was c/r support in the v1.6 series but it has been removed. To be specific: the C/R support was removed from the v2.x branch because it is stale / not working. The support is still in master, albeit with Adrian'

Re: [OMPI devel] OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread Gilles Gouaillardet
Gianmario, IIRC, there is one pipe between orted and each child's stderr. stdout is a pty, and stdin is /dev/null, but it might be a pipe on task 0. This is how stdout/stderr from tasks ends up being printed by mpirun: orted does I/O forwarding (aka IOF). Are you trying to migrate only one ta

Re: [OMPI devel] Checkpoint/restart + migration

2015-10-23 Thread Federico Reghenzani
Hi Adrian and Gilles, first of all thank you for your responses. I'm working with Gianmario on this ambitious project. 2015-10-22 13:17 GMT+02:00 Gilles Gouaillardet < gilles.gouaillar...@gmail.com>: > Gianmario, > > there was c/r support in the v1.6 series but it has been removed. > the current

Re: [OMPI devel] Checkpoint/restart + migration

2015-10-22 Thread Gilles Gouaillardet
Gianmario, there was c/r support in the v1.6 series but it has been removed. The current trend is to do application-level checkpointing (much more efficient and much smaller checkpoint file size). IIRC, ompi took care of closing/restoring all communication, and a third party checkpoint was require

Re: [OMPI devel] Checkpoint/restart + migration

2015-10-22 Thread Adrian Reber
On Thu, Oct 22, 2015 at 12:15:22PM +0200, Gianmario Pozzi wrote: > My team and I are working on the possibility of checkpointing a process and > restarting it on another node. We are using the CRIU framework for the > checkpoint/restart part, but we are facing some issues related to migration. > > First

[OMPI devel] Checkpoint/restart + migration

2015-10-22 Thread Gianmario Pozzi
Hi everyone! My team and I are working on the possibility of checkpointing a process and restarting it on another node. We are using the CRIU framework for the checkpoint/restart part, but we are facing some issues related to migration. First of all: we found out that some attempts to C/R an OMPI proces