On Thu, Oct 22, 2015 at 12:15:22PM +0200, Gianmario Pozzi wrote:
> My team and I are working on the possibility to checkpoint a process and
> restart it on another node. We are using the CRIU framework for the
> checkpoint/restart part, but we are facing some issues related to migration.
> 
> First of all: we found out that some attempts to C/R an OMPI process have
> already been made in the past. Is anything related to that still
> supported/available/working?

I was working on the CRIU <-> OpenMPI integration during 2013/2014. The
code is still available at:

https://github.com/open-mpi/ompi/tree/master/opal/mca/crs/criu

I was able to checkpoint and restart a process under OpenMPI's control:

http://lisas.de/~adrian/?p=926

From what I have heard/read, OpenMPI has probably had enough internal
changes that the Fault Tolerance framework, which is needed to use the
checkpoint/restart functionality, is currently no longer working.

In addition, CRIU has also changed a bit. I used the criu service daemon
to start the checkpoint. This service daemon no longer exists due to
security concerns:

https://lwn.net/Articles/658070/

So you need to either call the criu binary directly or use 'criu
swrk'.
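
As a rough illustration of the first option, calling the criu binary
directly could look something like the sketch below. The helper names
(build_dump_cmd, checkpoint) and the example PID and image directory are
made up for illustration; the sketch also assumes criu is in PATH, is run
with sufficient privileges, and that the target was started from a shell
(hence --shell-job).

```python
import subprocess


def build_dump_cmd(pid, images_dir, shell_job=True):
    """Build the argv for 'criu dump' (hypothetical helper, not part of CRIU)."""
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if shell_job:
        # Needed when the process tree is attached to a terminal/shell session.
        cmd.append("--shell-job")
    return cmd


def checkpoint(pid, images_dir):
    """Run 'criu dump'; raises CalledProcessError if criu exits non-zero."""
    subprocess.run(build_dump_cmd(pid, images_dir), check=True)


if __name__ == "__main__":
    # Example values only; replace with a real PID and an existing directory.
    print(" ".join(build_dump_cmd(1234, "/tmp/ckpt")))
```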

Restore should be easier as criu now supports the option --inherit-fd
which should help to correctly re-route stdin/stdout/stderr.
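
A hedged sketch of how a restore using --inherit-fd might be driven from
a wrapper follows. The helper names and the example resource string
"pipe:[12345]" are assumptions for illustration only; the real resource
identifier has to match what was recorded in the checkpoint images, and
the descriptor must be open in the process that invokes criu.

```python
import subprocess


def build_restore_cmd(images_dir, inherited):
    """Build the argv for 'criu restore' (hypothetical helper).

    'inherited' maps an open file descriptor number in the calling
    process to the resource name it should replace in the restored
    process, e.g. {3: "pipe:[12345]"}.
    """
    cmd = ["criu", "restore", "--images-dir", images_dir]
    for fd, resource in sorted(inherited.items()):
        cmd += ["--inherit-fd", f"fd[{fd}]:{resource}"]
    return cmd


def restore(images_dir, inherited):
    """Run 'criu restore', keeping the inherited descriptors open for criu."""
    subprocess.run(
        build_restore_cmd(images_dir, inherited),
        check=True,
        pass_fds=tuple(inherited),  # do not close these fds in the child
    )


if __name__ == "__main__":
    # Example values only; 12345 is a made-up pipe inode.
    print(" ".join(build_restore_cmd("/tmp/ckpt", {3: "pipe:[12345]"})))
```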

                Adrian
