On Thu, Oct 22, 2015 at 12:15:22PM +0200, Gianmario Pozzi wrote:
> My team and I are working on the possibility to checkpoint a process and
> restarting it on another node. We are using CRIU framework for the
> checkpoint/restart part, but we are facing some issues related to migration.
>
> First of all: we found out that some attempts to C/R an OMPI process have
> been already made in the past. Is anything related to that still
> supported/available/working?
I worked on the CRIU <-> OpenMPI integration during 2013/2014. The code is
still available at:

 https://github.com/open-mpi/ompi/tree/master/opal/mca/crs/criu

I was able to checkpoint and restart a process under OpenMPI's control:

 http://lisas.de/~adrian/?p=926

From what I have heard/read, OpenMPI has probably had enough internal
changes that the Fault Tolerance framework, which is needed to use the
checkpoint/restart functionality, is currently no longer working.

In addition, CRIU has also changed a bit. I used the criu service daemon to
start the checkpoint. This service daemon no longer exists due to security
concerns:

 https://lwn.net/Articles/658070/

So you either need to call the criu binary directly or you can use
'criu swrk'. Restore should be easier, as criu now supports the option
--inherit-fd, which should help to correctly re-route stdin/stdout/stderr.

		Adrian
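For illustration, here is a minimal sketch of what "calling the criu binary directly" could look like from a small Python wrapper. It only builds the command lines and does not run them. The helper names, the PID, the image directory and the pipe resource are hypothetical placeholders. The exact resource string to pass to --inherit-fd depends on what the checkpointed process had open, so check the criu documentation for your version before relying on it.

```python
import shlex


def build_dump_cmd(pid, images_dir):
    """Build a 'criu dump' command line for the process tree rooted at pid.

    Hypothetical helper: it only assembles arguments, it does not run criu.
    """
    return ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]


def build_restore_cmd(images_dir, inherit_fds=None):
    """Build a 'criu restore' command line.

    inherit_fds maps a file descriptor number that is already open in the
    caller to the resource it should replace in the restored process,
    using criu's 'fd[N]:resource' syntax for --inherit-fd.
    """
    cmd = ["criu", "restore", "--images-dir", images_dir]
    for fd, resource in (inherit_fds or {}).items():
        cmd += ["--inherit-fd", f"fd[{fd}]:{resource}"]
    return cmd


if __name__ == "__main__":
    # Placeholder values; a real caller would pass the lists to
    # subprocess.run() with sufficient privileges and keep the listed
    # descriptors open across the exec (e.g. via pass_fds).
    print(shlex.join(build_dump_cmd(1234, "/tmp/ckpt")))
    print(shlex.join(build_restore_cmd("/tmp/ckpt", {1: "pipe:[12345]"})))
```

Keeping command construction separate from execution makes it easy to log or inspect the exact criu invocation on the node that performs the restore.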