I have created a new CRS component using criu (criu.org) to support checkpoint/restart in Open MPI. My current patch provides only the framework and the configure scripts needed to detect and link against criu. With this patch, orte-checkpoint can request a checkpoint, and the new CRIU CRS component is used:
[dcbz:13766] orte_cr: init: orte_cr_init()
[dcbz:13766] crs:criu: opal_crs_criu_prelaunch
[dcbz:13766] crs:criu: opal_crs_criu_prelaunch
[dcbz:13771] opal_cr: init: Verbose Level: 30
[dcbz:13771] opal_cr: init: FT Enabled: true
[dcbz:13771] opal_cr: init: Is a tool program: false
[dcbz:13771] opal_cr: init: Debug SIGPIPE: 30 (False)
[dcbz:13771] opal_cr: init: Checkpoint Signal: 10
[dcbz:13771] opal_cr: init: FT Use thread: true
[dcbz:13771] opal_cr: init: FT thread sleep: check = 0, wait = 100
[dcbz:13771] opal_cr: init: C/R Debugging Enabled [False]
[dcbz:13771] opal_cr: init: Checkpoint Signal (Debug): 20
[dcbz:13771] opal_cr: init: Temp Directory: /tmp
...
[dcbz:13772] orte_cr: coord: orte_cr_coord(Checkpoint)
[dcbz:13772] orte_cr: coord_pre_ckpt: orte_cr_coord_pre_ckpt()
[dcbz:13772] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
[dcbz:13772] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
[dcbz:13772] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
[dcbz:13772] crs:criu: checkpoint(13772, ---)
[dcbz:13772] crs:criu: criu_init_opts() returned 0
[dcbz:13771] orte_cr: coord_post_ckpt: orte_cr_coord_post_ckpt()
[dcbz:13771] ompi_cr: coord_post_ckpt: ompi_cr_coord_post_ckpt()
[dcbz:13771] opal_cr: opal_cr_inc_core_ckpt: Take the checkpoint.
[dcbz:13771] crs:criu: checkpoint(13771, ---)
[dcbz:13771] crs:criu: criu_init_opts() returned 0
...
[dcbz:13766] 13766: Checkpoint established for process [55729,0].
[dcbz:13771] ompi_cr: coord: ompi_cr_coord(Running)
[dcbz:13771] orte_cr: coord: orte_cr_coord(Running)
[dcbz:13766] 13766: Successfully restarted process [55729,0].
[dcbz:13772] ompi_cr: coord: ompi_cr_coord(Running)
[dcbz:13772] orte_cr: coord: orte_cr_coord(Running)

It seems the C/R code basically works again and now needs to be filled with the actual code to take checkpoints using criu.
The patch I want to check in is available at:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=7e0c7c940705cc572242097ff53f9e0ee6db11ea

The patch only creates files in opal/mca/crs/criu and does not touch any other code.

Adrian