I'm having some difficulty getting the Open MPI checkpoint/restart fault tolerance working. I have compiled Open MPI with the "--with-ft=cr" flag, but when I attempt to run my test program (ring), the ompi-checkpoint command fails. I have verified that the test program works fine without fault tolerance enabled. Here are the details:

[me@dev1 ~]$ mpirun -np 4 -am ft-enable-cr ring

[me@dev1 ~]$ ps -efa | grep mpirun
me 3052 2820 1 08:25 pts/2 00:00:00 mpirun -np 4 -am ft-enable-cr ring

[me@dev1 ~]$ ompi-checkpoint 3052
[dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file sds_singleton_module.c at line 50
[dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file runtime/orte_init.c at line 311
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_sds_base_set_name failed
--> Returned value Unknown error: 5854512 (5854512) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

Any help would be appreciated.

Thanks.
Attachments: ompi_info.txt.gz, config.log.gz