I'm having some difficulty getting Open MPI checkpoint/restart fault
tolerance working.  I have compiled Open MPI with the "--with-ft=cr"
flag, but when I attempt to run my test program (ring), the
ompi-checkpoint command fails.  I have verified that the test program
works fine without fault tolerance enabled.  Here are the details:
 
     [me@dev1 ~]$ mpirun -np 4 -am ft-enable-cr ring
     [me@dev1 ~]$ ps -efa | grep mpirun
     me     3052  2820  1 08:25 pts/2    00:00:00 mpirun -np 4 -am ft-enable-cr ring

     [me@dev1 ~]$ ompi-checkpoint 3052
     [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file sds_singleton_module.c at line 50
     [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file runtime/orte_init.c at line 311
 
     --------------------------------------------------------------------------
     It looks like orte_init failed for some reason; your parallel process is
     likely to abort.  There are many reasons that a parallel process can
     fail during orte_init; some of which are due to configuration or
     environment problems.  This failure appears to be an internal failure;
     here's some additional information (which may only be relevant to an
     Open MPI developer):

       orte_sds_base_set_name failed
       --> Returned value Unknown error: 5854512 (5854512) instead of ORTE_SUCCESS
     --------------------------------------------------------------------------

Any help would be appreciated.  Thanks.

Attachment: ompi_info.txt.gz

Attachment: config.log.gz
