[OMPI devel] Restarting processes on different node

Leonardo Fialho Wed, 22 Oct 2008 13:28:42 -0400

Hi All,

I'm trying to implement my FT architecture in Open MPI. Right now I need to restart a faulty process from a checkpoint. I saw that Josh uses orte-restart, which calls opal-restart through an ordinary mpirun call. That is not good for me, because in that case the restarted process ends up in a new job. I need to restart the process from its checkpoint in the same job, on another node, under an existing orted. The checkpoints are taken without the "--term" option.

My modified orted receives a "restart request" from my modified heartbeat mechanism. I first tried to restart using the BLCR cr_restart command. It does not work, I think because stderr/stdin/stdout were not handled by the opal environment. So I tried to restart the checkpoint by forking the orted and doing an execvp of opal-restart. It recovers the checkpoint, but after "opal_cr_init" it dies (*** Process received signal ***).
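A minimal sketch of the fork/execvp step described above, assuming the goal is to give the restarted child well-defined standard descriptors. The helper name `spawn_with_log` and the log path are hypothetical; this is not Open MPI code and does not reproduce what the opal environment does with stdio.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of Open MPI: fork and exec argv[0],
 * sending the child's stdout/stderr to log_path and its stdin to /dev/null.
 * Returns the child's pid, or -1 if fork() fails. */
pid_t spawn_with_log(char *const argv[], const char *log_path)
{
    pid_t pid = fork();
    if (pid != 0) {
        return pid;   /* parent (pid > 0) or fork failure (pid == -1) */
    }

    /* Child: set up the three standard descriptors before exec. */
    int in = open("/dev/null", O_RDONLY);
    int out = open(log_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (in < 0 || out < 0) {
        _exit(126);
    }
    dup2(in, STDIN_FILENO);
    dup2(out, STDOUT_FILENO);
    dup2(out, STDERR_FILENO);
    if (in > STDERR_FILENO) close(in);
    if (out > STDERR_FILENO) close(out);

    execvp(argv[0], argv);
    _exit(127);       /* only reached if execvp() failed */
}
```

With the command line that appears in the log of this message, the argument vector would be `{"opal-restart", "-mca", "crs_base_snapshot_dir", "/tmp/radic/1", ".", NULL}`.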


The job structure (from ompi-ps) after a fault is as follows:

-------------------------------------------------------------------------------------
orterun | [[8002,0],0] | 65535 | 30434 | aoclsb | Running |              |
orted   | [[8002,0],1] | 65535 | 30435 | nodo1  | Running | [[8002,0],3] |
orted   | [[8002,0],2] | 65535 | 30438 | nodo2  | Faulty  | [[8002,0],3] |
orted   | [[8002,0],3] | 65535 | 30441 | nodo3  | Running | [[8002,0],4] |
orted   | [[8002,0],4] | 65535 | 30444 | nodo4  | Running | [[8002,0],1] |
-------------------------------------------------------------------------------------
./ping/wait | [[8002,1],0] | 0 | 9069 | nodo1 | Running   | Finished | /tmp/radic/0 | [[8002,0],2] |
./ping/wait | [[8002,1],1] | 0 | 6086 | nodo2 | Restoring | Finished | /tmp/radic/1 | [[8002,0],3] |
./ping/wait | [[8002,1],2] | 0 | 5864 | nodo3 | Running   | Finished | /tmp/radic/2 | [[8002,0],4] |
./ping/wait | [[8002,1],3] | 0 | 7405 | nodo4 | Running   | Finished | /tmp/radic/3 | [[8002,0],1] |

The orted running on "nodo2" dies. This was detected by the orted [[8002,0],1] running on "nodo1", which informed the HNP. The HNP updates the procs structure and looks for processes that were running on the faulty node, then sends a restart request to the orted which holds the checkpoint of the faulty processes.
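The lookup just described can be sketched roughly as follows. All type and field names here are hypothetical and do not correspond to ORTE's real structures; the sketch also assumes that the last column of the process table is the daemon holding that process's checkpoint, which matches the restore request shown in the log.

```c
#include <stddef.h>

/* Hypothetical record; field names are illustrative, not ORTE's. */
struct proc_entry {
    int rank;         /* vpid of the application process */
    int host_daemon;  /* vpid of the orted the process was running under */
    int ckpt_daemon;  /* vpid of the orted assumed to hold its checkpoint */
};

struct restart_req {
    int rank;           /* process to restore */
    int target_daemon;  /* orted that should perform the restore */
};

/* For every process hosted by failed_daemon, emit one restart request
 * addressed to the daemon holding its checkpoint. Returns the number of
 * requests written to out (at most out_cap). */
size_t plan_restarts(const struct proc_entry *procs, size_t n,
                     int failed_daemon,
                     struct restart_req *out, size_t out_cap)
{
    size_t count = 0;
    for (size_t i = 0; i < n && count < out_cap; i++) {
        if (procs[i].host_daemon == failed_daemon) {
            out[count].rank = procs[i].rank;
            out[count].target_daemon = procs[i].ckpt_daemon;
            count++;
        }
    }
    return count;
}
```

Fed with the four ./ping/wait rows above and failed daemon 2, this yields a single request: restore rank 1 on daemon 3.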


Below is the log generated:

[aoclsb:30434] [[8002,0],0] orted_recv: update state request from [[8002,0],3]
[aoclsb:30434] [[8002,0],0] orted_update_state: updating state (17) for orted process (vpid=2)
[aoclsb:30434] [[8002,0],0] orted_update_state: found process [[8002,1],1] on node nodo2, requesting recovery task for that
[aoclsb:30434] [[8002,0],0] orted_update_state: sending restore ([[8002,1],1] process) request to [[8002,0],3]
[nodo3:05841] [[8002,0],3] orted_recv: restore checkpoint request from [[8002,0],0]
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: restarting process from checkpoint file (/tmp/radic/1/ompi_blcr_context.6086)
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: executing restart (opal-restart -mca crs_base_snapshot_dir /tmp/radic/1 .)

[nodo3:05924] opal_cr: init: Verbose Level: 1024
[nodo3:05924] opal_cr: init: FT Enabled: 1
[nodo3:05924] opal_cr: init: Is a tool program: 1
[nodo3:05924] opal_cr: init: Checkpoint Signal: 10
[nodo3:05924] opal_cr: init: Debug SIGPIPE: 0 (False)
[nodo3:05924] opal_cr: init: Temp Directory: /tmp
[nodo2:05965] *** Process received signal ***

The orted which receives the restart request forks and then calls execvp for opal-restart, and then, unfortunately, it dies. I know that the restarted process should generate errors, because the URI of its daemon is incorrect, like all the other environment variables, but I would expect a communication error, or some kind of error other than a process kill. My question is:

1) Why does this process die? I suspect that the checkpoint has pointers to libraries which are not loaded, or are loaded at a different memory position (because this checkpoint comes from another node). In that case the error should be a "segmentation fault" or something like that, no?
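One way to check that suspicion from the forking side is to look at the wait status of the child, which says whether it exited or was killed, and by which signal. The following is only a sketch; `wait_and_describe` is a hypothetical helper, not an Open MPI function.

```c
#include <signal.h>   /* SIGSEGV etc., for callers comparing the result */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: wait for pid and describe in buf how it ended.
 * Returns the terminating signal number, 0 if the child exited normally
 * (with any exit code), or -1 if waitpid() failed or the status was
 * neither of the two. */
int wait_and_describe(pid_t pid, char *buf, size_t len)
{
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        snprintf(buf, len, "waitpid failed");
        return -1;
    }
    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        snprintf(buf, len, "killed by signal %d", sig);
        return sig;
    }
    if (WIFEXITED(status)) {
        snprintf(buf, len, "exited with code %d", WEXITSTATUS(status));
        return 0;
    }
    snprintf(buf, len, "unexpected wait status %d", status);
    return -1;
}
```

If the value returned for the opal-restart child equals SIGSEGV, that would be consistent with the stale-pointer hypothesis; any other signal number points elsewhere.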

If somebody has some information or can give me some help with this error, I'll be grateful.


Thanks,
--

Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
