Hi All,
I'm trying to implement my FT architecture in Open MPI. Right now I need
to restart a faulty process from a checkpoint. I saw that Josh uses
orte-restart, which calls opal-restart through an ordinary mpirun call.
That does not work for me, because in this case the restarted process
ends up in a new job. I need to restart the process from its checkpoint
in the same job, on another node, under an existing orted. The
checkpoints are taken without the "--term" option.
My modified orted receives a "restart request" from my modified heartbeat
mechanism. I first tried to restart using the BLCR cr_restart command. It
does not work, I think because stdin/stdout/stderr are not handled
by the opal environment. So I tried to restart the checkpoint by forking
the orted and calling execvp on opal-restart. It recovers the
checkpoint, but after "opal_cr_init" it dies (*** Process received
signal ***).
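
To narrow down which signal kills the restarted process, the fork/execvp
step can be wrapped so that the parent inspects the wait status. This is
only a minimal sketch under assumptions: spawn_and_wait and report_status
are hypothetical helpers, not ORTE functions, and the log path is a
placeholder. Whether the status seen by the parent reflects the restored
image itself depends on how opal-restart and cr_restart hand over control,
which is not verified here.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of ORTE): fork, attach the child's
 * stdin/stdout/stderr to known descriptors, exec argv[0], and return
 * the raw wait status, or -1 if fork/waitpid fails. */
static int spawn_and_wait(char *const argv[], const char *log_path)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* Child: make sure descriptors 0, 1 and 2 are valid before exec. */
        int in = open("/dev/null", O_RDONLY);
        int out = open(log_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (in < 0 || out < 0)
            _exit(126);
        dup2(in, STDIN_FILENO);
        dup2(out, STDOUT_FILENO);
        dup2(out, STDERR_FILENO);
        if (in > STDERR_FILENO)
            close(in);
        if (out > STDERR_FILENO)
            close(out);
        execvp(argv[0], argv);
        _exit(127); /* only reached if execvp failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return status;
}

/* Print whether the child exited normally or was killed by a signal. */
static void report_status(int status)
{
    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        fprintf(stderr, "child killed by signal %d (%s)\n", sig, strsignal(sig));
    } else if (WIFEXITED(status)) {
        fprintf(stderr, "child exited with code %d\n", WEXITSTATUS(status));
    }
}
```

With argv set to the opal-restart command line shown in the log further
down, report_status(spawn_and_wait(argv, "/tmp/restart.log")) would at
least tell SIGSEGV apart from SIGBUS, SIGPIPE or an explicit kill.
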
The job structure (from ompi-ps) after a fault is as follows:
Process Name | ORTE Name    | Local Rank | PID   | Node   | State   | HB Dest.     |
------------------------------------------------------------------------------------
orterun      | [[8002,0],0] | 65535      | 30434 | aoclsb | Running |              |
orted        | [[8002,0],1] | 65535      | 30435 | nodo1  | Running | [[8002,0],3] |
orted        | [[8002,0],2] | 65535      | 30438 | nodo2  | Faulty  | [[8002,0],3] |
orted        | [[8002,0],3] | 65535      | 30441 | nodo3  | Running | [[8002,0],4] |
orted        | [[8002,0],4] | 65535      | 30444 | nodo4  | Running | [[8002,0],1] |

Process Name | ORTE Name    | Local Rank | PID  | Node  | State     | Ckpt State | Ckpt Loc     | Protector    |
----------------------------------------------------------------------------------------------------------------
./ping/wait  | [[8002,1],0] | 0          | 9069 | nodo1 | Running   | Finished   | /tmp/radic/0 | [[8002,0],2] |
./ping/wait  | [[8002,1],1] | 0          | 6086 | nodo2 | Restoring | Finished   | /tmp/radic/1 | [[8002,0],3] |
./ping/wait  | [[8002,1],2] | 0          | 5864 | nodo3 | Running   | Finished   | /tmp/radic/2 | [[8002,0],4] |
./ping/wait  | [[8002,1],3] | 0          | 7405 | nodo4 | Running   | Finished   | /tmp/radic/3 | [[8002,0],1] |
The orted running on "nodo2" dies. This was detected by the orted
[[8002,0],1] running on "nodo1" and reported to the HNP. The HNP updates
the procs structure and looks for processes that were running on the
faulty node, then sends a restart request to the orted which holds the
checkpoint of the faulty processes.
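
The lookup done by the HNP can be sketched roughly as follows. This is
an illustration only: proc_entry_t, listing and plan_recovery are
hypothetical names, not the real ORTE data structures, and the table
values are copied from the ompi-ps output (nodoN is taken to be the node
of orted vpid N).

```c
#include <stddef.h>

/* Hypothetical record, not the real ORTE proc structure. */
typedef struct {
    int rank;        /* vpid of the application process, [[8002,1],rank] */
    int host_daemon; /* vpid of the orted the process runs under */
    int protector;   /* vpid of the orted holding its checkpoint */
} proc_entry_t;

/* Values copied from the ompi-ps listing. */
static const proc_entry_t listing[] = {
    {0, 1, 2},
    {1, 2, 3},
    {2, 3, 4},
    {3, 4, 1},
};

/* For every process hosted by the faulty daemon, record its rank and the
 * daemon that should receive the restore request. Returns the number of
 * processes to recover; ranks and targets must hold nprocs entries. */
static size_t plan_recovery(const proc_entry_t *procs, size_t nprocs,
                            int faulty_daemon, int *ranks, int *targets)
{
    size_t n = 0;
    for (size_t i = 0; i < nprocs; i++) {
        if (procs[i].host_daemon == faulty_daemon) {
            ranks[n] = procs[i].rank;
            targets[n] = procs[i].protector;
            n++;
        }
    }
    return n;
}
```

For faulty_daemon = 2 this selects rank 1 with target daemon 3, which is
the restore request that appears in the log.
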
Below is the log generated:
[aoclsb:30434] [[8002,0],0] orted_recv: update state request from [[8002,0],3]
[aoclsb:30434] [[8002,0],0] orted_update_state: updating state (17) for orted process (vpid=2)
[aoclsb:30434] [[8002,0],0] orted_update_state: found process [[8002,1],1] on node nodo2, requesting recovery task for that
[aoclsb:30434] [[8002,0],0] orted_update_state: sending restore ([[8002,1],1] process) request to [[8002,0],3]
[nodo3:05841] [[8002,0],3] orted_recv: restore checkpoint request from [[8002,0],0]
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: restarting process from checkpoint file (/tmp/radic/1/ompi_blcr_context.6086)
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: executing restart (opal-restart -mca crs_base_snapshot_dir /tmp/radic/1 .)
[nodo3:05924] opal_cr: init: Verbose Level: 1024
[nodo3:05924] opal_cr: init: FT Enabled: 1
[nodo3:05924] opal_cr: init: Is a tool program: 1
[nodo3:05924] opal_cr: init: Checkpoint Signal: 10
[nodo3:05924] opal_cr: init: Debug SIGPIPE: 0 (False)
[nodo3:05924] opal_cr: init: Temp Directory: /tmp
[nodo2:05965] *** Process received signal ***
The orted which receives the restart request forks and then calls
execvp on opal-restart, and then, unfortunately, it dies. I know
that the restarted process should generate errors, because the URI of its
daemon is incorrect, like all the other environment variables, but I
would expect that to produce a communication error, or some kind of
error other than a process kill. My question is:
1) Why does this process die? I suspect that the checkpoint has pointers
to libraries which are not loaded, or are loaded at a different memory
address (because this checkpoint comes from another node). In that case
the error should be a "segmentation fault" or something similar, no?
If somebody has some information or can give me some help with this
error, I'll be grateful.
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478