Hi All,
I have succeeded in recovering a faulty process from a previous
checkpoint. There are three situations that I can handle:
1) application process fault
Caused by: error during memory allocation? processor errors?
Restart: in this case I recover the process on the same orted
2) orted process fault
Caused by: error during memory allocation? processor errors?
Restart: in this case I recover the processes managed by the faulty
orted on an available node (if one exists) or under another orted.
3) node fault (or isolated node)
Caused by: many possible situations: network failure, power loss, crash...
Restart: in this case I recover the processes managed by the faulty
orted on an available node (if one exists) or under another orted.
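
The restart placement described in the three cases above can be sketched roughly as follows. This is only an illustration: `fault_kind_t`, `daemon_t` and `choose_restart_daemon` are hypothetical names, not part of ORTE, and the "prefer a spare node, otherwise any surviving orted" rule is an assumption about how "available node" is decided.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fault classes, mirroring the three cases listed above. */
typedef enum { FAULT_APP_PROC, FAULT_ORTED, FAULT_NODE } fault_kind_t;

/* Hypothetical view of one daemon (orted) and the node it runs on. */
typedef struct {
    int id;     /* daemon identifier */
    int alive;  /* nonzero if the daemon is still running */
    int spare;  /* nonzero if it runs on an otherwise unused node (assumption) */
} daemon_t;

/*
 * Pick the daemon that should host the restarted process(es).
 * daemon_id is the orted that managed the faulty process(es).
 * Returns the chosen daemon id, or -1 if no daemon can take them.
 */
int choose_restart_daemon(fault_kind_t kind, int daemon_id,
                          const daemon_t *daemons, size_t n)
{
    assert(daemons != NULL || n == 0);

    if (kind == FAULT_APP_PROC) {
        /* Case 1: the orted survived, so restart under the same orted. */
        return daemon_id;
    }

    /* Cases 2 and 3: the orted (or its whole node) is gone. */
    int fallback = -1;
    for (size_t i = 0; i < n; i++) {
        if (!daemons[i].alive || daemons[i].id == daemon_id) {
            continue;                 /* skip dead daemons and the faulty one */
        }
        if (daemons[i].spare) {
            return daemons[i].id;     /* prefer a spare node if one exists */
        }
        if (fallback < 0) {
            fallback = daemons[i].id; /* otherwise remember the first survivor */
        }
    }
    return fallback;
}
```

With three daemons where daemon 0 has failed, daemon 1 is busy and daemon 2 is a spare, an orted fault on daemon 0 would select daemon 2; if daemon 2 is also down, the processes go to daemon 1.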
All the processes are launched and restarted correctly, and all the
environment variables used by the ESS are configured automatically by
the restart routine. However, communication is a problem for me
because the BML/BTL keeps the connection to the faulty process cached. I
think that Josh does not have this problem because he restarts all the
processes.
I saw that a possible solution is to perform a close/open operation on
the BML, but I do not want to do it in all processes, nor even in the
processes that have a connection to my faulty process cached. My idea is:
when the following error occurs,
[btl_tcp_endpoint.c:625:mca_btl_tcp_endpoint_complete_connect] connect()
failed: Connection refused (111)
the connection to the faulty process is removed from the cache and a new
request to the NS is performed. The process location and state are
kept up to date on the HNP by my FT routines. What do you think
about this?
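
The idea can be sketched roughly as follows, outside of the real BTL code. Everything here is hypothetical and only illustrates the control flow: `endpoint_t`, `endpoint_connect`, and the `ns_lookup_fn` / `connect_fn` callbacks are not Open MPI APIs, and `demo_lookup` / `demo_connect` are stand-ins that pretend the restarted process now listens on port 5001.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical cached endpoint for one peer process. */
typedef struct {
    int rank;             /* peer identifier */
    int valid;            /* nonzero if addr/port are believed current */
    unsigned addr;        /* cached address */
    unsigned short port;  /* cached port */
} endpoint_t;

/* Hypothetical name-service query (answered by the HNP); 0 on success. */
typedef int (*ns_lookup_fn)(int rank, unsigned *addr, unsigned short *port);
/* Hypothetical connect wrapper; returns 0 or an errno value. */
typedef int (*connect_fn)(unsigned addr, unsigned short port);

/*
 * Connect to a peer. On ECONNREFUSED, drop only this cached entry,
 * ask the name service again, and retry once. Returns 0 or -1.
 */
int endpoint_connect(endpoint_t *ep, ns_lookup_fn lookup, connect_fn do_connect)
{
    assert(ep != NULL && lookup != NULL && do_connect != NULL);

    for (int attempt = 0; attempt < 2; attempt++) {
        if (!ep->valid) {
            if (lookup(ep->rank, &ep->addr, &ep->port) != 0) {
                return -1;            /* name service has no location */
            }
            ep->valid = 1;
        }
        int rc = do_connect(ep->addr, ep->port);
        if (rc == 0) {
            return 0;                 /* connected */
        }
        if (rc != ECONNREFUSED) {
            return -1;                /* unrelated error: do not retry */
        }
        ep->valid = 0;                /* stale entry: invalidate and retry */
    }
    return -1;
}

/* Demonstration stand-in: the restarted process is reachable on port 5001. */
int demo_lookup(int rank, unsigned *addr, unsigned short *port)
{
    (void)rank;
    *addr = 0x7f000001u;
    *port = 5001;
    return 0;
}

/* Demonstration stand-in: only port 5001 accepts connections. */
int demo_connect(unsigned addr, unsigned short port)
{
    (void)addr;
    return port == 5001 ? 0 : ECONNREFUSED;
}
```

Starting from an endpoint cached with port 5000, `endpoint_connect(&ep, demo_lookup, demo_connect)` gets a refusal, invalidates that single entry, looks the peer up again, and succeeds on port 5001. No other process and no other cached connection is touched.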
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478