Hi All,

I have had success recovering a faulty process from a previous checkpoint. There are three situations which I can handle:

1) application process fault
   Caused by: error during memory allocation? processor errors?
   Restart: in this case I recover the process on the same orted

2) orted process fault
   Caused by: error during memory allocation? processor errors?
   Restart: in this case I recover the processes managed by the faulty orted on an available node (if one exists) or on another orted.

3) node fault (or isolated node)
   Caused by: many possible situations: network failure, power failure, crash...
   Restart: in this case I recover the processes managed by the faulty orted on an available node (if one exists) or on another orted.

All the processes are launched and restarted correctly, and all the environment variables used by the ESS are configured automatically by the restart routine. However, communication is a problem for me because the BML/BTL keeps the connection to the faulty process cached. I think Josh does not have this problem because he restarts all the processes.

I saw that a possible solution is to perform a close/open operation on the BML, but I do not want to do that in all processes, nor even in the processes which have a cached connection to my faulty process. My idea is: when the following error occurs,

[btl_tcp_endpoint.c:625:mca_btl_tcp_endpoint_complete_connect] connect() failed: Connection refused (111)

the connection to the faulty process is removed from the cache and a new request to the NS is performed. The process location and state are kept up to date on the HNP by my FT routines. What do you think about this?
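
To make the idea concrete, here is a rough, self-contained sketch of the error-handling path I have in mind. The helpers ft_lookup_new_location() and endpoint_cache_remove() are only placeholders for "query the name service on the HNP" and "drop the stale BTL endpoint"; they are not real Open MPI functions, and the real code would of course live inside the TCP BTL endpoint handling.

#include <errno.h>
#include <stdio.h>

struct proc_addr { char host[64]; int port; };

static int ft_lookup_new_location(int proc_rank, struct proc_addr *out)
{
    /* Stub: the real lookup would ask the name service on the HNP, which
       the FT routines keep up to date with the process' current location. */
    snprintf(out->host, sizeof(out->host), "node-of-rank-%d", proc_rank);
    out->port = 4000 + proc_rank;
    return 0;
}

static void endpoint_cache_remove(int proc_rank)
{
    /* Stub: the real code would drop the cached endpoint so the BML/BTL
       stops trying the old (dead) address for this peer. */
    (void)proc_rank;
}

static int handle_connect_failure(int proc_rank, int connect_errno)
{
    if (connect_errno != ECONNREFUSED) {
        return -1;                        /* other errors stay fatal */
    }

    /* The peer was probably restarted elsewhere: forget the stale address
       and ask the name service where the process lives now. */
    endpoint_cache_remove(proc_rank);

    struct proc_addr addr;
    if (ft_lookup_new_location(proc_rank, &addr) != 0) {
        return -1;                        /* lookup failed: give up */
    }

    printf("retrying rank %d at %s:%d\n", proc_rank, addr.host, addr.port);
    return 0;                             /* caller re-issues connect() */
}

int main(void)
{
    return handle_connect_failure(3, ECONNREFUSED) == 0 ? 0 : 1;
}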

Thanks,

--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
