[OMPI devel] Communications and it cache

Leonardo Fialho Fri, 31 Oct 2008 12:51:11 -0400

Hi All,

I have succeeded in recovering a faulty process from a previous checkpoint. There are three situations which I can handle:


1) application process fault
   Caused by: error during memory allocation? processor errors?
   Restart: in this case I recover the process on the same orted

2) orted process fault
   Caused by: error during memory allocation? processor errors?
   Restart: in this case I recover the processes managed by the faulty orted on an available node (if one exists) or on another orted.


3) node fault (or isolated node)
   Caused by: many possibilities: network, power, crash...
   Restart: in this case I recover the processes managed by the faulty orted on an available node (if one exists) or on another orted.
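
To make the three cases concrete, the placement decision could be sketched roughly as below. This is only an illustration of the policy described above, not the actual restart routine: the names `Fault`, `choose_restart_target`, `spare_nodes` and `live_orteds` are hypothetical and do not exist in Open MPI.

```python
from enum import Enum
from typing import Optional, Sequence


class Fault(Enum):
    APP_PROCESS = 1   # case 1: application process fault
    ORTED = 2         # case 2: orted process fault
    NODE = 3          # case 3: node fault (or isolated node)


def choose_restart_target(fault: Fault,
                          faulty_orted: str,
                          spare_nodes: Sequence[str],
                          live_orteds: Sequence[str]) -> Optional[str]:
    """Return where to restart the affected process(es), or None if nowhere."""
    if fault is Fault.APP_PROCESS:
        # The orted itself survived, so recover under the same daemon.
        return faulty_orted
    # Cases 2 and 3: prefer an available node, otherwise fall back to
    # another orted that is still alive.
    if spare_nodes:
        return spare_nodes[0]
    survivors = [o for o in live_orteds if o != faulty_orted]
    return survivors[0] if survivors else None
```

Picking the first spare node or the first surviving orted is an arbitrary choice made for the sketch; a real policy would probably also consider load.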

All the processes are launched and restarted correctly, and all the environment variables used by the ESS are configured automatically by the restart routine. However, communication is a problem for me because the BML/BTL keeps the connection to the faulty process cached. I think that Josh does not have this problem because he restarts all the processes.

I saw that a possible solution is to perform a close/open operation on the BML, but I do not want to do it in all processes, nor even in the processes which have a connection to my faulty process cached. My idea is: when the following error occurs,

[btl_tcp_endpoint.c:625:mca_btl_tcp_endpoint_complete_connect] connect() failed: Connection refused (111)

the connection to the faulty process is removed from the cache and a new request to the NS is performed. The process location and state are kept up to date on the HNP by my FT routines. What do you think about this?
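
The idea can be sketched as a cache that evicts a single entry on "Connection refused" and asks for the address again. This is a language-neutral illustration in Python, not BML/BTL code: `EndpointCache`, `lookup` (standing in for the request to the NS) and `connect` are hypothetical names, and retrying exactly once is an assumption of the sketch.

```python
import errno
from typing import Callable, Dict, Tuple

Address = Tuple[str, int]


class EndpointCache:
    """Per-peer address cache with on-demand invalidation."""

    def __init__(self,
                 lookup: Callable[[str], Address],
                 connect: Callable[[Address], object]) -> None:
        self._lookup = lookup      # stand-in for the request to the NS
        self._connect = connect    # stand-in for the actual connect()
        self._cache: Dict[str, Address] = {}

    def address_of(self, peer: str) -> Address:
        # Ask the name service only on a cache miss.
        if peer not in self._cache:
            self._cache[peer] = self._lookup(peer)
        return self._cache[peer]

    def connect(self, peer: str) -> object:
        try:
            return self._connect(self.address_of(peer))
        except OSError as exc:
            if exc.errno != errno.ECONNREFUSED:
                raise
            # The cached entry points at the dead process: drop only this
            # entry, look the peer up again and retry once.
            self._cache.pop(peer, None)
            return self._connect(self.address_of(peer))
```

Only the process that actually hits the refused connection pays for the new lookup, and only for that one peer, so no global close/open is needed. The sketch does assume that the location has already been updated by the time the retry happens.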


Thanks,

--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
