Hi, I have three questions:
1. How does Open MPI detect a faulty node?
2. If one node dies, can the program continue running? What if two or more nodes die?
3. How can a faulty (dead) node be recovered? Is there any way to recover without checkpointing, since checkpointing is time-consuming and degrades performance?

Thanks!
Rui Wang
ICT, P.R. China