Hi there,

Lately I've been reading lots of papers about fault tolerance for MPI applications. They all seemed very nice and clear. But as soon as I moved from reading to testing, I was surprised to find that I could not find any implementations. The best I could find was the ability to manually checkpoint and restart the application: no checkpoint protocol, no checkpoint manager, no recovery protocol. Could you please point me to a user-transparent fault tolerance implementation for MPI applications?
Thanks a lot,
Andreea