Hi,

About two months ago I started experimenting with Open MPI, and I find this piece of software very interesting.
How can I make this software fault tolerant? At present I am running Open MPI 1.3.2 on two machines, each with a quad-core processor and Fedora 10. If a remote machine fails while a parallel task is running on both machines, is it possible to reassign the work that was assigned to it to some other available node and continue the computation, instead of aborting the entire computation?

Can anybody tell me where to look for more information on this? I have tried FT-MPI but grew tired of it. I have also heard of CIFTS-FTB; can I use it to solve this? Is a source code change necessary? Does anybody already have a solution?

If an application is killed by the OS on the remote node, mpirun aborts and reports an error. What kind of signal does the remote orted send to mpirun? How can I handle it?

I know that I have asked a lot of questions. I would be thankful if anybody could respond with at least some suggestions.

With love,
Sudheesh