On Jan 6, 2010, at 9:04 AM, Sai Sudheesh wrote:

Hi,

      Just about two months ago I started experimenting with Open MPI.
      I found this piece of software very interesting.

      How can I make this software fault tolerant?

Depends on what you mean by fault tolerant. :)

      As of now I am running this software on two machines
      with quad-core processors and Fedora 10.
      I am using Open MPI 1.3.2.

      If a remote machine fails while a parallel task is running on
      both machines, is it possible to reassign the work assigned to
      it to some other available node and continue the computation
      instead of aborting the entire computation?

This scenario is currently not supported by Open MPI. If an MPI process fails, Open MPI will clean up the job.

A few of us have been working on this scenario off-trunk for a while now. It is progressing nicely, but not available for public consumption just yet.


      Can anybody tell me where I have to look for more information
      regarding this?
      I have tried FT-MPI but am tired of it.

FT-MPI should be able to work in this scenario.

      I have also heard of CIFTS-FTB; can I use it to solve this?

The CIFTS FTB is focused on a slightly different problem, that of coordination amongst software components before/during/after a failure. Currently, Open MPI is able to interact with the CIFTS FTB to send fault information. Soon, Open MPI will be able to respond to such fault information and take appropriate actions. The first generation of this work is scheduled to be brought into the Open MPI trunk soon, and will support catching of some basic events. Handling the scenario you mentioned at the top of the message will come shortly thereafter.

      Is it necessary to make a source code change?

In some cases yes, in others no. It really depends on what the final solution set looks like and how involved your application wants to be in the recovery process. At the very least, the application will likely have to specify the MPI_ERRORS_RETURN error handler for each communicator to override the default MPI_ERRORS_ARE_FATAL.
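As a minimal sketch of that last point, here is what switching a communicator to MPI_ERRORS_RETURN looks like in C. The MPI calls shown are standard; the recovery branch is a hypothetical placeholder, since what is actually recoverable after a failure depends on the MPI implementation:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* The default handler is MPI_ERRORS_ARE_FATAL, which aborts the
     * whole job on any failure.  Switch to MPI_ERRORS_RETURN so MPI
     * calls hand error codes back to the application instead. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = rank;
    /* A failed call now returns an error code rather than killing
     * the job outright. */
    int rc = MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: broadcast failed: %s\n", rank, msg);
        /* application-specific recovery would go here (hypothetical) */
    }

    MPI_Finalize();
    return 0;
}
```

Note that setting MPI_ERRORS_RETURN only stops the library from aborting on your behalf; it does not by itself make the job survivable.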


      Does anybody already have a solution?

There are a couple of transparent fault tolerance solutions in the current trunk:
 - Checkpoint/Restart of the entire MPI job (requires full job restart on failure):
   http://www.osl.iu.edu/research/ft/ompi-cr/
 - Message Logging:
   https://svn.open-mpi.org/trac/ompi/wiki/EventLog_CR
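For reference, the checkpoint/restart workflow roughly looks like the following (command names are from the Open MPI C/R documentation; the application name, process count, and PID are placeholders):

```shell
# Launch the job with the checkpoint/restart AMCA parameter set
mpirun -np 8 -am ft-enable-cr ./my_mpi_app

# From another terminal, checkpoint the running job via mpirun's PID
ompi-checkpoint <pid_of_mpirun>

# Later, restart the whole job from the resulting global snapshot
ompi-restart ompi_global_snapshot_<pid>.ckpt
```

See the ompi-cr page linked above for the checkpointer (e.g. BLCR) setup that this requires.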

For non-MPI jobs you could also check out the Open Resilient Cluster Manager (ORCM) project:
  http://www.open-mpi.org/projects/orcm/


      If an application is killed by the OS at the remote node,
      mpirun aborts and reports an error.
      What kind of signal does the remote orted send to mpirun?
      How can I handle it?

I'm not sure what you're asking here. The orted detects the local process failure and notifies the mpirun process using the OOB (out-of-band) communication channel. The mpirun process then initiates the shutdown procedure.

-- Josh


      I know that I have asked a lot of questions.
      I would be thankful if anybody could respond with
      at least some suggestions.

with love
sudheesh.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
