Re: [OMPI devel] How can I achieve node fail over

Josh Hursey Mon, 11 Jan 2010 15:43:40 -0500


On Jan 6, 2010, at 9:04 AM, Sai Sudheesh wrote:

Hi,

      Just about two months ago I started experimenting with OpenMPI.
      I found this piece of software very interesting.

      How can I make this software fault tolerant?


Depends on what you mean by fault tolerant. :)

      As of now I am running this software on two machines
      having quad core processors and fedora 10.
      I am using openmpi1.3.2.

      If a remote machine fails while a parallel task is running on both
      the machines, is it possible to reassign the task assigned to it to
      some other node available and continue the computation instead of
      aborting the entire computation?

This scenario is currently not supported by Open MPI. If an MPI process fails, Open MPI will clean up the job.

A few of us have been working on this scenario off-trunk for a while now. It is progressing nicely, but not available for public consumption just yet.

      Can anybody tell me where I have to look for more information
      regarding this.
      I have tried with FT MPI but tired of it.


FT-MPI should be able to work in this scenario.

      I have also heard of CIFTS-FTB, can I use for solving this?

The CIFTS FTB is focused on a slightly different problem, that of coordination amongst software components before/during/after a failure. Currently, Open MPI is able to interact with the CIFTS FTB to send fault information. Soon, Open MPI will be able to respond to such fault information and take appropriate actions. The first generation of this work is scheduled to be brought into the Open MPI trunk soon, and will support catching of some basic events. Handling the scenario you mentioned at the top of the message will come shortly thereafter.

      Is it necessary to make a source code change?

In some cases yes, in others no. It really depends on what the final solution set looks like and how involved your application wants to be in the recovery process. At the very least, the application will likely have to specify the MPI_ERRORS_RETURN error handler for each communicator to override the default MPI_ERRORS_ARE_FATAL.

      Have anybody a solution already with you?

There are a couple of transparent fault tolerance solutions in the current trunk.
 - Checkpoint/Restart of the entire MPI job (requires full job restart on failure):
   http://www.osl.iu.edu/research/ft/ompi-cr/
 - Message Logging:
   https://svn.open-mpi.org/trac/ompi/wiki/EventLog_CR

For non-MPI jobs you could also check out the Open Resilient Cluster Manager (ORCM) project:

  http://www.open-mpi.org/projects/orcm/


      If an application is killed by OS at the remote node
      mpirun is aborting and reports an error.
      What kind of signal the remote orted is to mpirun?
      How can I handle it?

I'm not sure what you're asking here. The orted detects the local process failure and notifies the mpirun process using the OOB (out-of-band) communication channel. The mpirun process then initiates the shutdown procedure.


-- Josh


      I know that I have asked a lot of questions..
      I will be thankful to you If anybody could respond with
      at least some suggestions.

with love
sudheesh.
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel
