Task-farm or manager/worker recovery models typically depend on intercommunicators (e.g., those created by MPI_Comm_spawn) and a resilient MPI implementation. William Gropp and Ewing Lusk have a paper entitled "Fault Tolerance in MPI Programs" that outlines how an application can use these features to recover from process failure.

However, these techniques depend strongly on a resilient MPI implementation and on behaviors that, some may argue, are non-standard. Unfortunately, few MPI implementations are resilient enough to process failure to support recovery in task-farm scenarios. Though Open MPI supports the current MPI 2.1 standard, it is not as resilient to process failure as it could be.

There are a number of people working on improving the resiliency of Open MPI in the face of network and process failure (including myself). We have started to move some of the resiliency work into the Open MPI trunk. Resiliency in Open MPI has been improving over the past few months, but I would not assess it as ready quite yet. Most of the work has focused on the runtime level (ORTE), and there are still some MPI level (OMPI) issues that need to be worked out.

With all of that being said, I would try some of the techniques presented in the Gropp/Lusk paper in your application. Then test it with Open MPI and let us know how it goes.
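To make the manager/worker idea concrete, here is a minimal sketch of the manager-side bookkeeping only. It is plain Python, not MPI code, and it is not taken from the Gropp/Lusk paper. The names run_task_farm, run_on_worker, WorkerFailed and the "budget" failure model are all made up for illustration. In a real MPI program the WorkerFailed exception would roughly correspond to an error code coming back from a send or receive on a worker's intercommunicator, assuming the error handler has been set to MPI_ERRORS_RETURN and the MPI implementation survives the failure.

```python
from collections import deque


class WorkerFailed(Exception):
    """Hypothetical stand-in for an MPI error returned when a worker is gone."""


def run_on_worker(worker, task, budget):
    # Assumed failure model: a worker dies once its remaining budget is zero.
    if budget[worker] == 0:
        raise WorkerFailed(worker)
    budget[worker] -= 1
    return task * task  # placeholder for the real computation


def run_task_farm(tasks, budget):
    """Hand tasks to workers; re-queue any task whose worker failed.

    tasks  -- unique, hashable task inputs
    budget -- dict: worker id -> number of tasks it finishes before dying
    """
    budget = dict(budget)      # do not mutate the caller's dict
    pending = deque(tasks)
    alive = list(budget)       # worker ids, in insertion order
    results = {}
    while pending:
        if not alive:
            raise RuntimeError("all workers failed; tasks remain")
        for worker in list(alive):
            if not pending:
                break
            task = pending.popleft()
            try:
                results[task] = run_on_worker(worker, task, budget)
            except WorkerFailed:
                alive.remove(worker)   # stop using the dead worker
                pending.append(task)   # the lost task goes back in the queue
    return results, alive


if __name__ == "__main__":
    done, survivors = run_task_farm(range(6), {"w0": 1, "w1": 100, "w2": 2})
    print(done, survivors)
```

The point of the sketch is that the manager never blocks on a dead worker: it drops that worker and re-queues the task. This only works because the tasks are independent and the workers do not talk to each other.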

Best,
Josh

On Aug 3, 2009, at 10:30 AM, Durga Choudhury wrote:

Is that kind of approach possible within an MPI framework? Perhaps a
grid approach would be better. More experienced people, speak up,
please?
(The reason I say that is that I too am interested in a solution to
that kind of problem, where an individual blade of a blade server
fails and correcting for that failure on the fly is better than taking
checkpoints and restarting the whole process excluding the failed
blade.)

Durga

On Mon, Aug 3, 2009 at 9:21 AM, jody<jody....@gmail.com> wrote:
Hi

I guess "task-farming" could give you a certain amount of the kind of
fault tolerance you want (i.e., a master process distributes tasks to
idle slave processes; however, this will only work if the slave
processes don't need to communicate with each other).

Jody


On Mon, Aug 3, 2009 at 1:24 PM, vipin kumar<vipinkuma...@gmail.com> wrote:
Hi all,

Thanks Durga for your reply.

Jeff, you once wrote code for the Mandelbrot set to demonstrate fault tolerance
in LAM/MPI, i.e., killing any slave process doesn't
affect the others. That is exactly the behaviour I am looking for in Open MPI. I attempted it, but had no luck. Can you please tell me how to write such programs in Open MPI?

Thanks in advance.

Regards,
On Thu, Jul 9, 2009 at 8:30 PM, Durga Choudhury <dpcho...@gmail.com> wrote:

Although I have perhaps the least experience on the topic in this
list, I will take a shot; more experienced people, please correct me:

The MPI standard specifies communication mechanisms, not fault tolerance at
any level. You may achieve network fault tolerance at the IP level by
implementing 'equal cost multipath' routes (which means connecting two
equally capable NICs to the same destination and modifying the
kernel routing table to use both; the kernel will dynamically
load balance). At the MAC level, you can achieve the same effect by
trunking multiple network cards.

You can achieve process-level fault tolerance with a checkpointing
scheme such as BLCR, which has been tested to work with Open MPI (and
with other processes as well).

Durga

On Thu, Jul 9, 2009 at 4:57 AM, vipin kumar<vipinkuma...@gmail.com> wrote:

Hi all,

I want to know whether Open MPI supports network and process fault
tolerance. If there is an example demonstrating these features, that
would be best.

Regards,
--
Vipin K.
Research Engineer,
C-DOTB, India

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

