Task-farm or manager/worker recovery models typically depend on
intercommunicators (e.g., from MPI_Comm_spawn) and a resilient MPI
implementation. William Gropp and Ewing Lusk have a paper entitled
"Fault Tolerance in MPI Programs" that outlines how an application
might take advantage of these features in order to recover from
process failure.
However, these techniques depend strongly on a resilient MPI
implementation and on behaviors that, some may argue, are non-standard.
Unfortunately, few MPI implementations are sufficiently resilient in
the face of process failure to support recovery in task-farm
scenarios. Though Open MPI supports the current MPI 2.1 standard, it
is not as resilient to process failure as it could be.
A number of people (myself included) are working on improving the
resiliency of Open MPI in the face of network and process failure. We
have started to move some of this work into the Open MPI trunk.
Resiliency in Open MPI has been improving over the past few months,
but I would not assess it as quite ready yet. Most of the work has
focused on the runtime level (ORTE), and there are still some
MPI-level (OMPI) issues that need to be worked out.
With all of that said, I would try some of the techniques presented in
the Gropp/Lusk paper in your application, then test it with Open MPI
and let us know how it goes.
Best,
Josh
On Aug 3, 2009, at 10:30 AM, Durga Choudhury wrote:
Is that kind of approach possible within an MPI framework? Perhaps a
grid approach would be better. More experienced people, speak up,
please?
(The reason I say that is that I too am interested in solving that
kind of problem, where an individual blade of a blade server fails,
and correcting for that failure on the fly is better than taking
checkpoints and restarting the whole job while excluding the failed
blade.)
Durga
On Mon, Aug 3, 2009 at 9:21 AM, jody<jody....@gmail.com> wrote:
Hi
I guess "task-farming" could give you a certain amount of the kind of
fault tolerance you want (i.e., a master process distributes tasks to
idle slave processes; however, this will only work if the slave
processes don't need to communicate with each other).
Jody
On Mon, Aug 3, 2009 at 1:24 PM, vipin kumar<vipinkuma...@gmail.com>
wrote:
Hi all,
Thanks Durga for your reply.
Jeff, you once wrote a Mandelbrot set code to demonstrate fault
tolerance in LAM/MPI, i.e., killing any slave process didn't affect
the others. That is exactly the behaviour I am looking for in Open
MPI. I attempted it, but had no luck. Can you please explain how to
write such programs in Open MPI?
Thanks in advance.
Regards,
On Thu, Jul 9, 2009 at 8:30 PM, Durga Choudhury
<dpcho...@gmail.com> wrote:
Although I have perhaps the least experience with this topic on the
list, I will take a shot; more experienced people, please correct
me:
The MPI standard specifies communication mechanisms, not fault
tolerance at any level. You may achieve network tolerance at the IP
level by implementing 'equal cost multipath' routes (which means two
equally capable NICs connecting to the same destination, with the
kernel routing table modified to use both; the kernel will then
dynamically load-balance across them). At the MAC level, you can
achieve the same effect by trunking multiple network cards.
You can achieve process-level fault tolerance with a checkpointing
scheme such as BLCR, which has been tested to work with Open MPI (and
with other processes as well).
Durga
On Thu, Jul 9, 2009 at 4:57 AM, vipin
kumar<vipinkuma...@gmail.com> wrote:
Hi all,
I want to know whether Open MPI supports network and process fault
tolerance. If there is an example demonstrating these features, that
would be best.
Regards,
--
Vipin K.
Research Engineer,
C-DOTB, India
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Vipin K.
Research Engineer,
C-DOTB, India