Hi Josh,

It is good to hear that work is in progress on the resiliency of Open MPI. I have been waiting for this capability. My own development work is almost finished, and I am waiting for this so that I can test my programs. It would help if you could tell me roughly how long it will take for Open MPI to become a resilient implementation.

By resiliency I mean that the abnormal termination, or the intentional killing, of one process should not cause any other connected process (parent or sibling) to be terminated.
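To make that definition concrete, here is a minimal sketch of the behaviour I am after. It is plain Python using the standard subprocess module, not MPI, so it says nothing about how Open MPI itself behaves. The worker script, the name run_task_farm, the retry limit, and the rule that task 3 dies on its first attempt are all made up for illustration.

```python
import subprocess
import sys

# Hypothetical worker: squares its input, but exits abnormally on the first
# attempt at task 3 to stand in for a killed process.
WORKER_SRC = """
import os, sys
task, attempt = int(sys.argv[1]), int(sys.argv[2])
if task == 3 and attempt == 0:
    os._exit(1)  # simulate abnormal termination of this worker
print(task * task)
"""


def run_task_farm(tasks, max_attempts=3):
    """Run each task in its own worker process and retry tasks whose worker died.

    Returns (results, failures): results maps task -> value, and failures
    lists the tasks whose worker exited abnormally, one entry per failure.
    """
    results = {}
    failures = []
    pending = [(task, 0) for task in tasks]
    while pending:
        task, attempt = pending.pop(0)
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_SRC, str(task), str(attempt)],
            capture_output=True,
            text=True,
        )
        if proc.returncode == 0:
            results[task] = int(proc.stdout)
        else:
            # The manager survives the worker's death and re-queues the task.
            failures.append(task)
            if attempt + 1 < max_attempts:
                pending.append((task, attempt + 1))
    return results, failures


if __name__ == "__main__":
    results, failures = run_task_farm([1, 2, 3, 4])
    print("results:", results)
    print("workers lost on tasks:", failures)
```

The point is only that the manager notices the lost worker, keeps running, and hands the task to a fresh worker. That is what I would like to be able to do with processes connected through MPI.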
Thanks.

Regards,

On Mon, Aug 3, 2009 at 8:37 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:

> Task-farm or manager/worker recovery models typically depend on
> intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI
> implementation. William Gropp and Ewing Lusk have a paper entitled "Fault
> Tolerance in MPI Programs" that outlines how an application might take
> advantage of these features in order to recover from process failure.
>
> However, these techniques strongly depend upon resilient MPI
> implementations, and behaviors that, some may argue, are non-standard.
> Unfortunately there are not many MPI implementations that are sufficiently
> resilient in the face of process failure to support failure in task-farm
> scenarios. Though Open MPI supports the current MPI 2.1 standard, it is not
> as resilient to process failure as it could be.
>
> There are a number of people working on improving the resiliency of Open
> MPI in the face of network and process failure (including myself). We have
> started to move some of the resiliency work into the Open MPI trunk.
> Resiliency in Open MPI has been improving over the past few months, but I
> would not assess it as ready quite yet. Most of the work has focused on the
> runtime level (ORTE), and there are still some MPI level (OMPI) issues that
> need to be worked out.
>
> With all of that being said, I would try some of the techniques presented
> in the Gropp/Lusk paper in your application. Then test it with Open MPI and
> let us know how it goes.
>
> Best,
> Josh
>
>
> On Aug 3, 2009, at 10:30 AM, Durga Choudhury wrote:
>
>> Is that kind of approach possible within an MPI framework? Perhaps a
>> grid approach would be better. More experienced people, speak up,
>> please?
>> (The reason I say that is that I too am interested in the solution of
>> that kind of problem, where an individual blade of a blade server
>> fails and correcting for that failure on the fly is better than taking
>> checkpoints and restarting the whole process excluding the failed
>> blade.)
>>
>> Durga
>>
>> On Mon, Aug 3, 2009 at 9:21 AM, jody<jody....@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I guess "task-farming" could give you a certain amount of the kind of
>>> fault-tolerance you want
>>> (i.e. a master process distributes tasks to idle slave processes;
>>> however, this will only work if the slave processes don't need to
>>> communicate with each other).
>>>
>>> Jody
>>>
>>>
>>> On Mon, Aug 3, 2009 at 1:24 PM, vipin kumar<vipinkuma...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Thanks Durga for your reply.
>>>>
>>>> Jeff, you once wrote code for the Mandelbrot set to demonstrate fault
>>>> tolerance in LAM-MPI, i.e. killing any slave process doesn't affect
>>>> the others. That is exactly the behaviour I am looking for in Open MPI.
>>>> I attempted it, but had no luck. Can you please tell me how to write
>>>> such programs in Open MPI?
>>>>
>>>> Thanks in advance.
>>>>
>>>> Regards,
>>>> On Thu, Jul 9, 2009 at 8:30 PM, Durga Choudhury <dpcho...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Although I have perhaps the least experience on the topic in this
>>>>> list, I will take a shot; more experienced people, please correct me:
>>>>>
>>>>> MPI standards specify communication mechanisms, not fault tolerance at
>>>>> any level. You may achieve network tolerance at the IP level by
>>>>> implementing 'equal cost multipath' routes (which means two equally
>>>>> capable NIC cards connecting to the same destination and modifying the
>>>>> kernel routing table to use both cards; the kernel will dynamically
>>>>> load balance). At the MAC level, you can achieve the same effect by
>>>>> trunking multiple network cards.
>>>>>
>>>>> You can achieve process level fault tolerance by a checkpointing
>>>>> scheme such as BLCR, which has been tested to work with OpenMPI (and
>>>>> other processes as well)
>>>>>
>>>>> Durga
>>>>>
>>>>> On Thu, Jul 9, 2009 at 4:57 AM, vipin kumar<vipinkuma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I want to know whether open mpi supports Network and process fault
>>>>>> tolerance or not? If there is any example demonstrating these
>>>>>> features that will be best.
>>>>>>
>>>>>> Regards,
>>>>>> --
>>>>>> Vipin K.
>>>>>> Research Engineer,
>>>>>> C-DOTB, India
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>> --
>>>> Vipin K.
>>>> Research Engineer,
>>>> C-DOTB, India

--
Vipin K.
Research Engineer,
C-DOTB, India