One point I've been trying to put forward in my domain is that, currently, high performance computing != high reliability computing. Not by a long shot. It seems that they are orthogonally coupled.
There are many pieces to this problem-puzzle. Some of these pieces are inter-related. Some of my work has dealt with adaptive approaches - especially re: cascade, and what Ralph refers to as "rewiring", or routing issues. If and when I have anything I believe meaningful to contribute, I will.

On Mon, 2011-06-27 at 08:32 -0400, Josh Hursey wrote:
> It has been on my to-do list for a while to start a FAQ listing of the
> various resilience/FT related activities in and around Open MPI. This
> would provide a starting location for users and new developers could
> go to for an overview of each of the features, and how to activate/use
> the feature.
>
> I'll try to bump that up the priority list and post a message once it
> is ready. Probably a month or so off since I need to collect some
> information from various developers.
>
> -- Josh
>
> On Sun, Jun 26, 2011 at 6:01 PM, Ralph Castain <r...@open-mpi.org>
> wrote:
> > I think we're some ways away from declaring a "resilient
> > ORTE". Josh and I have been committing pieces of it over the
> > last two years, and Wes just committed another piece the other
> > day that might have been titled "fault tolerant OOB" as it
> > primarily addressed maintaining comm routing during node
> > failures.
> >
> > Setting aside the obvious MPI issues, there are several
> > branches/organizations working different aspects of the ORTE
> > problem, including:
> >
> > * fault prediction and proactive migration
> > * mapping algorithms to minimize failure cascades
> > * simultaneous failure handling
> > * alternative wiring methods that eliminate the OOB routing
> >   issues
> >
> > etc. We expect most of those developments to arrive over the
> > next 6-12 months. Once that has occurred, we'll probably be
> > close to what we would call a "resilient" system.
> >
> > Until then, we are improving, but still far from "resilient".
> >
> > On Jun 24, 2011, at 10:24 AM, Ken Lloyd wrote:
> > > Josh and Wesley,
> > >
> > > Will you be presenting Resilient ORTE at Resilience 2011 in
> > > Bordeaux?
> > >
> > > http://xcr.cenit.latech.edu/resilience2011/
> > >
> > > =====================
> > > Kenneth A. Lloyd
> > > CEO - Director of Systems Science
> > > Watt Systems Technologies Inc.
> > > www.wattsys.com
> > > kenneth.ll...@wattsys.com
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel