One point I've been trying to put forward in my domain is that, currently,
high performance computing != high reliability computing. Not by a long
shot. It seems that they are orthogonally coupled.

There are many pieces to this problem-puzzle. Some of these pieces are
inter-related. Some of my work has dealt with adaptive approaches -
especially re: cascade, and what Ralph refers to as "rewiring", or
routing issues.

If and when I have anything I believe meaningful to contribute, I will.

On Mon, 2011-06-27 at 08:32 -0400, Josh Hursey wrote:

> It has been on my to-do list for a while to start a FAQ listing of the
> various resilience/FT related activities in and around Open MPI. This
> would provide a starting location that users and new developers could
> go to for an overview of each of the features, and how to activate/use
> each one.
> 
> 
> 
> I'll try to bump that up the priority list and post a message once it
> is ready. Probably a month or so off since I need to collect some
> information from various developers.
> 
> 
> -- Josh
> 
> 
> On Sun, Jun 26, 2011 at 6:01 PM, Ralph Castain <r...@open-mpi.org>
> wrote:
> 
>         I think we're some ways away from declaring a "resilient
>         ORTE". Josh and I have been committing pieces of it over the
>         last two years, and Wes just committed another piece the other
>         day that might have been titled "fault tolerant OOB" as it
>         primarily addressed maintaining comm routing during node
>         failures.
>         
>         
>         
>         Setting aside the obvious MPI issues, there are several
>         branches/organizations working on different aspects of the ORTE
>         problem, including:
>         
>         
>         * fault prediction and proactive migration
>         
>         
>         * mapping algorithms to minimize failure cascades
>         
>         
>         * simultaneous failure handling
>         
>         
>         * alternative wiring methods that eliminate the OOB routing
>         issues
>         
>         
>         etc. We expect most of those developments to arrive over the
>         next 6-12 months. Once that has occurred, we'll probably be
>         close to what we would call a "resilient" system.
>         
>         
>         Until then, we are improving, but still far from "resilient".
>         
>         
>         
>         
>         
>         On Jun 24, 2011, at 10:24 AM, Ken Lloyd wrote:
>         
>         
>         
>         > 
>         > Josh and Wesley,
>         > 
>         > Will you be presenting Resilient ORTE at Resilience 2011 in
>         > Bordeaux?
>         > 
>         > http://xcr.cenit.latech.edu/resilience2011/
>         > 
>         > =====================
>         > Kenneth A. Lloyd
>         > CEO - Director of Systems Science
>         > Watt Systems Technologies Inc.
>         > www.wattsys.com
>         > kenneth.ll...@wattsys.com 
>         > 
>         > This e-mail is covered by the Electronic Communications
>         > Privacy Act, 18 U.S.C. 2510-2521 and is intended only for
>         > the addressee named above. It may contain privileged or
>         > confidential information. If you are not the addressee you
>         > must not copy, distribute, disclose or use any of the
>         > information in it. If you have received it in error please
>         > delete it and immediately notify the sender.
>         > 
>         > 
>         > 
>         > _______________________________________________
>         > devel mailing list
>         > de...@open-mpi.org
>         > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>         
>         
>         
>         
>         
> 
> 
> 
> 
> -- 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> 
> 


