Ken,
At UTK we focus on developing two generic frameworks for scalable
fault-tolerant approaches. One is based on uncoordinated checkpoint/restart,
while the other is application-level.
1) Uncoordinated C/R based on message logging. Such approaches are fully
automatic, rely on an external check
Thanks. I've read your (Joshua Hersey's) Ph.D. thesis on fault
tolerance using checkpointing with much interest. It would be of further
interest to get the range of possible user requirements for defining the
behaviors in response to various faults.
Ken Lloyd
On Fri, 2011-04-22 at 15:03 -0400, J
On Apr 22, 2011, at 1:20 PM, N.M. Maclaren wrote:
> On Apr 22 2011, Ralph Castain wrote:
>
>> Several of us are. Josh and George (plus teammates), and some other outside
>> folks, are working the MPI side of it.
>>
>> I'm working only the ORTE side of the problem.
>>
>> Quite a bit of capabil
The MPI Forum is in the process of defining this - the work going on at ORNL is
in this context.
Rich
- Original Message -
From: N.M. Maclaren [mailto:n...@cam.ac.uk]
Sent: Friday, April 22, 2011 01:20 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] Adaptive or fault-tolerant MPI
On Apr 22 2011, Ralph Castain wrote:
Several of us are. Josh and George (plus teammates), and some other
outside folks, are working the MPI side of it.
I'm working only the ORTE side of the problem.
Quite a bit of capability is already in the trunk, but there is always
more to do :-)
Is th
Several of us are. Josh and George (plus teammates), and some other outside
folks, are working the MPI side of it.
I'm working only the ORTE side of the problem.
Quite a bit of capability is already in the trunk, but there is always more to
do :-)
On Apr 22, 2011, at 9:09 AM, Ken Lloyd wrote:
Before I jump in, is anyone already actively working in this area?
Ken