Re: Hama Fault Tolerance

Suraj Menon Thu, 05 Apr 2012 04:39:54 -0700

Hey Praveen,

https://issues.apache.org/jira/browse/HAMA-505 is the umbrella issue for all
the fault tolerance design and implementation issues.
Please read the "Recovering issues" discussion thread here:
http://mail-archives.apache.org/mod_mbox/incubator-hama-dev/201203.mbox/browser
It gives the gist of where we are headed on this issue.


Fault tolerance in task execution is scheduled for 0.6. I will update
the wiki with the design at some point.

-Suraj

On Thu, Apr 5, 2012 at 7:14 AM, Thomas Jungblut <[email protected]> wrote:

> Currently, if a failure occurs, the whole job is killed.
> After HAMA-503, only the single task that failed (at superstep 5 in your
> example) will be restarted.
> Yes, the state (the messages) is stored in the sync() method.
>
> > 2) What other fault tolerance features are implemented in Hama?
> >
>
> None yet.
>
> > 3) What is check pointing in Hama?
> >
>
> Writing sent messages to HDFS after a computation phase.
>
> Am 5. April 2012 09:10 schrieb Praveen Sripati <[email protected]>:
>
> > 1) If a BSPJob has 10 super steps and a task fails at step 5, does the
> > job need to be run again? Is Hama-503 the solution? Is the state of the
> > job stored in HDFS between super steps?
> >
> > 2) What other fault tolerance features are implemented in Hama?
> >
> > 3) What is check pointing in Hama?
> >
> > Praveen
> >
>
>
>
> --
> Thomas Jungblut
> Berlin <[email protected]>
>
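
As a rough illustration of the behaviour discussed in this thread (checkpoint
the messages at each sync, then restart only the failed task instead of the
whole job), here is a toy Python sketch. It is not Hama code and does not use
the Hama API: CheckpointStore, run_task and fail_at are hypothetical names, and
an in-memory dict stands in for HDFS. The actual design is tracked in HAMA-505
and may differ.

```python
import copy


class CheckpointStore:
    """Hypothetical stand-in for HDFS: maps (task_id, superstep) to messages."""

    def __init__(self):
        self._data = {}

    def save(self, task_id, superstep, messages):
        # Copy so later in-memory changes cannot alter the checkpoint.
        self._data[(task_id, superstep)] = copy.deepcopy(messages)

    def load(self, task_id, superstep):
        return copy.deepcopy(self._data[(task_id, superstep)])


def run_task(task_id, supersteps, store, fail_at=None):
    """Run one task, simulating a single crash at superstep `fail_at`.

    Returns (messages, executed), where `executed` lists the supersteps
    whose computation phase actually ran to completion.
    """
    messages = []
    executed = []
    step = 0
    failed_once = False
    while step < supersteps:
        if step == fail_at and not failed_once:
            failed_once = True
            # Crash: in-memory state is lost, so reload the last checkpoint
            # (assumption: one checkpoint per completed superstep).
            last = step - 1
            messages = store.load(task_id, last) if last >= 0 else []
            continue  # retry the failed superstep only
        messages.append(f"task{task_id}-step{step}")  # computation phase
        executed.append(step)
        store.save(task_id, step, messages)  # checkpoint at the sync point
        step += 1
    return messages, executed


if __name__ == "__main__":
    msgs, steps = run_task(0, 10, CheckpointStore(), fail_at=5)
    print(steps)
```

In this sketch, supersteps 0 to 4 run only once: after the simulated crash at
superstep 5, the task reloads the checkpoint written at superstep 4 and
continues from there, rather than the job starting again from superstep 0.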
