Re: Hama Fault Tolerance

Praveen Sripati Thu, 05 Apr 2012 07:14:34 -0700

Thanks for the clarification. So, the messages are stored in HDFS whenever
there is a checkpoint, and in case of a failure the tasks will resume
from the last checkpointed state.
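
To check my understanding, here is a toy sketch of that checkpoint-and-resume
idea. It is plain Python, not the Hama API: the names run_job, checkpoints,
fail_task and fail_step are made up for illustration, an in-memory dict stands
in for HDFS, and for simplicity it rolls back every task rather than restarting
only the failed one.

```python
import copy


def run_job(num_supersteps, num_tasks, fail_task=None, fail_step=None):
    """Toy BSP job: each task adds up its inbox and sends its total to a neighbour.

    Returns the final per-task state and the number of rollbacks performed.
    """
    checkpoints = {}                          # stand-in for HDFS: superstep -> snapshot
    state = [0] * num_tasks                   # local state of each task
    inbox = [[1] for _ in range(num_tasks)]   # messages delivered in the next superstep
    already_failed = False                    # inject the failure only once
    rollbacks = 0
    step = 0

    while step < num_supersteps:
        # Checkpoint at the barrier: save state and the messages about to be delivered.
        checkpoints[step] = copy.deepcopy((state, inbox))
        try:
            outbox = [[] for _ in range(num_tasks)]
            for task in range(num_tasks):
                if (task, step) == (fail_task, fail_step) and not already_failed:
                    already_failed = True
                    raise RuntimeError("simulated task failure")
                state[task] += sum(inbox[task])
                outbox[(task + 1) % num_tasks].append(state[task])
            inbox = outbox                    # barrier: messages become visible
            step += 1
        except RuntimeError:
            # Resume from the last checkpoint instead of restarting the whole job.
            state, inbox = copy.deepcopy(checkpoints[step])
            rollbacks += 1

    return state, rollbacks


clean_state, clean_rollbacks = run_job(10, 3)
failed_state, failed_rollbacks = run_job(10, 3, fail_task=1, fail_step=5)
print(clean_state == failed_state, clean_rollbacks, failed_rollbacks)
```

In this sketch the run with a failure at superstep 5 re-executes only that
superstep, and ends in the same state as the run without a failure.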


Praveen

On Thu, Apr 5, 2012 at 5:09 PM, Suraj Menon <[email protected]> wrote:

> Hey Praveen,
>
> https://issues.apache.org/jira/browse/HAMA-505 is the umbrella issue for all
> the fault tolerance design and implementation issues.
> Please read the discussion thread "Recovering issues" here:
>
> http://mail-archives.apache.org/mod_mbox/incubator-hama-dev/201203.mbox/browser
>
> It has the gist of where we are headed on this issue.
>
> Fault tolerance in task execution is scheduled for 0.6. I will update
> the wiki with the design at some point.
>
> -Suraj
>
> On Thu, Apr 5, 2012 at 7:14 AM, Thomas Jungblut <
> [email protected]> wrote:
>
> > Currently, if a failure occurs, the whole job is killed.
> > After HAMA-503, only the single task that failed at superstep 5 will be
> > restarted.
> > Yes, the state (the messages) is stored in the sync() method.
> >
> > > 2) What other fault tolerance features are implemented in Hama?
> >
> > None yet.
> >
> > > 3) What is check pointing in Hama?
> >
> > Writing sent messages to HDFS after a computation phase.
> >
> > On 5 April 2012 at 09:10, Praveen Sripati <[email protected]> wrote:
> >
> > > 1) If a BSPJob has 10 super steps and a task fails at step 5, does the
> > > job need to be run again? Is Hama-503 the solution? Is the state of the
> > > job stored in HDFS between super steps?
> > >
> > > 2) What other fault tolerance features are implemented in Hama?
> > >
> > > 3) What is check pointing in Hama?
> > >
> > > Praveen
> > >
> >
> >
> >
> > --
> > Thomas Jungblut
> > Berlin <[email protected]>
> >
>
