Thanks for the clarification. So the messages are written to HDFS at each checkpoint and, in case of a failure, the tasks resume from the last checkpointed state.
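
To check that understanding, here is a minimal toy sketch of checkpoint-and-restart in plain Java. It is not the Hama API: the class `CheckpointSketch`, its methods, and the in-memory map standing in for HDFS are all hypothetical names chosen for illustration, and it assumes a checkpoint is taken at the end of every superstep.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of checkpoint-and-restart (hypothetical, not the Hama API). */
final class CheckpointSketch {

    // Stands in for HDFS: superstep index -> messages sent in that superstep.
    private final Map<Integer, List<String>> checkpointStore = new HashMap<>();
    private int lastCheckpoint = -1; // -1 means no checkpoint has been written yet

    /** Persist the messages sent during a superstep (assumed to happen at the barrier). */
    public void checkpoint(int superstep, List<String> sentMessages) {
        checkpointStore.put(superstep, new ArrayList<>(sentMessages));
        lastCheckpoint = superstep;
    }

    /** Superstep at which a restarted task resumes: the one after the last checkpoint. */
    public int resumeSuperstep() {
        return lastCheckpoint + 1;
    }

    /** Messages a restarted task reads back before resuming. */
    public List<String> recoverMessages() {
        return checkpointStore.getOrDefault(lastCheckpoint, new ArrayList<>());
    }

    /**
     * Runs `total` supersteps with a single simulated failure at `failAt`
     * and returns how many superstep executions were needed overall.
     */
    public static int run(int total, int failAt) {
        CheckpointSketch store = new CheckpointSketch();
        int executed = 0;
        boolean failed = false;
        int step = 0;
        while (step < total) {
            executed++;
            if (step == failAt && !failed) {
                failed = true;
                // Failure: nothing is checkpointed, resume after the last checkpoint.
                step = store.resumeSuperstep();
                continue;
            }
            List<String> sent = new ArrayList<>();
            sent.add("msg-from-superstep-" + step);
            store.checkpoint(step, sent);
            step++;
        }
        return executed;
    }
}
```

With `CheckpointSketch.run(10, 5)` only the failed superstep is repeated, so the count is one more than a failure-free run, instead of starting again from superstep 0.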
Praveen

On Thu, Apr 5, 2012 at 5:09 PM, Suraj Menon <[email protected]> wrote:
> Hey Praveen,
>
> https://issues.apache.org/jira/browse/HAMA-505 is an umbrella issue to all
> the fault tolerance design and implementation issues.
> Please read the discussion thread "Recovering issues" here -
> http://mail-archives.apache.org/mod_mbox/incubator-hama-dev/201203.mbox/browser
> that has a gist of where we are headed for this issue.
>
> Fault tolerance in task execution is scheduled for 0.6. I would be updating
> the Wiki with the design sometime.
>
> -Suraj
>
> On Thu, Apr 5, 2012 at 7:14 AM, Thomas Jungblut <[email protected]> wrote:
>
> > Currently if failure occurs, the whole job is killed.
> > After 503, it will restart a single task when it fails at superstep 5.
> > Yes the state (messages) is stored in the sync() method.
> >
> > > 2) What other fault tolerance features are implemented in Hama?
> >
> > None yet.
> >
> > > 3) What is check pointing in Hama?
> >
> > Writing sent messages to HDFS after a computation phase.
> >
> > On 5 April 2012 at 09:10, Praveen Sripati <[email protected]> wrote:
> >
> > > 1) If a BSPJob has 10 super steps and a task fails at step 5, does the job
> > > need to be run again? Is Hama-503 the solution? Is the state of the job
> > > stored in HDFS between super steps?
> > >
> > > 2) What other fault tolerance features are implemented in Hama?
> > >
> > > 3) What is check pointing in Hama?
> > >
> > > Praveen
> >
> > --
> > Thomas Jungblut
> > Berlin <[email protected]>
