Hey Praveen,

https://issues.apache.org/jira/browse/HAMA-505 is the umbrella issue for all the fault tolerance design and implementation issues. The discussion thread "Recovering issues" gives the gist of where we are headed on this; please read it here: http://mail-archives.apache.org/mod_mbox/incubator-hama-dev/201203.mbox/browser
Fault tolerance in task execution is scheduled for 0.6. I will update the wiki with the design at some point.

-Suraj

On Thu, Apr 5, 2012 at 7:14 AM, Thomas Jungblut <[email protected]> wrote:

> Currently, if a failure occurs, the whole job is killed.
> After HAMA-503, it will restart a single task when it fails at superstep 5.
> Yes, the state (messages) is stored in the sync() method.
>
> > 2) What other fault tolerance features are implemented in Hama?
>
> None yet.
>
> > 3) What is check pointing in Hama?
>
> Writing sent messages to HDFS after a computation phase.
>
> On 5 April 2012 at 09:10, Praveen Sripati <[email protected]> wrote:
>
> > 1) If a BSPJob has 10 super steps and a task fails at step 5, does the job
> > need to be run again? Is Hama-503 the solution? Is the state of the job
> > stored in HDFS between super steps?
> >
> > 2) What other fault tolerance features are implemented in Hama?
> >
> > 3) What is check pointing in Hama?
> >
> > Praveen
>
> --
> Thomas Jungblut
> Berlin <[email protected]>
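
To illustrate the checkpointing answer in the quoted thread (writing sent messages to storage after a computation phase so a failed task can be restarted), here is a minimal sketch in plain Java. It is not Hama code and uses no Hama or HDFS API: the class `CheckpointSketch`, the methods `checkpoint` and `restore`, and the file naming scheme are all hypothetical, and a local temporary directory stands in for HDFS. It also assumes each message is a single line of text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Hama.
public class CheckpointSketch {

    // Builds the (assumed) checkpoint file name for one task and superstep.
    static Path checkpointFile(Path dir, int taskId, int superstep) {
        return dir.resolve("task-" + taskId + "-step-" + superstep + ".ckpt");
    }

    // Persist the messages a task sent during one superstep.
    // In the scheme described in the thread this would happen around sync().
    static Path checkpoint(Path dir, int taskId, int superstep, List<String> sent)
            throws IOException {
        Files.createDirectories(dir);
        Path file = checkpointFile(dir, taskId, superstep);
        Files.write(file, sent, StandardCharsets.UTF_8); // one message per line
        return file;
    }

    // Read back the messages checkpointed for a given superstep.
    static List<String> restore(Path dir, int taskId, int superstep)
            throws IOException {
        return Files.readAllLines(checkpointFile(dir, taskId, superstep),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A local temp directory stands in for HDFS in this sketch.
        Path dir = Files.createTempDirectory("ckpt-sketch");

        // Supersteps 0..4 complete and are checkpointed.
        for (int step = 0; step < 5; step++) {
            List<String> sent = new ArrayList<>();
            sent.add("message-from-step-" + step);
            checkpoint(dir, 0, step, sent);
        }

        // Simulate a failure at superstep 5: a restarted task reloads the
        // messages from the last completed superstep instead of rerunning
        // the whole job.
        List<String> recovered = restore(dir, 0, 4);
        System.out.println("Restarting at superstep 5 with: " + recovered);
    }
}
```

The point of the sketch is only the shape of the idea: if the messages of each completed superstep are durable, a single failed task can resume from the last checkpoint and the rest of the job does not need to be killed.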
