Re: about fault tolerance in Giraph

Yuanyuan Tian Mon, 18 Mar 2013 14:39:17 -0700

Hi Avery,

I was just testing how Giraph handles fault tolerance. I wrote a simple 
algorithm that ran without a problem. Then I artificially added a 
line of code to throw an IOException in the 12th superstep when the 
task ID is 0001 and the attempt ID is 0000. The job reported the expected 
IOException, but it could not recover from it. The failed task was never 
retried, even though there were empty map slots left in the cluster. 
Eventually, the whole job failed after timing out.
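
A minimal sketch of this kind of fault injection is shown below. It is only an illustration, not the code from the actual test: the class and method names (FaultInjector, shouldFail, maybeFail) are hypothetical and not part of Giraph, and it assumes that the task ID and attempt ID are already available as integers and that the "12th superstep" corresponds to superstep number 12.

```java
import java.io.IOException;

/**
 * Hypothetical fault-injection helper (not part of Giraph).
 * Throws an IOException exactly once: on the first attempt of one
 * chosen task, in one chosen superstep.
 */
final class FaultInjector {
    // Assumed values matching the description in the message above.
    static final long FAIL_SUPERSTEP = 12L; // assumption: "12th superstep" == 12
    static final int FAIL_TASK_ID = 1;      // task "0001"
    static final int FAIL_ATTEMPT_ID = 0;   // attempt "0000" (first attempt)

    private FaultInjector() { }

    /** Returns true only for the single (superstep, task, attempt) to fail. */
    static boolean shouldFail(long superstep, int taskId, int attemptId) {
        return superstep == FAIL_SUPERSTEP
            && taskId == FAIL_TASK_ID
            && attemptId == FAIL_ATTEMPT_ID;
    }

    /** Call from the per-superstep code path; throws when shouldFail is true. */
    static void maybeFail(long superstep, int taskId, int attemptId)
            throws IOException {
        if (shouldFail(superstep, taskId, attemptId)) {
            throw new IOException("Injected failure: superstep=" + superstep
                + ", task=" + taskId + ", attempt=" + attemptId);
        }
    }
}
```

Because the check includes the attempt ID, a retried attempt (attempt 1) of the same task would pass through the same superstep without throwing, so a successful retry would be visible if the framework scheduled one.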

Yuanyuan

From:   Avery Ching <ach...@apache.org>
To:     user@giraph.apache.org
Date:   03/18/2013 02:09 PM
Subject:        Re: about fault tolerance in Giraph

Hi Yuanyuan,

We haven't tested this feature in a while.  But it should work.  What did 
the job report about why it failed?

Avery

On 3/18/13 10:22 AM, Yuanyuan Tian wrote:
Can anyone help me answer this question? 

Yuanyuan 

From:        Yuanyuan Tian/Almaden/IBM@IBMUS 
To:        user@giraph.apache.org 
Date:        03/15/2013 02:05 PM 
Subject:        about fault tolerance in Giraph 

Hi 

I was testing the fault tolerance of Giraph on a long-running job. I 
noticed that when one of the workers threw an exception, the whole job 
failed without retrying the task, even though I had turned on 
checkpointing and there were available map slots in my cluster. Why wasn't 
the fault tolerance mechanism working? 

I was running a version of Giraph downloaded sometime in June 2012 and I 
used Netty for the communication layer. 

Thanks, 

Yuanyuan
