How many retries did you set for Hadoop map task failures? You might
want to try 10.
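[Editor's note: the retry count Avery refers to is the standard Hadoop per-map-task attempt limit. A minimal sketch of how it could be raised, assuming a Hadoop 1.x-era cluster (the property was renamed mapreduce.map.maxattempts in the Hadoop 2.x API):

```xml
<!-- mapred-site.xml fragment: allow up to 10 attempts per map task.
     Giraph workers run as map tasks, so this bounds how many times a
     failed worker can be relaunched. -->
<property>
  <name>mapred.map.max.attempts</name>
  <value>10</value>
</property>
```

The same property can also be passed per job on the command line with -Dmapred.map.max.attempts=10.]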
Avery
On 3/18/13 2:38 PM, Yuanyuan Tian wrote:
Hi Avery,
I was just testing how Giraph handles fault tolerance. I wrote a
simple algorithm that ran without a problem. Then I artificially
added a line of code to throw an IOException in the 12th superstep
when the task ID is 0001 and the attempt ID is 0000. The job reported
the expected IOException, but it could not recover from it. There was
no retry of the failed task, even though there were empty map slots
left in the cluster. Eventually, the whole job failed after timing out.
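[Editor's note: a self-contained sketch of the kind of fault injection described above. The names are illustrative, not from the original test code; in a real Giraph job the check would sit inside the vertex compute() method, using getSuperstep() and the task attempt ID obtained from the Hadoop task attempt string.

```java
import java.io.IOException;

public class FaultInjectionSketch {

    // Decide whether to inject the artificial failure: only in
    // superstep 12, only for task 0001, and only on its first
    // attempt (0000), so a relaunched attempt would succeed.
    static boolean shouldInjectFault(long superstep,
                                     String taskId,
                                     String attemptId) {
        return superstep == 12
            && "0001".equals(taskId)
            && "0000".equals(attemptId);
    }

    // Illustrates the injection point: throw only when the predicate
    // matches, mimicking the test described in the email.
    static void maybeFail(long superstep, String taskId, String attemptId)
            throws IOException {
        if (shouldInjectFault(superstep, taskId, attemptId)) {
            throw new IOException("artificial failure for fault-tolerance test");
        }
    }

    public static void main(String[] args) {
        // First attempt of task 0001 at superstep 12 fails...
        System.out.println(shouldInjectFault(12, "0001", "0000")); // true
        // ...but a retry (attempt 0001) would run clean.
        System.out.println(shouldInjectFault(12, "0001", "0001")); // false
    }
}
```

If retries were working, attempt 0001 of the failed task should be relaunched, replay from the last checkpoint, and pass superstep 12 cleanly.]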
Yuanyuan
From: Avery Ching <ach...@apache.org>
To: user@giraph.apache.org
Date: 03/18/2013 02:09 PM
Subject: Re: about fault tolerance in Giraph
------------------------------------------------------------------------
Hi Yuanyuan,
We haven't tested this feature in a while, but it should work. What
did the job report about why it failed?
Avery
On 3/18/13 10:22 AM, Yuanyuan Tian wrote:
Can anyone help me answer the question?
Yuanyuan
From: Yuanyuan Tian/Almaden/IBM@IBMUS
To: user@giraph.apache.org
Date: 03/15/2013 02:05 PM
Subject: about fault tolerance in Giraph
------------------------------------------------------------------------
Hi,
I was testing the fault tolerance of Giraph on a long-running job. I
noticed that when one of the workers threw an exception, the whole job
failed without retrying the task, even though I had turned on
checkpointing and there were available map slots in my cluster. Why
wasn't the fault tolerance mechanism working?
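[Editor's note: for reference, a sketch of how checkpointing is typically enabled on a Giraph job of that era. The flags are illustrative and property names may differ between Giraph versions; the checkpoint directory path is a placeholder:

```shell
# giraph.checkpointFrequency: write a checkpoint every N supersteps
#   (0 disables checkpointing).
# giraph.checkpointDirectory: HDFS path where checkpoints are stored.
# mapred.map.max.attempts must be > 1, or Hadoop will never relaunch
#   a failed map task regardless of checkpoints.
hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
    -Dgiraph.checkpointFrequency=5 \
    -Dgiraph.checkpointDirectory=/tmp/giraph-checkpoints \
    -Dmapred.map.max.attempts=4 \
    [job-specific arguments]
```

Checkpointing only bounds how much work is lost on recovery; the relaunch of the failed worker itself is still Hadoop's per-task retry mechanism.]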
I was running a version of Giraph downloaded sometime in June 2012,
using Netty for the communication layer.
Thanks,
Yuanyuan