How many retries did you set for Hadoop map task failures? You might
want to try 10.
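[Editor's note: the retry count Avery refers to is the standard Hadoop per-map-task attempt limit. A minimal sketch of how it could be raised, assuming a Hadoop 1.x-era cluster (the property was renamed mapreduce.map.maxattempts in the Hadoop 2.x API):

```xml
<!-- mapred-site.xml fragment: allow up to 10 attempts per map task.
     Giraph workers run as map tasks, so this bounds how many times a
     failed worker can be relaunched. -->
<property>
  <name>mapred.map.max.attempts</name>
  <value>10</value>
</property>
```

The same property can also be passed per job on the command line with -Dmapred.map.max.attempts=10.]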
Avery
On 3/18/13 2:38 PM, Yuanyuan Tian wrote:
Hi Avery,
I was just testing how Giraph handles fault tolerance. I wrote a
simple algorithm that ran without a problem. Then I artificially
added a line of code to throw an IOException in the 12th superstep
when the task ID is 0001 and the attempt ID is 0000. The job reported
the expected IOException, but it could not recover from it. There was
no retry of the failed task, even though there were empty map slots
left in the cluster. Eventually, the whole job failed after timing out.
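[Editor's note: a self-contained sketch of the kind of fault injection described above. The names are illustrative, not from the original test code; in a real Giraph job the check would sit inside the vertex compute() method, using getSuperstep() and the task attempt ID obtained from the Hadoop task attempt string.

```java
import java.io.IOException;

public class FaultInjectionSketch {

    // Decide whether to inject the artificial failure: only in
    // superstep 12, only for task 0001, and only on its first
    // attempt (0000), so a relaunched attempt would succeed.
    static boolean shouldInjectFault(long superstep,
                                     String taskId,
                                     String attemptId) {
        return superstep == 12
            && "0001".equals(taskId)
            && "0000".equals(attemptId);
    }

    // Illustrates the injection point: throw only when the predicate
    // matches, mimicking the test described in the email.
    static void maybeFail(long superstep, String taskId, String attemptId)
            throws IOException {
        if (shouldInjectFault(superstep, taskId, attemptId)) {
            throw new IOException("artificial failure for fault-tolerance test");
        }
    }

    public static void main(String[] args) {
        // First attempt of task 0001 at superstep 12 fails...
        System.out.println(shouldInjectFault(12, "0001", "0000")); // true
        // ...but a retry (attempt 0001) would run clean.
        System.out.println(shouldInjectFault(12, "0001", "0001")); // false
    }
}
```

If retries were working, attempt 0001 of the failed task should be relaunched, replay from the last checkpoint, and pass superstep 12 cleanly.]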
Yuanyuan
From: Avery Ching <ach...@apache.org>
To: user@giraph.apache.org
Date: 03/18/2013 02:09 PM
Subject: Re: about fault tolerance in Giraph
------------------------------------------------------------------------
Hi Yuanyuan,
We haven't tested this feature in a while, but it should work. What
did the job report about why it failed?
Avery
On 3/18/13 10:22 AM, Yuanyuan Tian wrote:
Can anyone help me answer the question?
Yuanyuan
From: Yuanyuan Tian/Almaden/IBM@IBMUS
To: user@giraph.apache.org
Date: 03/15/2013 02:05 PM
Subject: about fault tolerance in Giraph
------------------------------------------------------------------------
Hi,
I was testing the fault tolerance of Giraph on a long-running job. I
noticed that when one of the workers threw an exception, the whole job
failed without retrying the task, even though I had turned on
checkpointing and there were available map slots in my cluster. Why
wasn't the fault tolerance mechanism working?
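[Editor's note: for reference, a sketch of how checkpointing is typically enabled on a Giraph job of that era. The flags are illustrative and property names may differ between Giraph versions; the checkpoint directory path is a placeholder:

```shell
# giraph.checkpointFrequency: write a checkpoint every N supersteps
#   (0 disables checkpointing).
# giraph.checkpointDirectory: HDFS path where checkpoints are stored.
# mapred.map.max.attempts must be > 1, or Hadoop will never relaunch
#   a failed map task regardless of checkpoints.
hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
    -Dgiraph.checkpointFrequency=5 \
    -Dgiraph.checkpointDirectory=/tmp/giraph-checkpoints \
    -Dmapred.map.max.attempts=4 \
    [job-specific arguments]
```

Checkpointing only bounds how much work is lost on recovery; the relaunch of the failed worker itself is still Hadoop's per-task retry mechanism.]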
I was running a version of Giraph downloaded sometime in June 2012,
using Netty for the communication layer.
Thanks,
Yuanyuan