I have tried increasing the number of retries to 10, but the result is the 
same: no retries at all. Isn't retrying a failed task the default behavior 
for Hadoop? Why isn't it working in the case of Giraph? 
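For reference, this is roughly how I raised the retry limit on the job 
configuration before submitting (a sketch: mapred.map.max.attempts is the 
Hadoop 1.x property name; on newer versions the equivalent is 
mapreduce.map.maxattempts):

```java
// Sketch: raising the per-map-task retry limit before job submission.
// "mapred.map.max.attempts" is the Hadoop 1.x property name
// ("mapreduce.map.maxattempts" in later Hadoop versions).
Configuration conf = job.getConfiguration();
conf.setInt("mapred.map.max.attempts", 10);
```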

Here is the message from master: 

2013-03-18 18:23:30,628 ERROR org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Did not receive enough processes in time (only 54 of 55 
required).  This occurs if you do not have enough map tasks available 
simultaneously on your Hadoop instance to fulfill the number of requested 
workers.
2013-03-18 18:23:30,628 FATAL org.apache.giraph.graph.BspServiceMaster: 
coordinateSuperstep: Not enough healthy workers for superstep 12
2013-03-18 18:23:30,629 INFO org.apache.giraph.graph.BspServiceMaster: 
setJobState: 
{"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on 
superstep 12
2013-03-18 18:23:30,649 FATAL org.apache.giraph.graph.BspServiceMaster: 
failJob: Killing job job_201303181655_0004
2013-03-18 18:23:30,703 FATAL org.apache.giraph.graph.GraphMapper: 
uncaughtException: OverrideExceptionHandler on thread 
org.apache.giraph.graph.MasterThread, msg = null, exiting...
java.lang.NullPointerException
        at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1411)
        at org.apache.giraph.graph.MasterThread.run(MasterThread.java:111)
2013-03-18 18:23:30,705 WARN org.apache.giraph.zk.ZooKeeperManager: 
onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper 
process.


All of the workers except the one that threw the expected exception report 
the following error:

2013-03-18 18:20:54,107 ERROR org.apache.zookeeper.ClientCnxn: Error while 
calling watcher 
java.lang.RuntimeException: process: Disconnected from ZooKeeper, cannot 
recover - WatchedEvent state:Disconnected type:None path:null
        at org.apache.giraph.graph.BspService.process(BspService.java:974)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
2013-03-18 18:20:55,110 INFO org.apache.zookeeper.ClientCnxn: Opening 
socket connection to server idp30.almaden.ibm.com/172.16.0.30:22181
2013-03-18 18:20:55,111 WARN org.apache.zookeeper.ClientCnxn: Session 
0x13d8037f8100008 for server null, unexpected error, closing socket 
connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2013-03-18 18:20:55,218 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-03-18 18:20:55,254 INFO org.apache.hadoop.io.nativeio.NativeIO: 
Initialized cache for UID to User mapping with a cache timeout of 14400 
seconds.
2013-03-18 18:20:55,254 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
UserName ytian for UID 3005 from the native implementation
2013-03-18 18:20:55,257 WARN org.apache.hadoop.mapred.Child: Error running 
child
java.lang.IllegalStateException: startSuperstep: KeeperException getting 
assignments
        at org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:928)
        at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:649)
        at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:891)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for 
/_hadoopBsp/job_201303181655_0004/_applicationAttemptsDir/0/_superstepDir/2/_partitionAssignments
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
        at org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:909)

Yuanyuan



From:   Avery Ching <ach...@apache.org>
To:     user@giraph.apache.org
Cc:     Yuanyuan Tian/Almaden/IBM@IBMUS
Date:   03/18/2013 03:05 PM
Subject:        Re: about fault tolerance in Giraph



How many retries did you set for Hadoop map task failures?  Might want to 
try 10?

Avery

On 3/18/13 2:38 PM, Yuanyuan Tian wrote:
Hi Avery, 

I was just testing how Giraph handles fault tolerance. I wrote a simple 
algorithm that runs without a problem. Then I artificially added a line of 
code to throw an IOException during the 12th superstep when the task ID is 
0001 and the attempt ID is 0000. The job reported the expected IOException, 
but it could not recover from it. There was no retry of the failed task, 
even though there were empty map slots left in the cluster. Eventually, the 
whole job failed after a timeout. 
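The injection itself is just a guard at the top of compute(). Stripped of 
the Giraph classes, the decision logic looks roughly like this standalone 
sketch (the class and method names here are illustrative, not Giraph API; 
in the real code the superstep comes from getSuperstep() and the task and 
attempt IDs are parsed from the mapper's TaskAttemptID):

```java
// Standalone sketch of the fault-injection guard (no Giraph dependencies).
public class FaultInjection {
    /** Returns true when the artificial IOException should be thrown:
     *  superstep 12, task 0001, first attempt (0000) only. */
    static boolean shouldFail(long superstep, int taskId, int attemptId) {
        return superstep == 12 && taskId == 1 && attemptId == 0;
    }

    public static void main(String[] args) {
        // The faulty worker fails only on its first attempt...
        System.out.println(shouldFail(12, 1, 0)); // true
        // ...so a retried attempt (0001) of the same task would run clean,
        System.out.println(shouldFail(12, 1, 1)); // false
        // and other supersteps and tasks are unaffected.
        System.out.println(shouldFail(11, 1, 0)); // false
    }
}
```

Since the guard is false for any attempt ID other than 0000, a single 
retry of the failed map task should have been enough for the job to 
proceed.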

Yuanyuan 



From:        Avery Ching <ach...@apache.org> 
To:        user@giraph.apache.org 
Date:        03/18/2013 02:09 PM 
Subject:        Re: about fault tolerance in Giraph 



Hi Yuanyuan,

We haven't tested this feature in a while, but it should work.  What did 
the job report about why it failed?

Avery

On 3/18/13 10:22 AM, Yuanyuan Tian wrote: 
Can anyone help me answer the question? 

Yuanyuan 



From:        Yuanyuan Tian/Almaden/IBM@IBMUS 
To:        user@giraph.apache.org 
Date:        03/15/2013 02:05 PM 
Subject:        about fault tolerance in Giraph 



Hi 

I was testing the fault tolerance of Giraph on a long-running job. I 
noticed that when one of the workers threw an exception, the whole job 
failed without retrying the task, even though I had turned on checkpointing 
and there were available map slots in my cluster. Why wasn't the fault 
tolerance mechanism working? 
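For completeness, checkpointing was turned on through the job 
configuration, roughly like this (a sketch: the frequency and directory 
values are just the ones I used, and the checkpoint-directory property name 
is from memory, so please correct me if it differs in your version):

```java
// Sketch: enabling Giraph checkpointing on the job configuration.
// Checkpointing every 2 supersteps should let a restarted worker resume
// from the last checkpoint instead of superstep 0.
Configuration conf = job.getConfiguration();
conf.setInt("giraph.checkpointFrequency", 2);
conf.set("giraph.checkpointDirectory", "/tmp/giraph_checkpoints");
```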

I was running a version of Giraph downloaded sometime in June 2012, and I 
used Netty for the communication layer. 

Thanks, 

Yuanyuan 

