I have tried increasing the number of retries to 10, but the result is the same: no retries at all. Isn't retrying a failed task the default behavior for Hadoop? Why isn't it working in the case of Giraph?
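For reference, on the Hadoop 1.x-era MapReduce stack used here, the per-task retry limit is controlled by `mapred.map.max.attempts`. A minimal sketch of passing it on the command line when launching a Giraph job (the jar name, computation class, and worker count below are placeholders, not taken from this thread):

```shell
# Hadoop 1.x setting: allow up to 10 attempts per map task.
# Jar, class, and -w value are illustrative placeholders.
hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
    -Dmapred.map.max.attempts=10 \
    my.app.MyComputation \
    -w 55
```

Note that this only governs how often the MapReduce framework re-launches a failed map task; whether the restarted Giraph worker can rejoin the computation is a separate question.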
Here is the message from the master:

2013-03-18 18:23:30,628 ERROR org.apache.giraph.graph.BspServiceMaster: checkWorkers: Did not receive enough processes in time (only 54 of 55 required). This occurs if you do not have enough map tasks available simultaneously on your Hadoop instance to fulfill the number of requested workers.
2013-03-18 18:23:30,628 FATAL org.apache.giraph.graph.BspServiceMaster: coordinateSuperstep: Not enough healthy workers for superstep 12
2013-03-18 18:23:30,629 INFO org.apache.giraph.graph.BspServiceMaster: setJobState: {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on superstep 12
2013-03-18 18:23:30,649 FATAL org.apache.giraph.graph.BspServiceMaster: failJob: Killing job job_201303181655_0004
2013-03-18 18:23:30,703 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.graph.MasterThread, msg = null, exiting...
java.lang.NullPointerException
	at org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1411)
	at org.apache.giraph.graph.MasterThread.run(MasterThread.java:111)
2013-03-18 18:23:30,705 WARN org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.
The workers, except for the one that threw the expected exception, report the following error:

2013-03-18 18:20:54,107 ERROR org.apache.zookeeper.ClientCnxn: Error while calling watcher
java.lang.RuntimeException: process: Disconnected from ZooKeeper, cannot recover - WatchedEvent state:Disconnected type:None path:null
	at org.apache.giraph.graph.BspService.process(BspService.java:974)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
2013-03-18 18:20:55,110 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server idp30.almaden.ibm.com/172.16.0.30:22181
2013-03-18 18:20:55,111 WARN org.apache.zookeeper.ClientCnxn: Session 0x13d8037f8100008 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2013-03-18 18:20:55,218 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-03-18 18:20:55,254 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-03-18 18:20:55,254 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName ytian for UID 3005 from the native implementation
2013-03-18 18:20:55,257 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.IllegalStateException: startSuperstep: KeeperException getting assignments
	at org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:928)
	at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:649)
	at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:891)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
	at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /_hadoopBsp/job_201303181655_0004/_applicationAttemptsDir/0/_superstepDir/2/_partitionAssignments
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
	at org.apache.giraph.graph.BspServiceWorker.startSuperstep(BspServiceWorker.java:909)

Yuanyuan

From: Avery Ching <ach...@apache.org>
To: user@giraph.apache.org
Cc: Yuanyuan Tian/Almaden/IBM@IBMUS
Date: 03/18/2013 03:05 PM
Subject: Re: about fault tolerance in Giraph

How many retries did you set for hadoop map task failures? Might want to try 10?

Avery

On 3/18/13 2:38 PM, Yuanyuan Tian wrote:

Hi Avery,

I was just testing how Giraph handles fault tolerance. I wrote a simple algorithm that ran without a problem.
Then I artificially added a line of code to throw an IOException in the 12th superstep when the task ID is 0001 and the attempt ID is 0000. The job reported the expected IOException, but it could not recover from it. There was no retry of the failed task, even though there were empty map slots left in the cluster. Eventually, the whole job failed after a timeout.

Yuanyuan

From: Avery Ching <ach...@apache.org>
To: user@giraph.apache.org
Date: 03/18/2013 02:09 PM
Subject: Re: about fault tolerance in Giraph

Hi Yuanyuan,

We haven't tested this feature in a while, but it should work. What did the job report about why it failed?

Avery

On 3/18/13 10:22 AM, Yuanyuan Tian wrote:

Can anyone help me answer the question?

Yuanyuan

From: Yuanyuan Tian/Almaden/IBM@IBMUS
To: user@giraph.apache.org
Date: 03/15/2013 02:05 PM
Subject: about fault tolerance in Giraph

Hi,

I was testing the fault tolerance of Giraph on a long-running job. I noticed that when one of the workers threw an exception, the whole job failed without retrying the task, even though I had turned on checkpointing and there were available map slots in my cluster. Why wasn't the fault tolerance mechanism working? I was running a version of Giraph downloaded sometime in June 2012, and I used Netty for the communication layer.

Thanks,
Yuanyuan
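For anyone reproducing this test, the fault-injection predicate described above ("throw an IOException in the 12th superstep for task 0001, attempt 0000") can be factored out like this. This is a hypothetical sketch, not code from the thread: the class and method names are made up, and the key detail is that the check includes the attempt ID, so a retried attempt (0001 or later) would proceed past the failure point.

```java
// Hypothetical helper mirroring the fault-injection test described in the
// thread. FaultInjector and shouldFail are illustrative names, not Giraph API.
public class FaultInjector {

    /**
     * True when the artificial failure should fire: superstep 12,
     * task 0001, and only the first attempt (0000), so that a
     * retried task attempt can make progress past the failure.
     */
    public static boolean shouldFail(long superstep, int taskId, int attemptId) {
        return superstep == 12 && taskId == 1 && attemptId == 0;
    }
}
```

Inside a Giraph compute() implementation, this predicate would guard a `throw new IOException(...)`. With checkpointing enabled (e.g. via `giraph.checkpointFrequency`), a successfully restarted attempt should resume from the last checkpoint rather than from superstep 0.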