Hi, I'm using Giraph 1.0.0 and I ran its RandomMessageBenchmark.
In the middle of the run I killed a Hadoop task (= a worker). The job then failed with the following exception on the master:

2014-11-29 04:40:18,049 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 1 took 611.669 seconds ended with state WORKER_FAILURE and is now on superstep 1
2014-11-29 04:40:18,313 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with RuntimeException
java.lang.RuntimeException: restartFromCheckpoint: KeeperException
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201411290417_0003/_edgeInputSplitDir
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
    at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1179)
    ... 1 more
2014-11-29 04:40:18,315 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, msg = java.lang.RuntimeException: restartFromCheckpoint: KeeperException, exiting...
java.lang.IllegalStateException: java.lang.RuntimeException: restartFromCheckpoint: KeeperException
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:181)
Caused by: java.lang.RuntimeException: restartFromCheckpoint: KeeperException
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201411290417_0003/_edgeInputSplitDir
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
    at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1179)

Is this a bug in Giraph? From the log, it looks like the master tried to run restartFromCheckpoint and failed because the ZooKeeper node it was trying to delete did not exist. How can I enable checkpointing in Giraph?

Thanks

Regards,
Vincentius Martin