Hi, I have a question. I'm using Giraph 1.0.0 with hadoop-0.20.203.0.
I saw a case where a worker could not respond to the master because of a slow connection. It happened while the aggregation was being sent. After the master had waited for some time, the worker was suddenly killed by the JobTracker. Here is the log:

2014-10-21 10:25:31,708 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201410210948_0001_m_000006_0: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 134.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
2014-11-18 10:25:34,723 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201410210948_0001_m_000006_0'

What confuses me more is that I did not see the master run the checkpoint process here. Instead, the superstep just failed and the master was also killed by the JobTracker:

2014-10-21 10:32:54,151 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 1 took 2054.184 seconds ended with state WORKER_FAILURE and is now on superstep 1
2014-10-21 10:32:54,929 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with RuntimeException
java.lang.RuntimeException: restartFromCheckpoint: KeeperException
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201410210948_0001/_edgeInputSplitDir
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
    at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
    at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1179)
    ... 1 more

Sometimes the JobTracker can assign the job to another worker, but in my case this does not always succeed. My question is: does the master have any role in this case? I did not see any recovery (checkpoint) from the master in my case.

Thanks

Regards,
Vincentius Martin