Sergey Edunov created GIRAPH-972:
------------------------------------

             Summary: Race condition in checkpointing
                 Key: GIRAPH-972
                 URL: https://issues.apache.org/jira/browse/GIRAPH-972
             Project: Giraph
          Issue Type: Bug
            Reporter: Sergey Edunov

A couple of issues noticed with checkpointing of large jobs:

1) The task ID of the master appears to be important. In most cases it is 0, but sometimes it is not, and since we cannot control it, checkpointing should not depend on it.

2) A race condition happens on the master when a worker dies:

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201411061513.38895_0001/_applicationAttemptsDir/0/_superstepDir/9/_workerHealthyDir/hadoop4921.prn2.facebook.com_3
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
	at org.apache.giraph.zk.ZooKeeperExt.getData(ZooKeeperExt.java:470)
	at org.apache.giraph.utils.WritableUtils.readFieldsFromZnode(WritableUtils.java:126)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)