Sergey Edunov created GIRAPH-972:
------------------------------------
Summary: Race condition in checkpointing
Key: GIRAPH-972
URL: https://issues.apache.org/jira/browse/GIRAPH-972
Project: Giraph
Issue Type: Bug
Reporter: Sergey Edunov
A couple of issues noticed with checkpointing of large jobs:
1) The task ID of the master appears to matter. In most cases it is 0, but
sometimes it is not, and since we cannot control it, checkpointing should not
depend on it.
2) A race condition occurs on the master when a worker dies:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201411061513.38895_0001/_applicationAttemptsDir/0/_superstepDir/9/_workerHealthyDir/hadoop4921.prn2.facebook.com_3
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
    at org.apache.giraph.zk.ZooKeeperExt.getData(ZooKeeperExt.java:470)
    at org.apache.giraph.utils.WritableUtils.readFieldsFromZnode(WritableUtils.java:126)
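
To illustrate the kind of race the trace above suggests, here is a minimal sketch. It assumes the worker's znode disappeared between the moment the master listed the healthy workers and the moment it read the znode's data; the report itself does not confirm this. Everything in it (FakeZk, NoNodeException, readIfPresent) is a hypothetical in-memory stand-in, not Giraph or ZooKeeper API, and it is not the actual fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class TolerantZnodeRead {
    /** Hypothetical stand-in for ZooKeeper's NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("NoNode for " + path);
        }
    }

    /** Hypothetical in-memory stand-in for a znode tree (path -> data). */
    static class FakeZk {
        private final Map<String, byte[]> nodes = new ConcurrentHashMap<>();

        void create(String path, byte[] data) {
            nodes.put(path, data);
        }

        void delete(String path) {
            nodes.remove(path);
        }

        /** Returns the full paths of all nodes stored under the given parent. */
        List<String> getChildren(String parent) {
            List<String> children = new ArrayList<>();
            String prefix = parent + "/";
            for (String path : nodes.keySet()) {
                if (path.startsWith(prefix)) {
                    children.add(path);
                }
            }
            return children;
        }

        /** Throws if the node is gone, like a real getData call would. */
        byte[] getData(String path) throws NoNodeException {
            byte[] data = nodes.get(path);
            if (data == null) {
                throw new NoNodeException(path);
            }
            return data;
        }
    }

    /** Treats a vanished node as "worker gone" instead of a fatal error. */
    static Optional<byte[]> readIfPresent(FakeZk zk, String path) {
        try {
            return Optional.of(zk.getData(path));
        } catch (NoNodeException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        String dir = "/_workerHealthyDir";
        zk.create(dir + "/worker_1", new byte[] {1});
        zk.create(dir + "/worker_2", new byte[] {2});

        List<String> listed = zk.getChildren(dir); // master lists healthy workers
        zk.delete(dir + "/worker_2");              // a worker dies after the listing

        int readable = 0;
        for (String path : listed) {
            if (readIfPresent(zk, path).isPresent()) {
                readable++;
            }
        }
        System.out.println("workers still readable: " + readable);
    }
}
```

In this sketch the master skips the vanished worker rather than aborting. Whether skipping is the right reaction during checkpointing is a design decision for the actual patch.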