[ https://issues.apache.org/jira/browse/GIRAPH-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252615#comment-14252615 ]
Hudson commented on GIRAPH-972:
-------------------------------
ABORTED: Integrated in Giraph-trunk-Commit #1507 (See
[https://builds.apache.org/job/Giraph-trunk-Commit/1507/])
GIRAPH-972 Race condition in checkpointing (edunov:
http://git-wip-us.apache.org/repos/asf?p=giraph.git&a=commit&h=7f2d58445e2353a1a42fbb4282ed5cad724186b5)
* giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java
* CHANGELOG
* giraph-core/src/main/java/org/apache/giraph/worker/BspServiceWorker.java
* giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java
> Race condition in checkpointing
> -------------------------------
>
> Key: GIRAPH-972
> URL: https://issues.apache.org/jira/browse/GIRAPH-972
> Project: Giraph
> Issue Type: Bug
> Reporter: Sergey Edunov
>
> A couple of issues were noticed with checkpointing of large jobs:
> 1) The task ID of the master appears to matter. In most cases it is 0, but
> sometimes it is not, and since we cannot control it, checkpointing should
> not depend on it.
> 2) A race condition occurs on the master when a worker dies:
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201411061513.38895_0001/_applicationAttemptsDir/0/_superstepDir/9/_workerHealthyDir/hadoop4921.prn2.facebook.com_3
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1180)
> at org.apache.giraph.zk.ZooKeeperExt.getData(ZooKeeperExt.java:470)
> at org.apache.giraph.utils.WritableUtils.readFieldsFromZnode(WritableUtils.java:126)
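
The trace above suggests a list-then-read race: the master holds the path of a worker's healthy znode, the worker dies and its znode disappears, and the later getData call fails with NoNode. Below is a minimal sketch of one way to tolerate that, assuming an in-memory stand-in for ZooKeeper. FakeZk, NoNodeException, readIfPresent and the /workers paths are hypothetical names made up for illustration; this is not the change committed for GIRAPH-972, and it does not use Giraph or ZooKeeper classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class HealthyWorkerScan {
    /** Hypothetical stand-in for ZooKeeper's NoNode error. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("NoNode for " + path);
        }
    }

    /** Hypothetical in-memory stand-in for a ZooKeeper client. */
    static class FakeZk {
        private final Map<String, byte[]> nodes = new ConcurrentHashMap<>();

        void create(String path, byte[] data) {
            nodes.put(path, data);
        }

        void delete(String path) {
            nodes.remove(path);
        }

        /** Returns a snapshot of the paths that exist right now. */
        List<String> listPaths() {
            return new ArrayList<>(nodes.keySet());
        }

        byte[] getData(String path) throws NoNodeException {
            byte[] data = nodes.get(path);
            if (data == null) {
                throw new NoNodeException(path);
            }
            return data;
        }
    }

    /** Reads a znode, treating "node vanished" as "worker is gone". */
    static Optional<byte[]> readIfPresent(FakeZk zk, String path) {
        try {
            return Optional.of(zk.getData(path));
        } catch (NoNodeException e) {
            // The worker died after its path was listed; skip it.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        zk.create("/workers/host_1", new byte[] {1});
        zk.create("/workers/host_3", new byte[] {3});

        List<String> listed = zk.listPaths();
        zk.delete("/workers/host_3"); // worker dies between listing and read

        int alive = 0;
        for (String path : listed) {
            if (readIfPresent(zk, path).isPresent()) {
                alive++;
            }
        }
        System.out.println("workers still healthy: " + alive);
    }
}
```

Whether skipping a vanished worker is the right reaction depends on what the master does next; the sketch only shows that the missing node can be handled as an expected outcome rather than a fatal error.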