[ https://issues.apache.org/jira/browse/GIRAPH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884282#comment-13884282 ]
Avery Ching commented on GIRAPH-828:
------------------------------------

This is strange. Does it happen even without the -Dgiraph.cleanupCheckpointsAfterSuccess=false option? You shouldn't need that option, since you are not enabling checkpoints.

> Race condition during Giraph cleanup phase
> ------------------------------------------
>
>                 Key: GIRAPH-828
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-828
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>         Environment: Giraph 1.1,
>                      Hadoop 2.2.0,
>                      Java 1.7.0_45
>            Reporter: Kristen Hardwick
>             Fix For: 1.1.0
>
>
> Running the exact same launch command twice, making no other changes, produces different completion results. For example, the first time the application will fail, and the second time it will succeed. As proof, this is what happened when I tried to run the SimpleShortestPathsComputation example: PasteBin Link. This happens consistently, although the job fails much more often than it succeeds.
> The PageRank example has the same issue. In fact, the timing problem is even more obvious there.
> I followed the directions [here|http://marsty5.com/2013/05/29/run-example-in-giraph-pagerank/] and ran the SimplePageRankComputation example with this command:
> {code}
> hadoop jar giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.cleanupCheckpointsAfterSuccess=false -Dgiraph.logLevel=DEBUG -Dgiraph.SplitMasterWorker=false -Dgiraph.zkList="localhost:2181" -Dgiraph.zkSessionMsecTimeout=600000 -Dgiraph.useInputSplitLocality=false org.apache.giraph.examples.SimplePageRankComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/spry/input -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/spry/PageRank -w 2 -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute
> {code}
> The job technically failed, but I did get output from part file 1 (I expected to have values printed for all vertices between 0 and 4):
> {code}
> 0 0.16682289373110673
> 4 0.17098446073203233
> 2 0.17098446073203233
> {code}
> I ran the exact same command again (with no changes to the environment except for deleting the /user/spry/PageRank HDFS directory) and got no part files. I ran it one more time and got only the data from part file 2:
> {code}
> 1 0.24178880797750438
> 3 0.24178880797750438
> {code}
> I tried a few more times, but I haven't been able to see both part files in the output directory yet.
> In the logs, I see hopeful things like this:
> {code}
> 14/01/22 09:47:48 INFO master.MasterThread: setup: Took 3.144 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: input superstep: Took 2.582 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: superstep 0: Took 0.827 seconds.
> ...
> 14/01/22 09:47:48 INFO master.MasterThread: superstep 30: Took 0.56 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: shutdown: Took 2.591 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: total: Took 30.18 seconds.
> 14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master is ready to commit final job output data.
> {code}
> and like this:
> {code}
> 14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master has committed the final job output data.
> 14/01/22 09:47:48 DEBUG ipc.Client: Stopping client
> 14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to hadoop2.j7.master/127.0.0.1:8020 from yarn: closed
> 14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to hadoop2.j7.master/127.0.0.1:8020 from yarn: stopped, remaining connections 0
> {code}
> Really, only one of the containers even fails, and it fails with a DataStreamer/LeaseExpired exception saying that the part file no longer exists. This log is from the run where part file 2 was not written out:
> {code}
> 14/01/22 09:47:48 WARN hdfs.DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/spry/PageRank/_temporary/1/_temporary/attempt_1389643303411_0029_m_000002_1/part-m-00002: File does not exist. Holder DFSClient_NONMAPREDUCE_1153765281_1 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
> ...
> 14/01/22 09:47:48 ERROR worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2 on superstep 30
> 14/01/22 09:47:48 DEBUG zookeeper.ClientCnxn: Reading reply sessionid:0x1438d139efc0039, packet:: clientPath:null serverPath:null finished:false header:: 589,2 replyHeader:: 589,13968,-101 request:: '/_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2,-1 response:: null
> 14/01/22 09:47:48 ERROR graph.GraphTaskManager: run: Worker failure failed on another RuntimeException, original expection will be rethrown
> java.lang.IllegalStateException: unregisterHealth: KeeperException - Couldn't delete /_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2
>     at org.apache.giraph.worker.BspServiceWorker.unregisterHealth(BspServiceWorker.java:656)
> {code}

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)