[ 
https://issues.apache.org/jira/browse/GIRAPH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884282#comment-13884282
 ] 

Avery Ching commented on GIRAPH-828:
------------------------------------

This is strange.  Does it happen even without the 
-Dgiraph.cleanupCheckpointsAfterSuccess=false option?  You shouldn't need this 
since you are not enabling checkpoints.

> Race condition during Giraph cleanup phase
> ------------------------------------------
>
>                 Key: GIRAPH-828
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-828
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>         Environment: Giraph 1.1,
> Hadoop 2.2.0,
> Java 1.7.0_45
>            Reporter: Kristen Hardwick
>             Fix For: 1.1.0
>
>
> Running the exact same launch command twice, making no other changes, has 
> different completion results. For example the first time the application will 
> fail, and the second time it will succeed. Just for proof, this is what 
> happened when I tried to run the SimpleShortestPathsComputation example: 
> PasteBin Link. This happens consistently, although the job does fail much 
> more often than it succeeds.
> The PageRank example also has the same issue. In fact, the timing issue is 
> even more obvious there. I followed directions 
> [here|http://marsty5.com/2013/05/29/run-example-in-giraph-pagerank/] and ran 
> the SimplePageRankComputation example with this command:
> {code}
> hadoop jar 
> giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar
>  org.apache.giraph.GiraphRunner -Dgiraph.cleanupCheckpointsAfterSuccess=false 
> -Dgiraph.logLevel=DEBUG -Dgiraph.SplitMasterWorker=false 
> -Dgiraph.zkList="localhost:2181" -Dgiraph.zkSessionMsecTimeout=600000 
> -Dgiraph.useInputSplitLocality=false 
> org.apache.giraph.examples.SimplePageRankComputation -vif 
> org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip 
> /user/spry/input -vof 
> org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op 
> /user/spry/PageRank -w 2 -mc 
> org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute
> {code}
> The job technically failed, but I did get output from part file 1 (I expected 
> to have values printed for all vertices between 0 and 4).
> {code}
> 0    0.16682289373110673
> 4    0.17098446073203233
> 2    0.17098446073203233
> {code}
> I ran the exact same command again (with no changes to the environment except 
> for deleting the /user/spry/PageRank HDFS directory) and got no part files. I 
> ran it one more time and got only the data from part file 2:
> {code}
> 1    0.24178880797750438
> 3    0.24178880797750438
> {code}
> I tried a few more times, but I haven't been able to see both part files in 
> the output directory yet.
> In the logs, I see hopeful things like this:
> {code}
> 14/01/22 09:47:48 INFO master.MasterThread: setup: Took 3.144 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: input superstep: Took 2.582 
> seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: superstep 0: Took 0.827 seconds.
> ...
> 14/01/22 09:47:48 INFO master.MasterThread: superstep 30: Took 0.56 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: shutdown: Took 2.591 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: total: Took 30.18 seconds.
> 14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master is ready to commit final 
> job output data.
> {code}
> and like this:
> {code}
> 14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master has committed the final 
> job output data.
> 14/01/22 09:47:48 DEBUG ipc.Client: Stopping client
> 14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to 
> hadoop2.j7.master/127.0.0.1:8020 from yarn: closed
> 14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to 
> hadoop2.j7.master/127.0.0.1:8020 from yarn: stopped, remaining connections 0
> {code}
> Really only one of the containers even fails. And it's with a 
> DataStreamer/LeaseExpired exception saying that the part file no longer 
> exists. This log is from the run where part file 2 was not written out:
> {code}
> 14/01/22 09:47:48 WARN hdfs.DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on 
> /user/spry/PageRank/_temporary/1/_temporary/attempt_1389643303411_0029_m_000002_1/part-m-00002:
>  File does not exist. Holder DFSClient_NONMAPREDUCE_1153765281_1 does not 
> have any open files.
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
> ...
> 14/01/22 09:47:48 ERROR worker.BspServiceWorker: unregisterHealth: Got 
> failure, unregistering health on 
> /_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2
>  on superstep 30
> 14/01/22 09:47:48 DEBUG zookeeper.ClientCnxn: Reading reply 
> sessionid:0x1438d139efc0039, packet:: clientPath:null serverPath:null 
> finished:false header:: 589,2  replyHeader:: 589,13968,-101  request:: 
> '/_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2,-1
>   response:: null
> 14/01/22 09:47:48 ERROR graph.GraphTaskManager: run: Worker failure failed on 
> another RuntimeException, original expection will be rethrown
> java.lang.IllegalStateException: unregisterHealth: KeeperException - Couldn't 
> delete 
> /_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2
>         at 
> org.apache.giraph.worker.BspServiceWorker.unregisterHealth(BspServiceWorker.java:656)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to