Hi all,

I had a very productive day today getting this stuff figured out.
Unfortunately, it appears that I've stumbled onto a possible race condition
in the application's cleanup step.

I posted some information at http://pastebin.com/Qswb98dq that explains why
I think it is a race condition. Basically, I ran the exact same command
twice with no other changes: the first time it failed and the second time
it succeeded.

This makes me think that the LeaseExpiredException (logged as a DataStreamer
Exception) occurs because the files are cleaned up just before they are
needed. The cleanup may happen inside the BspServiceMaster, but I am not at
all sure about that.
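
To make the suspected ordering problem concrete, here is a minimal Java
sketch of the ordering I would expect, where cleanup waits until the worker
has finished saving its output. This is not Giraph code: CleanupOrdering,
runWithBarrier, and the temp directory layout are hypothetical names I made
up for illustration, and it uses the local file system rather than HDFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; none of these names come from Giraph.
final class CleanupOrdering {

    /**
     * Runs one "worker" that saves output and one "cleanup" task.
     * Returns true if the output was written and the temp dir was removed.
     */
    public static boolean runWithBarrier() {
        try {
            Path attemptDir = Files.createTempDirectory("attempt_");
            Path partFile = attemptDir.resolve("part-m-00001");
            CountDownLatch workerDone = new CountDownLatch(1);
            AtomicBoolean wroteOk = new AtomicBoolean(false);

            Thread worker = new Thread(() -> {
                try {
                    Files.write(partFile,
                        "vertex output".getBytes(StandardCharsets.UTF_8));
                    wroteOk.set(true);
                } catch (IOException e) {
                    wroteOk.set(false);
                } finally {
                    // Signal completion whether or not the write succeeded.
                    workerDone.countDown();
                }
            });

            Thread cleanup = new Thread(() -> {
                try {
                    // The barrier: do not delete until the worker is done.
                    workerDone.await();
                    Files.deleteIfExists(partFile);
                    Files.deleteIfExists(attemptDir);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });

            // Start cleanup first on purpose; the latch still enforces order.
            cleanup.start();
            worker.start();
            worker.join();
            cleanup.join();
            return wroteOk.get() && !Files.exists(attemptDir);
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If the real cleanup does not wait on something equivalent to the latch, a
delete could land while a worker still has the file open, which would be
consistent with the "No lease" message. That is only a guess on my part.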

Is anyone already aware of this? Should I log it as a bug? I do have access
to (DEBUG) logs of both the successful and failed attempts if anyone wants
to see them.
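
One more observation from the worker log quoted below: the
NullPointerException is thrown from the StringReader constructor underneath
JSONObject, right after the "Job state path is empty!" message, which
suggests the job state string handed to the JSON parser was null. The sketch
below only illustrates that failure mode and one possible guard.
JobStateGuard and its methods are hypothetical names of mine, not part of
Giraph, and I have not checked how BspService actually reads the state.

```java
import java.io.StringReader;
import java.util.Optional;

// Hypothetical helper for illustration; not taken from Giraph.
final class JobStateGuard {

    /** Returns the raw state if it is safe to hand to a JSON parser. */
    public static Optional<String> usableState(String rawState) {
        // A null or empty string means there is no state to parse.
        return Optional.ofNullable(rawState).filter(s -> !s.isEmpty());
    }

    /** Shows that StringReader rejects a null String, as in the trace. */
    public static boolean nullStringThrows() {
        try {
            new StringReader((String) null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```

A caller could skip the restart check when usableState returns an empty
Optional instead of letting the watcher thread fail.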

Kristen Hardwick

On Mon, Jan 13, 2014 at 11:03 AM, Kristen Hardwick <khardw...@spryinc.com> wrote:

> Hi Avery (or anyone else that knows),
> Could you please give me some details that would help me find the past
> threads that might address this issue? I searched Google with various
> combinations of "giraph datastreamer exception yarn lease expired
> zookeeper" and didn't really come up with anything that seemed relevant.
> Is it possible that it's just a memory issue on my end? I'm running inside
> a VM - a single node cluster with 8 GB of memory allocated to it. Could
> that have anything to do with it? Right now I'm investigating the code to
> try to lower the amount of memory allocated to the containers.
> Thanks,
> Kristen
> On Fri, Jan 10, 2014 at 8:45 PM, Avery Ching <ach...@apache.org> wrote:
>>  This looks more like the Zookeeper/YARN issues mentioned in the past.
>> Unfortunately, I do not have a YARN instance to test this with.  Does
>> anyone else have any insights here?
>> On 1/10/14 1:48 PM, Kristen Hardwick wrote:
>>  Hi all, I'm requesting help again! I'm trying to get this
>> SimpleShortestPathsComputation example working, but I'm stuck again. Now
>> the job begins to run and seems to work until the final step (it performs 3
>> supersteps), but the overall job is failing.
>>  In the master, among other things, I see:
>>  ...
>>  14/01/10 15:04:17 INFO master.MasterThread: setup: Took 0.87 seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: input superstep: Took 0.708
>> seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 0: Took 0.158
>> seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 1: Took 0.344
>> seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 2: Took 0.064
>> seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: shutdown: Took 0.162 seconds.
>> 14/01/10 15:04:17 INFO master.MasterThread: total: Took 2.31 seconds.
>> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: Master is ready to commit
>> final job output data.
>> 14/01/10 15:04:18 INFO yarn.GiraphYarnTask: Master has committed the
>> final job output data.
>>  ...
>>  To me, that looks promising - like the job was successful. However, in
>> the WORKER_ONLY containers, I see these things:
>>  ...
>>  14/01/10 15:04:17 INFO graph.GraphTaskManager: cleanup: Starting for
>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>> event
>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_addressesAndPartitions,
>> type=NodeDeleted, state=SyncConnected)
>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent :
>> partitionExchangeChildrenChanged (at least one worker is done sending
>> partitions)
>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>> event
>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_superstepFinished,
>> type=NodeDeleted, state=SyncConnected)
>> 14/01/10 15:04:17 INFO netty.NettyClient: stop: reached wait threshold, 1
>> connections closed, releasing NettyClient.bootstrap resources now.
>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state
>> changed, checking to see if it needs to restart
>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already
>> exists
>> (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: [STATUS: task-1]
>> saveVertices: Starting to save 2 vertices using 1 threads
>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: saveVertices: Starting to
>> save 2 vertices using 1 threads
>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state
>> changed, checking to see if it needs to restart
>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already
>> exists
>> (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state path is
>> empty! -
>> /_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState
>> 14/01/10 15:04:17 ERROR zookeeper.ClientCnxn: Error while calling watcher
>> java.lang.NullPointerException
>>         at java.io.StringReader.<init>(StringReader.java:50)
>>         at org.json.JSONTokener.<init>(JSONTokener.java:66)
>>         at org.json.JSONObject.<init>(JSONObject.java:402)
>>         at
>> org.apache.giraph.bsp.BspService.getJobState(BspService.java:716)
>>         at
>> org.apache.giraph.worker.BspServiceWorker.processEvent(BspServiceWorker.java:1563)
>>         at org.apache.giraph.bsp.BspService.process(BspService.java:1095)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>>         at
>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>> event
>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_vertexInputSplitsAllReady,
>> type=NodeDeleted, state=SyncConnected)
>>  14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>> event
>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_addressesAndPartitions,
>> type=NodeDeleted, state=SyncConnected)
>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent :
>> partitionExchangeChildrenChanged (at least one worker is done sending
>> partitions)
>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed
>> event
>> (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
>> type=NodeDeleted, state=SyncConnected)
>> ...
>> 14/01/10 15:04:17 WARN hdfs.DFSClient: DataStreamer Exception
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>> No lease on
>> /user/spry/Shortest/_temporary/1/_temporary/attempt_1389300168420_0024_m_000001_1/part-m-00001:
>> File does not exist. Holder DFSClient_NONMAPREDUCE_-643344145_1 does not
>> have any open files.
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2480)
>>         at
>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
>>         at
>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>>  ...
>>  I apologize for the wall of error messages, but I tried to leave in at
>> least some of the parts that might be useful. I put the entire YARN log
>> here: http://tny.cz/af229738
>>  Has anyone ever seen this before? This is the command I'm using to run:
>>  hadoop jar
>> giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar
>> org.apache.giraph.GiraphRunner -Dgiraph.SplitMasterWorker=false
>> -Dgiraph.zkList="localhost:2181" -Dgiraph.zkSessionMsecTimeout=600000
>> -Dgiraph.useInputSplitLocality=false
>> org.apache.giraph.examples.SimpleShortestPathsComputation -vif
>> org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
>> -vip /user/spry/input -vof
>> org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
>> /user/spry/Shortest -w 1
>>  My setup is still the same as the other email if you saw it:
>>  I compiled Giraph with this command, and everything built successfully
>> except "Apache Giraph Distribution", which I don't seem to need:
>> mvn -Phadoop_yarn -Dhadoop.version=2.2.0 -DskipTests clean package
>> I am running with the following components:
>>  Single node cluster
>>  Giraph 1.1
>>  Hadoop 2.2.0 (Hortonworks)
>>  Java 1.7.0_45
>>  Thanks in advance,
>> -Kristen Hardwick

