Thanks Kristen, I see from another email that someone else is having
trouble with this as well. The problem is in a shim that compensates for
the extra task (the ApplicationMaster) that non-YARN Hadoop does not use.
I think that in the time since my original pre-Hadoop-2.2 implementation,
some things have changed and the old shim isn't working any more.
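For context, the shim has to do something like the following (a rough,
hypothetical sketch - the class and method names are illustrative, not
the actual Giraph code): on the pure-YARN profile one container is taken
by the ApplicationMaster, so the numbering of the BSP tasks has to be
adjusted relative to the MapReduce profile, where every map task is a
BSP task.

    // Hypothetical sketch only - not the real shim. Under YARN, one
    // container runs the ApplicationMaster rather than a BSP task, so a
    // launched container's ordinal must be shifted to get its BSP task ID.
    public final class TaskIdShim {
      private TaskIdShim() { }

      public static int bspTaskId(int containerOrdinal, boolean yarnProfile) {
        // MapReduce profile: map-task ID == BSP task ID, no offset needed.
        // YARN profile: skip the ApplicationMaster container.
        return yarnProfile ? containerOrdinal - 1 : containerOrdinal;
      }
    }

If an offset like that drifts out of sync with how the rest of the code
counts tasks, you get exactly the kind of breakage being reported here.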
The issue is that I think the GIRAPH-747 solution works for us but will
break the non-YARN implementation. We need to sit down and figure out a
better solution, and so far I haven't had time. Thanks for keeping the
conversation going; I'm sure one of you, myself, or Mohammed will sit
down and code up a better fix soon.

Thanks,
Eli

On Thu, Jan 30, 2014 at 1:29 PM, Kristen Hardwick <khardw...@spryinc.com> wrote:

> Eli, Chuan,
>
> Thanks for taking a look into my issue! GIRAPH-747 definitely seems to
> address the exact issue I'm running into, even down to the class I
> thought was causing the problem. I created a bug ticket a few days ago
> (GIRAPH-828) which has the details of my environment, including the
> command I'm running and the full logs where the problem occurs. I just
> linked my ticket to GIRAPH-747, but if it makes sense for me to delete
> mine instead, please let me know.
>
> I will definitely put a comment in there so that people watching it are
> aware of Chuan's patch. Avery Ching was asking me for more information
> in the comments, so he might be able to help validate the solution.
>
> Thanks again,
> Kristen
>
>
> On Wed, Jan 29, 2014 at 9:35 PM, Eli Reisman <apache.mail...@gmail.com> wrote:
>
>> Sorry, I do think this will solve it, and it makes sense that people
>> are encountering the problem when using -w 1. I'll get this reviewed
>> and committed (patch 747).
>>
>> Mohammed, any objections?
>>
>>
>> On Wed, Jan 29, 2014 at 6:22 PM, Chuan Lei <leich...@gmail.com> wrote:
>>
>>> Hi Kristen,
>>>
>>> I had this problem before and submitted a Jira ticket (GIRAPH-747)
>>> with a patch. You may want to take a look at it. Hope that can solve
>>> your problem.
>>>
>>> Thanks,
>>> Chuan
>>>
>>> On Jan 29, 2014, at 9:16 PM, Eli Reisman <apache.mail...@gmail.com>
>>> wrote:
>>>
>>> > Hi Kristen, thanks for posting this. During the port to YARN I
>>> > encountered some race problems with the output sequence. The YARN
>>> > implementation has to handle this a bit differently than the
>>> > non-YARN one, and although we got it figured out at the time, I
>>> > haven't really looked at it in many months, and non-YARN Giraph
>>> > has evolved quickly since then. Wouldn't shock me if there is
>>> > trouble here; as I recall, the solution seemed a bit delicate.
>>> >
>>> > If you have some ideas for a patch, I'd be happy to review them. I
>>> > am pretty strapped for time right now, but if you post a ticket to
>>> > the Giraph JIRA and no one else attempts a patch, I'm sure either
>>> > Mohammed or I will take a swipe at it eventually. Thanks!
>>> >
>>> > Eli
>>> >
>>> >
>>> > On Mon, Jan 20, 2014 at 9:01 AM, Kristen Hardwick <
>>> > khardw...@spryinc.com> wrote:
>>> > Sorry to bug everyone again, but does anyone have any ideas on
>>> > this? Please let me know if I'm leaving out any crucial
>>> > information that could get me some help.
>>> >
>>> > Thanks!
>>> > Kristen
>>> >
>>> >
>>> > On Mon, Jan 13, 2014 at 5:48 PM, Kristen Hardwick <
>>> > khardw...@spryinc.com> wrote:
>>> > Hi all,
>>> >
>>> > I had a very productive day today getting this stuff figured out.
>>> > Unfortunately, it appears that I've stumbled onto a possible race
>>> > condition during the cleanup step of the code for the application.
>>> >
>>> > I put some information here that explains why I think it is a race
>>> > condition: http://pastebin.com/Qswb98dq Basically, I tried the
>>> > exact same command twice, making no other changes - the first time
>>> > it failed and the second time it succeeded.
>>> >
>>> > This makes me think that the LeaseExpiredException/DataStreamerException
>>> > is caused because the files have been cleaned up just before they
>>> > are needed - possibly inside the BspServiceMaster, but I am not at
>>> > all sure about that.
>>> >
>>> > Is anyone already aware of this? Should I log it as a bug? I do
>>> > have access to (DEBUG) logs of both the successful and failed
>>> > attempts if anyone wants to see them.
>>> >
>>> > Thanks,
>>> > Kristen Hardwick
>>> >
>>> >
>>> > On Mon, Jan 13, 2014 at 11:03 AM, Kristen Hardwick <
>>> > khardw...@spryinc.com> wrote:
>>> > Hi Avery (or anyone else who knows),
>>> >
>>> > Could you please give me some details that would help me find the
>>> > past threads that might address this issue? I searched Google with
>>> > various combinations of "giraph datastreamer exception yarn lease
>>> > expired zookeeper" and didn't really come up with anything that
>>> > seemed relevant.
>>> >
>>> > Is it possible that it's just a memory issue on my end? I'm
>>> > running inside a VM - a single-node cluster with 8 GB of memory
>>> > allocated to it. Could that have anything to do with it? Right now
>>> > I'm investigating the code to try to lower the amount of memory
>>> > allocated to the containers.
>>> >
>>> > Thanks,
>>> > Kristen
>>> >
>>> >
>>> > On Fri, Jan 10, 2014 at 8:45 PM, Avery Ching <ach...@apache.org>
>>> > wrote:
>>> > This looks more like the ZooKeeper/YARN issues mentioned in the
>>> > past. Unfortunately, I do not have a YARN instance to test this
>>> > with. Does anyone else have any insights here?
>>> >
>>> >
>>> > On 1/10/14 1:48 PM, Kristen Hardwick wrote:
>>> >> Hi all, I'm requesting help again! I'm trying to get this
>>> >> SimpleShortestPathsComputation example working, but I'm stuck
>>> >> again. Now the job begins to run and seems to work until the
>>> >> final step (it performs 3 supersteps), but the overall job is
>>> >> failing.
>>> >>
>>> >> In the master, among other things, I see:
>>> >>
>>> >> ...
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: setup: Took 0.87 seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: input superstep: Took 0.708 seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: superstep 0: Took 0.158 seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: superstep 1: Took 0.344 seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: superstep 2: Took 0.064 seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: shutdown: Took 0.162 seconds.
>>> >> 14/01/10 15:04:17 INFO master.MasterThread: total: Took 2.31 seconds.
>>> >> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: Master is ready to commit final job output data.
>>> >> 14/01/10 15:04:18 INFO yarn.GiraphYarnTask: Master has committed the final job output data.
>>> >> ...
>>> >>
>>> >> To me, that looks promising - like the job was successful.
>>> >> However, in the WORKER_ONLY containers, I see these things:
>>> >>
>>> >> ...
>>> >> 14/01/10 15:04:17 INFO graph.GraphTaskManager: cleanup: Starting for WORKER_ONLY
>>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_addressesAndPartitions, type=NodeDeleted, state=SyncConnected)
>>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent : partitionExchangeChildrenChanged (at least one worker is done sending partitions)
>>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_superstepFinished, type=NodeDeleted, state=SyncConnected)
>>> >> 14/01/10 15:04:17 INFO netty.NettyClient: stop: reached wait threshold, 1 connections closed, releasing NettyClient.bootstrap resources now.
>>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state changed, checking to see if it needs to restart
>>> >> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already exists (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>>> >> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: [STATUS: task-1] saveVertices: Starting to save 2 vertices using 1 threads
>>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: saveVertices: Starting to save 2 vertices using 1 threads
>>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state changed, checking to see if it needs to restart
>>> >> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already exists (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>>> >> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state path is empty! - /_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState
>>> >> 14/01/10 15:04:17 ERROR zookeeper.ClientCnxn: Error while calling watcher
>>> >> java.lang.NullPointerException
>>> >>         at java.io.StringReader.<init>(StringReader.java:50)
>>> >>         at org.json.JSONTokener.<init>(JSONTokener.java:66)
>>> >>         at org.json.JSONObject.<init>(JSONObject.java:402)
>>> >>         at org.apache.giraph.bsp.BspService.getJobState(BspService.java:716)
>>> >>         at org.apache.giraph.worker.BspServiceWorker.processEvent(BspServiceWorker.java:1563)
>>> >>         at org.apache.giraph.bsp.BspService.process(BspService.java:1095)
>>> >>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>>> >>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_vertexInputSplitsAllReady, type=NodeDeleted, state=SyncConnected)
>>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_addressesAndPartitions, type=NodeDeleted, state=SyncConnected)
>>> >> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent : partitionExchangeChildrenChanged (at least one worker is done sending partitions)
>>> >> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished, type=NodeDeleted, state=SyncConnected)
>>> >> ...
>>> >> 14/01/10 15:04:17 WARN hdfs.DFSClient: DataStreamer Exception
>>> >> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/spry/Shortest/_temporary/1/_temporary/attempt_1389300168420_0024_m_000001_1/part-m-00001: File does not exist. Holder DFSClient_NONMAPREDUCE_-643344145_1 does not have any open files.
>>> >>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
>>> >>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
>>> >>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2480)
>>> >>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
>>> >>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>>> >> ...
>>> >>
>>> >> I apologize for the wall of error message, but I tried to leave
>>> >> in at least some of the parts that might be useful. I put the
>>> >> entire YARN log here: http://tny.cz/af229738
>>> >>
>>> >> Has anyone ever seen this before?
>>> >> This is the command I'm using to run:
>>> >>
>>> >> hadoop jar giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.SplitMasterWorker=false -Dgiraph.zkList="localhost:2181" -Dgiraph.zkSessionMsecTimeout=600000 -Dgiraph.useInputSplitLocality=false org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/spry/input -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/spry/Shortest -w 1
>>> >>
>>> >> My setup is still the same as in the other email, if you saw it:
>>> >>
>>> >> I compiled Giraph with this command, and everything built
>>> >> successfully except "Apache Giraph Distribution", which it
>>> >> doesn't seem like I need:
>>> >>
>>> >> mvn -Phadoop_yarn -Dhadoop.version=2.2.0 -DskipTests clean package
>>> >>
>>> >> I am running with the following components:
>>> >>
>>> >> Single-node cluster
>>> >> Giraph 1.1
>>> >> Hadoop 2.2.0 (Hortonworks)
>>> >> Java 1.7.0_45
>>> >>
>>> >> Thanks in advance,
>>> >> -Kristen Hardwick
>>> >
>>>
>>
>
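A note on the NullPointerException in the worker log above: the trace
shows java.io.StringReader being handed a null string inside org.json's
JSONTokener, which suggests BspService.getJobState() read the
_masterJobState znode after it had been emptied or deleted (see the "Job
state path is empty!" line) and passed the missing payload straight to
the JSON parser. Here is a minimal sketch of a defensive guard, assuming
a plain ZooKeeper getData() call - this is illustrative, not the actual
BspService code:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.json.JSONObject;

    public final class JobStateGuard {
      private JobStateGuard() { }

      // Treat an empty or already-deleted _masterJobState znode as "no
      // state" instead of handing null to new JSONObject(...), which is
      // what produces the NullPointerException in the trace above.
      public static JSONObject readJobState(ZooKeeper zk, String path)
          throws Exception {
        byte[] data;
        try {
          data = zk.getData(path, false, null);
        } catch (KeeperException.NoNodeException e) {
          return null; // znode removed between the watch firing and this read
        }
        if (data == null || data.length == 0) {
          return null; // matches the "Job state path is empty!" log line
        }
        return new JSONObject(new String(data, "UTF-8"));
      }
    }

A guard like this only hides the symptom, though; the underlying cleanup
ordering race is what GIRAPH-747 and GIRAPH-828 are about.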
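And on the earlier question about lowering the memory allocated to the
containers: that should not require code changes. A hedged example,
assuming this build honors the giraph.yarn.task.heap.mb property from
the YARN profile (verify the property name against your Giraph source
before relying on it), capping each task container's heap at 512 MB:

    # Same job as above; only -Dgiraph.yarn.task.heap.mb=512 is new
    # (an assumed property name - check your build).
    hadoop jar giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar \
      org.apache.giraph.GiraphRunner \
      -Dgiraph.SplitMasterWorker=false \
      -Dgiraph.zkList="localhost:2181" \
      -Dgiraph.zkSessionMsecTimeout=600000 \
      -Dgiraph.useInputSplitLocality=false \
      -Dgiraph.yarn.task.heap.mb=512 \
      org.apache.giraph.examples.SimpleShortestPathsComputation \
      -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
      -vip /user/spry/input \
      -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
      -op /user/spry/Shortest -w 1

On an 8 GB single-node VM that leaves headroom for the NameNode,
DataNode, ResourceManager, and ZooKeeper running alongside the Giraph
containers.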