creating hive packages for spark
Hello Spark developers, I want to understand the procedure to create the org.spark-project.hive jars. Is this documented somewhere? I am having issues with -Phive-provided with my private hive13 jars and want to check if using spark's procedure helps.
Re: creating hive packages for spark
Hi, you can build the spark-project Hive fork from here: https://github.com/pwendell/hive/tree/0.13.1-shaded-protobuf Hope this helps.
Re: Plans for upgrading Hive dependency?
Thanks Marcelo and Patrick - I don't know how I missed that ticket in my JIRA search earlier. Is anybody working on the sub-issues yet, or is there a design doc I should look at before taking a stab? Regards, Punya

On Mon, Apr 27, 2015 at 3:56 PM Patrick Wendell pwend...@gmail.com wrote: Hey Punya, There is some ongoing work to help make Hive upgrades more manageable and allow us to support multiple versions of Hive. Once we do that, it will be much easier for us to upgrade. https://issues.apache.org/jira/browse/SPARK-6906 - Patrick
Re: Plans for upgrading Hive dependency?
That's a lot more complicated than you might think. We've done some basic work to get HiveContext to compile against Hive 1.1.0. Here's the code: https://github.com/cloudera/spark/commit/00e2c7e35d4ac236bcfbcd3d2805b483060255ec We didn't send that upstream because that only solves half of the problem; the hive-thriftserver is disabled in our CDH build because it uses a lot of Hive APIs that have been removed in 1.1.0, so even getting it to compile is really complicated. If there's interest in getting the HiveContext part fixed up, I can send a PR for that code. But at this time I don't really have plans to look at the thrift server. -- Marcelo
Re: Is there any particular reason why there's no Java counterpart in Streaming Guide's Design Patterns for using foreachRDD section?
My guess is, since it says "for example (in Scala)", that this started as Scala-only, Python was tacked on as a one-off, and Java never got added. I think you'd be welcome to add it. It's not an obscure example, and one people might want to see in Java.

On Mon, Apr 27, 2015 at 4:34 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, Is there any particular reason why there's no Java counterpart in the Streaming Guide's Design Patterns for using foreachRDD section? https://spark.apache.org/docs/latest/streaming-programming-guide.html Up to that point, each source code example includes corresponding Java (and sometimes Python) source code for the Scala examples, but in the section Design Patterns for using foreachRDD, the code examples are only in Scala and Python. After that section comes DataFrame and SQL Operations, and it continues giving examples in Scala, Java, and Python. The reason I'm asking: if there's no particular reason, maybe I can open a JIRA ticket and contribute to that part of the documentation? -- Emre Sevinç
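For reference while someone writes the Java version: the pattern that section teaches is acquiring connections inside foreachPartition on the executors, rather than once on the driver. A minimal Scala sketch of that idea (ConnectionPool is a stub standing in for a real pooling library, and the socket source and port are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Stand-in for a real connection pool; here a "connection" is just stdout.
    object ConnectionPool {
      def getConnection(): java.io.PrintStream = System.out
      def returnConnection(conn: java.io.PrintStream): Unit = ()
    }

    val ssc = new StreamingContext(new SparkConf().setAppName("ForeachRDDPattern"), Seconds(1))
    val dstream = ssc.socketTextStream("localhost", 9999)

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // Acquire the connection on the executor, not the driver, so it never
        // has to be serialized into the closure; reuse it for the whole partition.
        val connection = ConnectionPool.getConnection()
        partitionOfRecords.foreach(record => connection.println(record))
        ConnectionPool.returnConnection(connection) // return for reuse across batches
      }
    }

    ssc.start()
    ssc.awaitTermination()

A Java port would follow the same shape inside JavaDStream.foreachRDD.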
Exception in using updateStateByKey
Hi, all: I use the function updateStateByKey in Spark Streaming. I need to store the states for one minute, so I set spark.cleaner.ttl to 120 and the batch duration to 2 seconds, but it throws this exception:

    Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)
        at org.apache.hadoop.ipc.Client.call(Client.java:1347)
        at org.apache.hadoop.ipc.Client.call(Client.java:1300)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
        at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)

Why? My code is:

    ssc = StreamingContext(sc, 2)
    kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})
    kvs.window(60, 2).map(lambda x: analyzeMessage(x[1])) \
        .filter(lambda x: x[1] != None).updateStateByKey(updateStateFunc) \
        .filter(lambda x: x[1]['isExisted'] != 1) \
        .foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))
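One thing worth checking: updateStateByKey requires checkpointing to be enabled, and Spark Streaming cleans up its own state and received data through that checkpointing machinery, so an aggressive spark.cleaner.ttl of 120 may be deleting received blocks that the 60-second window still references. A minimal Scala sketch of the usual setup, without setting cleaner.ttl (the path, source, and update function are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulWordCount {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("StatefulWordCount"), Seconds(2))
        // Required by updateStateByKey; streaming state is checkpointed here and
        // cleaned up by Spark itself, so spark.cleaner.ttl is not needed for this.
        ssc.checkpoint("hdfs:///tmp/spark-ck") // placeholder path

        // Merge this batch's values for a key into the key's running count.
        val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
          Some(newValues.sum + state.getOrElse(0))

        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
        words.map((_, 1)).updateStateByKey[Int](updateFunc).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }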
java.lang.StackOverflowError when recovery from checkpoint in Streaming
Hi everyone, I am using

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

to read data from Kafka (1k/second) and store the data in windows; the code snippet is as follows:

    val windowedStreamChannel = streamChannel
      .combineByKey[TreeSet[Obj]](TreeSet[Obj](_), _ += _, _ ++= _, new HashPartitioner(numPartition))
      .reduceByKeyAndWindow(
        (x: TreeSet[Obj], y: TreeSet[Obj]) => x ++= y,
        (x: TreeSet[Obj], y: TreeSet[Obj]) => x --= y,
        Minutes(60), Seconds(2), numPartition,
        (item: (String, TreeSet[Obj])) => item._2.size != 0)

After the application had run for an hour, I killed it and restarted it from the checkpoint directory, but I encountered this exception:

    2015-04-27 17:52:40,955 INFO [Driver] - Slicing from 1430126222000 ms to 1430126222000 ms (aligned to 1430126222000 ms and 1430126222000 ms)
    2015-04-27 17:52:40,958 ERROR [Driver] - User class threw exception: null
    java.lang.StackOverflowError
        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
        at java.io.File.exists(File.java:813)
        at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1080)
        at sun.misc.URLClassPath.getResource(URLClassPath.java:199)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:358)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:190)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
        at org.apache.spark.rdd.RDD.filter(RDD.scala:303)
        at org.apache.spark.streaming.dstream.FilteredDStream$$anonfun$compute$1.apply(FilteredDStream.scala:35)
        at org.apache.spark.streaming.dstream.FilteredDStream$$anonfun$compute$1.apply(FilteredDStream.scala:35)
        at scala.Option.map(Option.scala:145)
        at org.apache.spark.streaming.dstream.FilteredDStream.compute(FilteredDStream.scala:35)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
        at scala.Option.orElse(Option.scala:257)
        at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
        at org.apache.spark.streaming.dstream.FlatMappedDStream.compute(FlatMappedDStream.scala:35)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
        at scala.Option.orElse(Option.scala:257)
        at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
        at org.apache.spark.streaming.dstream.FilteredDStream.compute(FilteredDStream.scala:35)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
        at scala.Option.orElse(Option.scala:257)
        at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
        at org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
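Not a diagnosis of this exact trace, but a StackOverflowError while recomputing DStreams after a restart usually means the recovered lineage is very deep. The standard mitigations are to recover via StreamingContext.getOrCreate and to set an explicit, reasonably short checkpoint interval on the expensive stateful stream so its lineage is truncated. A sketch under those assumptions, with a placeholder source and paths and a simplified reduce in place of the TreeSet logic:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/stream-ck" // placeholder path

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("WindowedAgg"), Seconds(2))
      ssc.checkpoint(checkpointDir)

      val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))
      val windowed = pairs.reduceByKeyAndWindow(_ + _, _ - _, Minutes(60), Seconds(2))
      // Truncate lineage more often than the default so the graph rebuilt on
      // recovery is shallower; 5-10x the batch interval is the usual guidance.
      windowed.checkpoint(Seconds(20))
      windowed.print()
      ssc
    }

    // On restart this rebuilds from the checkpoint instead of replaying the
    // whole DStream graph from scratch.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
    // If recovery is still too deep, a larger driver thread stack can help:
    // spark-submit --conf spark.driver.extraJavaOptions=-Xss8m ...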
Re: Design docs: consolidation and discoverability
I like the idea of having design docs be kept up to date and tracked in git. If the Apache repo isn't a good fit, perhaps we can have a separate repo just for design docs? Maybe something like github.com/spark-docs/spark-docs/ ? If there's other stuff we want to track but haven't, perhaps we can generalize the purpose of the repo a bit and rename it accordingly (e.g. spark-misc/spark-misc). Nick
Re: Design docs: consolidation and discoverability
Nick, I like your idea of keeping it in a separate git repository. It seems to combine the advantages of the present Google Docs approach with the crisper history, discoverability, and text format simplicity of GitHub wikis. Punya
Re: Design docs: consolidation and discoverability
My only issue with Google Docs is that they're mutable, so it's difficult to follow a design's history through its revisions and link up JIRA comments with the relevant version. -Sandy

On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran ste...@hortonworks.com wrote: One thing to consider is that while docs as PDFs in JIRAs do document the original proposal, that's not the place to keep living specifications. That stuff needs to live in SCM, in a format which can be easily maintained, can generate readable documents, and, in an unrealistically ideal world, even be used by machines to validate compliance with the design. Test suites tend to be the implicit machine-readable part of the specification, though they aren't usually viewed as such. PDFs of Word docs in JIRAs are not the place for ongoing work, even if the early drafts can contain them. Given it's just as easy to point to markdown docs in GitHub by commit ID, that could be an alternative way to publish docs, with the document itself being viewed as one of the deliverables. When the time comes to update a document, it's there in the source tree to edit. If there's a flaw here, it's that design docs are just that: the design. The implementation may not match, and ongoing work will certainly diverge. If the design docs aren't kept in sync, they can mislead people. Accordingly, once the design docs are incorporated into the source tree, keeping them in sync with changes has to be viewed as essential, just like keeping tests up to date.

On 26 Apr 2015, at 22:34, Patrick Wendell pwend...@gmail.com wrote: I actually don't totally see why we can't use Google Docs provided it is clearly discoverable from the JIRA. It was my understanding that many projects do this. Maybe not (?). If it's a matter of maintaining a public record on ASF infrastructure, perhaps we can just automate that when an issue is closed we capture the doc content and attach it to the JIRA as a PDF. My sense is that in general the ASF infrastructure policy is becoming more and more lenient with regard to using third-party services, provided they are broadly accessible (such as a public Google Doc) and can be definitively archived on ASF-controlled storage. - Patrick

On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen so...@cloudera.com wrote: I know I recently used Google Docs from a JIRA, so am guilty as charged. I don't think there are a lot of design docs in general, but the ones I've seen have simply pushed docs to a JIRA. (I did the same, mirroring PDFs of the Google Doc.) I don't think this is hard to follow. I think you can do what you like: make a JIRA and attach files. Make a WIP PR and attach your notes. Make a Google Doc if you're feeling transgressive. I don't see much of a problem to solve here. In practice there are plenty of workable options, all of which are mainstream, and so I do not see an argument that somehow this is solved by letting people make wikis.

On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: Okay, I can understand wanting to keep Git history clean and avoid bottlenecking on committers. Is it reasonable to establish a convention of having a label, component, or (best of all) an issue type for issues that are associated with design docs? For example, if we used the existing Brainstorming issue type and people put their design doc in the description of the ticket, it would be relatively easy to figure out what designs are in progress. Given the push-back against design docs in Git or on the wiki and the strong preference for keeping docs on ASF property, I'm a bit surprised that all the existing design docs are on Google Docs. Perhaps Apache should consider opening up parts of the wiki to a larger group, to better serve this use case. Punya

On Fri, Apr 24, 2015 at 5:01 PM Patrick Wendell pwend...@gmail.com wrote: Using our ASF git repository as a working area for design docs seems potentially concerning to me. It's difficult process-wise because all commits need to go through committers, and also we'd pollute our git history a lot with random incremental design updates. The git history is used a lot by downstream packagers, by us during our QA process, etc., and we really try to keep it oriented around code patches: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog Committing a polished design doc along with a feature, maybe that's something we could consider. But I still think JIRA is the best location for these docs, consistent with what most other ASF projects that I know of do.

On Fri, Apr 24, 2015 at 1:19 PM, Cody Koeninger c...@koeninger.org wrote: Why can't pull requests be used for design docs in Git if people who aren't committers want to contribute changes (as opposed to just comments)?
Re: github pull request builder FAIL, now WIN(-ish)
sure, i'll kill all of the current spark prb builds...
Re: github pull request builder FAIL, now WIN(-ish)
never mind, looks like you guys are already on it. :)
Plans for upgrading Hive dependency?
Dear Spark devs, Is there a plan for staying up-to-date with current (and future) versions of Hive? Spark currently supports version 0.13 (June 2014), but the latest version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets about updating beyond 0.13, so I was wondering if this was intentional or it was just that nobody had started work on this yet. I'd be happy to work on a PR for the upgrade if one of the core developers can tell me what pitfalls to watch out for. Punya
Re: Design docs: consolidation and discoverability
Github's wiki is just another Git repo. If we use a separate repo, it's probably easiest to use the wiki git repo rather than the primary git repo. Punya

On Mon, Apr 27, 2015 at 1:50 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: Oh, a GitHub wiki (which is separate from having docs in a repo) is yet another approach we could take, though if we want to do that on the main Spark repo we'd need permission from Apache, which may be tough to get...
Re: github pull request builder FAIL, now WIN(-ish)
Shane - can we purge all the outstanding builds so we are not running stuff against stale PRs?
github pull request builder FAIL, now WIN(-ish)
somehow, the power outage on friday caused the pull request builder to lose its config entirely... i'm not sure why, but after i added the oauth token back, we're now catching up on the weekend's pull request builds. have i mentioned how much i hate this plugin? ;) sorry for the inconvenience... shane
Re: github pull request builder FAIL, now WIN(-ish)
anyways, the build queue is SLAMMED... we're going to need at least a day to catch up w/this. i'll be keeping an eye on system loads and whatnot all day today. whee!
Re: github pull request builder FAIL, now WIN(-ish)
And unfortunately, many Jenkins executor slots are being taken by stale Spark PRs...