creating hive packages for spark

2015-04-27 Thread Manku Timma
Hello Spark developers,
I want to understand the procedure to create the org.spark-project.hive
jars. Is this documented somewhere? I am having issues with -Phive-provided
with my private hive13 jars and want to check if using spark's procedure
helps.


Re: creating hive packages for spark

2015-04-27 Thread yash datta
Hi,

you can build spark-project hive from here:

https://github.com/pwendell/hive/tree/0.13.1-shaded-protobuf

Hope this helps.


On Mon, Apr 27, 2015 at 3:23 PM, Manku Timma manku.tim...@gmail.com wrote:

 Hello Spark developers,
 I want to understand the procedure to create the org.spark-project.hive
 jars. Is this documented somewhere? I am having issues with -Phive-provided
 with my private hive13 jars and want to check if using spark's procedure
 helps.




-- 
When events unfold with calm and ease
When the winds that blow are merely breeze
Learn from nature, from birds and bees
Live your life in love, and let joy not cease.


Re: Plans for upgrading Hive dependency?

2015-04-27 Thread Punyashloka Biswal
Thanks Marcelo and Patrick - I don't know how I missed that ticket in my
Jira search earlier. Is anybody working on the sub-issues yet, or is there
a design doc I should look at before taking a stab?

Regards,
Punya

On Mon, Apr 27, 2015 at 3:56 PM Patrick Wendell pwend...@gmail.com wrote:

 Hey Punya,

 There is some ongoing work to help make Hive upgrades more manageable
 and allow us to support multiple versions of Hive. Once we do that, it
 will be much easier for us to upgrade.

 https://issues.apache.org/jira/browse/SPARK-6906

 - Patrick

 On Mon, Apr 27, 2015 at 12:47 PM, Marcelo Vanzin van...@cloudera.com
 wrote:
  That's a lot more complicated than you might think.
 
  We've done some basic work to get HiveContext to compile against Hive
  1.1.0. Here's the code:
 
 https://github.com/cloudera/spark/commit/00e2c7e35d4ac236bcfbcd3d2805b483060255ec
 
  We didn't send that upstream because that only solves half of the
  problem; the hive-thriftserver is disabled in our CDH build because it
  uses a lot of Hive APIs that have been removed in 1.1.0, so even
  getting it to compile is really complicated.
 
  If there's interest in getting the HiveContext part fixed up I can
  send a PR for that code. But at this time I don't really have plans to
  look at the thrift server.
 
 
  On Mon, Apr 27, 2015 at 11:58 AM, Punyashloka Biswal
  punya.bis...@gmail.com wrote:
  Dear Spark devs,
 
  Is there a plan for staying up-to-date with current (and future)
 versions
  of Hive? Spark currently supports version 0.13 (June 2014), but the
 latest
  version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets
 about
  updating beyond 0.13, so I was wondering if this was intentional or it
 was
  just that nobody had started work on this yet.
 
  I'd be happy to work on a PR for the upgrade if one of the core
 developers
  can tell me what pitfalls to watch out for.
 
  Punya
 
 
 
  --
  Marcelo
 
 



Re: Plans for upgrading Hive dependency?

2015-04-27 Thread Marcelo Vanzin
That's a lot more complicated than you might think.

We've done some basic work to get HiveContext to compile against Hive
1.1.0. Here's the code:
https://github.com/cloudera/spark/commit/00e2c7e35d4ac236bcfbcd3d2805b483060255ec

We didn't send that upstream because that only solves half of the
problem; the hive-thriftserver is disabled in our CDH build because it
uses a lot of Hive APIs that have been removed in 1.1.0, so even
getting it to compile is really complicated.

If there's interest in getting the HiveContext part fixed up I can
send a PR for that code. But at this time I don't really have plans to
look at the thrift server.
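
For reference, one common way to let a single source tree tolerate APIs that
moved or disappeared between versions is a small reflection shim that resolves
a method by name at runtime. A minimal Scala sketch, making no claim about the
real Hive classes (whatever diverged between 0.13 and 1.1.0 would be looked up
this way):

import java.lang.reflect.Method

object HiveShimSketch {
  // Find a public method by name and arity on the target's class, so the
  // source compiles even when the linked Hive version lacks the method.
  def invokeByName(target: AnyRef, name: String, args: AnyRef*): AnyRef = {
    val method: Method = target.getClass.getMethods
      .find(m => m.getName == name && m.getParameterTypes.length == args.length)
      .getOrElse(throw new NoSuchMethodException(name))
    method.invoke(target, args: _*)
  }
}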


On Mon, Apr 27, 2015 at 11:58 AM, Punyashloka Biswal
punya.bis...@gmail.com wrote:
 Dear Spark devs,

 Is there a plan for staying up-to-date with current (and future) versions
 of Hive? Spark currently supports version 0.13 (June 2014), but the latest
 version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets about
 updating beyond 0.13, so I was wondering if this was intentional or it was
 just that nobody had started work on this yet.

 I'd be happy to work on a PR for the upgrade if one of the core developers
 can tell me what pitfalls to watch out for.

 Punya



-- 
Marcelo




Re: Is there any particular reason why there's no Java counterpart in Streaming Guide's Design Patterns for using foreachRDD section?

2015-04-27 Thread Sean Owen
My guess is, since it says "for example (in Scala)", that this started
as Scala-only and then Python was tacked on as a one-off, and Java
never got added. I think you'd be welcome to add it. It's not an
obscure example, and it's one people might want to see in Java.
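
For reference, the Scala snippet in that section boils down to the pattern
below, sketched here with JDBC standing in for the guide's generic connection
pool (the driver URL and table name are placeholders):

import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

object ForeachRDDPattern {
  def saveToDb(records: DStream[String]): Unit = {
    records.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Open the connection on the worker, inside foreachPartition, so it
        // is shared by the partition's records and never serialized with the
        // closure.
        val conn = DriverManager.getConnection("jdbc:h2:mem:sketch")
        try {
          val stmt = conn.prepareStatement("INSERT INTO events VALUES (?)")
          partition.foreach { r => stmt.setString(1, r); stmt.executeUpdate() }
        } finally {
          conn.close()
        }
      }
    }
  }
}

The Java counterpart would be a mechanical translation of this structure.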

On Mon, Apr 27, 2015 at 4:34 AM, Emre Sevinc emre.sev...@gmail.com wrote:
 Hello,

 Is there any particular reason why there's no Java counterpart in the
 Streaming Guide's "Design Patterns for using foreachRDD" section?

https://spark.apache.org/docs/latest/streaming-programming-guide.html

 Up to that point, each source code example includes corresponding Java (and
 sometimes Python) source code for the Scala examples, but in the section
 "Design Patterns for using foreachRDD", the code examples are only in Scala
 and Python.

 After that section comes "DataFrame and SQL Operations", and it continues
 giving examples in Scala, Java, and Python.

 The reason I'm asking: if there's no particular reason, maybe I can open a
 JIRA ticket and contribute to that part of the documentation?

 --
 Emre Sevinç




Exception in using updateStateByKey

2015-04-27 Thread Sea
Hi, all:
I use updateStateByKey in Spark Streaming and need to store the state for
one minute. I set spark.cleaner.ttl to 120 and the batch duration to 2
seconds, but it throws an exception:




Caused by: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)


at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)



Why?


My code is:

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 2)  # 2-second batch interval
kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})
kvs.window(60, 2).map(lambda x: analyzeMessage(x[1])) \
    .filter(lambda x: x[1] is not None).updateStateByKey(updateStateFunc) \
    .filter(lambda x: x[1]['isExisted'] != 1) \
    .foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))
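
For reference: updateStateByKey requires checkpointing to be enabled on the
context, and per-key state is dropped by returning None from the update
function, not by spark.cleaner.ttl (which periodically clears old metadata
and persisted data, and so may delete files a long-running stateful stream
still needs). A minimal Scala sketch of that setup, with a placeholder source
and checkpoint path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("stateful-sketch"), Seconds(2))
    ssc.checkpoint("hdfs:///tmp/stateful-sketch-ck")  // placeholder path

    val pairs = ssc.socketTextStream("localhost", 9999).map(word => (word, 1L))
    val counts = pairs.updateStateByKey[Long] {
      (values: Seq[Long], state: Option[Long]) =>
        val sum = values.sum + state.getOrElse(0L)
        if (sum == 0L) None else Some(sum)  // returning None drops the key's state
    }
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}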

java.lang.StackOverflowError when recovery from checkpoint in Streaming

2015-04-27 Thread wyphao.2007
 Hi everyone, I am using

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kafkaParams, topicsSet)

to read data from Kafka (1k/second) and store it in windows; the code snippet
is as follows:

val windowedStreamChannel = streamChannel
  .combineByKey[TreeSet[Obj]](TreeSet[Obj](_), _ += _, _ ++= _,
    new HashPartitioner(numPartition))
  .reduceByKeyAndWindow((x: TreeSet[Obj], y: TreeSet[Obj]) => x ++= y,
    (x: TreeSet[Obj], y: TreeSet[Obj]) => x --= y, Minutes(60),
    Seconds(2), numPartition,
    (item: (String, TreeSet[Obj])) => item._2.size != 0)

After the application had run for an hour, I killed it and restarted it from
the checkpoint directory, but I encountered this exception:

2015-04-27 17:52:40,955 INFO  [Driver] - Slicing from 1430126222000 ms to
1430126222000 ms (aligned to 1430126222000 ms and 1430126222000 ms)
2015-04-27 17:52:40,958 ERROR [Driver] - User class threw exception: null
java.lang.StackOverflowError
at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
at java.io.File.exists(File.java:813)
at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1080)
at sun.misc.URLClassPath.getResource(URLClassPath.java:199)
at java.net.URLClassLoader$1.run(URLClassLoader.java:358)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
at org.apache.spark.rdd.RDD.filter(RDD.scala:303)
at 
org.apache.spark.streaming.dstream.FilteredDStream$$anonfun$compute$1.apply(FilteredDStream.scala:35)
at 
org.apache.spark.streaming.dstream.FilteredDStream$$anonfun$compute$1.apply(FilteredDStream.scala:35)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.streaming.dstream.FilteredDStream.compute(FilteredDStream.scala:35)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at 
org.apache.spark.streaming.dstream.FlatMappedDStream.compute(FlatMappedDStream.scala:35)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at 
org.apache.spark.streaming.dstream.FilteredDStream.compute(FilteredDStream.scala:35)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at 
org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
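
For reference, the checkpoint-recovery pattern the streaming guide documents
is to build the entire DStream graph inside the function handed to
StreamingContext.getOrCreate, so that a restart restores the graph from the
checkpoint instead of re-wiring it by hand. A minimal Scala sketch under that
assumption (the source, path, and window sizes are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object RecoverableSketch {
  val checkpointDir = "hdfs:///tmp/recoverable-sketch-ck"  // placeholder path

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("recoverable-sketch"), Seconds(2))
    ssc.checkpoint(checkpointDir)
    // All stream setup lives here: it runs on a clean start, and is skipped
    // on restart because the graph comes back from the checkpoint.
    ssc.socketTextStream("localhost", 9999)
      .map(word => (word, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(60), Seconds(2))
      .print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}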

Re: Design docs: consolidation and discoverability

2015-04-27 Thread Nicholas Chammas
I like the idea of having design docs be kept up to date and tracked in
git.

If the Apache repo isn't a good fit, perhaps we can have a separate repo
just for design docs? Maybe something like github.com/spark-docs/spark-docs/
?

If there's other stuff we want to track but haven't, perhaps we can
generalize the purpose of the repo a bit and rename it accordingly (e.g.
spark-misc/spark-misc).

Nick

On Mon, Apr 27, 2015 at 1:21 PM Sandy Ryza sandy.r...@cloudera.com wrote:

 My only issue with Google Docs is that they're mutable, so it's difficult
 to follow a design's history through its revisions and link up JIRA
 comments with the relevant version.

 -Sandy

 On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran ste...@hortonworks.com
 wrote:

 
  One thing to consider is that while docs as PDFs in JIRAs do document the
  original proposal, that's not the place to keep living specifications.
 That
  stuff needs to live in SCM, in a format which can be easily maintained,
 can
  generate readable documents, and, in an unrealistically ideal world, even
  be used by machines to validate compliance with the design. Test suites
  tend to be the implicit machine-readable part of the specification,
 though
  they aren't usually viewed as such.
 
  PDFs of word docs in JIRAs are not the place for ongoing work, even if
 the
  early drafts can contain them. Given it's just as easy to point to
 markdown
  docs in github by commit ID, that could be an alternative way to publish
  docs, with the document itself being viewed as one of the deliverables.
   When the time comes to update a document, then it's there in the source
 tree
  to edit.
 
   If there's a flaw here, it's that design docs are that: the design. The
  implementation may not match, ongoing work will certainly diverge. If the
  design docs aren't kept in sync, then they can mislead people.
 Accordingly,
  once the design docs are incorporated into the source tree, keeping them
 in
   sync with changes has to be viewed as being as essential as keeping tests up to date
 
   On 26 Apr 2015, at 22:34, Patrick Wendell pwend...@gmail.com wrote:
  
   I actually don't totally see why we can't use Google Docs provided it
   is clearly discoverable from the JIRA. It was my understanding that
   many projects do this. Maybe not (?).
  
   If it's a matter of maintaining public record on ASF infrastructure,
   perhaps we can just automate that if an issue is closed we capture the
   doc content and attach it to the JIRA as a PDF.
  
   My sense is that in general the ASF infrastructure policy is becoming
   more and more lenient with regards to using third party services,
    provided they are broadly accessible (such as a public google doc) and
   can be definitively archived on ASF controlled storage.
  
   - Patrick
  
   On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen so...@cloudera.com wrote:
   I know I recently used Google Docs from a JIRA, so am guilty as
   charged. I don't think there are a lot of design docs in general, but
   the ones I've seen have simply pushed docs to a JIRA. (I did the same,
   mirroring PDFs of the Google Doc.) I don't think this is hard to
   follow.
  
   I think you can do what you like: make a JIRA and attach files. Make a
   WIP PR and attach your notes. Make a Google Doc if you're feeling
   transgressive.
  
   I don't see much of a problem to solve here. In practice there are
   plenty of workable options, all of which are mainstream, and so I do
   not see an argument that somehow this is solved by letting people make
   wikis.
  
   On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal
   punya.bis...@gmail.com wrote:
   Okay, I can understand wanting to keep Git history clean, and avoid
   bottlenecking on committers. Is it reasonable to establish a
  convention of
   having a label, component or (best of all) an issue type for issues
  that are
   associated with design docs? For example, if we used the existing
    "Brainstorming" issue type, and people put their design doc in the
   description of the ticket, it would be relatively easy to figure out
  what
   designs are in progress.
  
   Given the push-back against design docs in Git or on the wiki and the
  strong
   preference for keeping docs on ASF property, I'm a bit surprised that
  all
   the existing design docs are on Google Docs. Perhaps Apache should
  consider
   opening up parts of the wiki to a larger group, to better serve this
  use
   case.
  
   Punya
  
   On Fri, Apr 24, 2015 at 5:01 PM Patrick Wendell pwend...@gmail.com
  wrote:
  
    Using our ASF git repository as a working area for design docs seems
    potentially concerning to me. It's difficult process-wise
   because all commits need to go through committers and also, we'd
   pollute our git history a lot with random incremental design
 updates.
  
   The git history is used a lot by downstream packagers, us during our
   QA process, etc... we really try to keep it oriented around code
   patches:
  
   

Re: Design docs: consolidation and discoverability

2015-04-27 Thread Punyashloka Biswal
Nick, I like your idea of keeping it in a separate git repository. It seems
to combine the advantages of the present Google Docs approach with the
crisper history, discoverability, and text format simplicity of GitHub
wikis.

Punya
On Mon, Apr 27, 2015 at 1:30 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 I like the idea of having design docs be kept up to date and tracked in
 git.

 If the Apache repo isn't a good fit, perhaps we can have a separate repo
 just for design docs? Maybe something like
 github.com/spark-docs/spark-docs/
 ?

 If there's other stuff we want to track but haven't, perhaps we can
 generalize the purpose of the repo a bit and rename it accordingly (e.g.
 spark-misc/spark-misc).

 Nick

 On Mon, Apr 27, 2015 at 1:21 PM Sandy Ryza sandy.r...@cloudera.com
 wrote:

  My only issue with Google Docs is that they're mutable, so it's difficult
  to follow a design's history through its revisions and link up JIRA
  comments with the relevant version.
 
  -Sandy
 
  On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran ste...@hortonworks.com
  wrote:
 
  
   One thing to consider is that while docs as PDFs in JIRAs do document
 the
   original proposal, that's not the place to keep living specifications.
  That
   stuff needs to live in SCM, in a format which can be easily maintained,
  can
   generate readable documents, and, in an unrealistically ideal world,
 even
   be used by machines to validate compliance with the design. Test suites
   tend to be the implicit machine-readable part of the specification,
  though
   they aren't usually viewed as such.
  
   PDFs of word docs in JIRAs are not the place for ongoing work, even if
  the
   early drafts can contain them. Given it's just as easy to point to
  markdown
   docs in github by commit ID, that could be an alternative way to
 publish
   docs, with the document itself being viewed as one of the deliverables.
    When the time comes to update a document, then it's there in the source
  tree
   to edit.
  
    If there's a flaw here, it's that design docs are that: the design. The
   implementation may not match, ongoing work will certainly diverge. If
 the
   design docs aren't kept in sync, then they can mislead people.
  Accordingly,
   once the design docs are incorporated into the source tree, keeping
 them
  in
    sync with changes has to be viewed as being as essential as keeping tests up to
 date
  
On 26 Apr 2015, at 22:34, Patrick Wendell pwend...@gmail.com
 wrote:
   
I actually don't totally see why we can't use Google Docs provided it
is clearly discoverable from the JIRA. It was my understanding that
many projects do this. Maybe not (?).
   
If it's a matter of maintaining public record on ASF infrastructure,
perhaps we can just automate that if an issue is closed we capture
 the
doc content and attach it to the JIRA as a PDF.
   
My sense is that in general the ASF infrastructure policy is becoming
more and more lenient with regards to using third party services,
provided they are broadly accessible (such as a public google doc) and
can be definitively archived on ASF controlled storage.
   
- Patrick
   
On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen so...@cloudera.com
 wrote:
I know I recently used Google Docs from a JIRA, so am guilty as
charged. I don't think there are a lot of design docs in general,
 but
the ones I've seen have simply pushed docs to a JIRA. (I did the
 same,
mirroring PDFs of the Google Doc.) I don't think this is hard to
follow.
   
I think you can do what you like: make a JIRA and attach files.
 Make a
WIP PR and attach your notes. Make a Google Doc if you're feeling
transgressive.
   
I don't see much of a problem to solve here. In practice there are
plenty of workable options, all of which are mainstream, and so I do
not see an argument that somehow this is solved by letting people
 make
wikis.
   
On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal
punya.bis...@gmail.com wrote:
Okay, I can understand wanting to keep Git history clean, and avoid
bottlenecking on committers. Is it reasonable to establish a
   convention of
having a label, component or (best of all) an issue type for issues
   that are
associated with design docs? For example, if we used the existing
"Brainstorming" issue type, and people put their design doc in the
description of the ticket, it would be relatively easy to figure
 out
   what
designs are in progress.
   
Given the push-back against design docs in Git or on the wiki and
 the
   strong
preference for keeping docs on ASF property, I'm a bit surprised
 that
   all
the existing design docs are on Google Docs. Perhaps Apache should
   consider
opening up parts of the wiki to a larger group, to better serve
 this
   use
case.
   
Punya
   
On Fri, Apr 24, 2015 at 5:01 PM Patrick Wendell 
 pwend...@gmail.com
   wrote:
   

Re: Design docs: consolidation and discoverability

2015-04-27 Thread Sandy Ryza
My only issue with Google Docs is that they're mutable, so it's difficult
to follow a design's history through its revisions and link up JIRA
comments with the relevant version.

-Sandy

On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran ste...@hortonworks.com
wrote:


 One thing to consider is that while docs as PDFs in JIRAs do document the
 original proposal, that's not the place to keep living specifications. That
 stuff needs to live in SCM, in a format which can be easily maintained, can
 generate readable documents, and, in an unrealistically ideal world, even
 be used by machines to validate compliance with the design. Test suites
 tend to be the implicit machine-readable part of the specification, though
 they aren't usually viewed as such.

 PDFs of word docs in JIRAs are not the place for ongoing work, even if the
 early drafts can contain them. Given it's just as easy to point to markdown
 docs in github by commit ID, that could be an alternative way to publish
 docs, with the document itself being viewed as one of the deliverables.
 When the time comes to update a document, then it's there in the source tree
 to edit.

 If there's a flaw here, it's that design docs are that: the design. The
 implementation may not match, ongoing work will certainly diverge. If the
 design docs aren't kept in sync, then they can mislead people. Accordingly,
 once the design docs are incorporated into the source tree, keeping them in
 sync with changes has to be viewed as being as essential as keeping tests up to date

  On 26 Apr 2015, at 22:34, Patrick Wendell pwend...@gmail.com wrote:
 
  I actually don't totally see why we can't use Google Docs provided it
  is clearly discoverable from the JIRA. It was my understanding that
  many projects do this. Maybe not (?).
 
  If it's a matter of maintaining public record on ASF infrastructure,
  perhaps we can just automate that if an issue is closed we capture the
  doc content and attach it to the JIRA as a PDF.
 
  My sense is that in general the ASF infrastructure policy is becoming
  more and more lenient with regards to using third party services,
  provided they are broadly accessible (such as a public google doc) and
  can be definitively archived on ASF controlled storage.
 
  - Patrick
 
  On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen so...@cloudera.com wrote:
  I know I recently used Google Docs from a JIRA, so am guilty as
  charged. I don't think there are a lot of design docs in general, but
  the ones I've seen have simply pushed docs to a JIRA. (I did the same,
  mirroring PDFs of the Google Doc.) I don't think this is hard to
  follow.
 
  I think you can do what you like: make a JIRA and attach files. Make a
  WIP PR and attach your notes. Make a Google Doc if you're feeling
  transgressive.
 
  I don't see much of a problem to solve here. In practice there are
  plenty of workable options, all of which are mainstream, and so I do
  not see an argument that somehow this is solved by letting people make
  wikis.
 
  On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal
  punya.bis...@gmail.com wrote:
  Okay, I can understand wanting to keep Git history clean, and avoid
  bottlenecking on committers. Is it reasonable to establish a
 convention of
  having a label, component or (best of all) an issue type for issues
 that are
  associated with design docs? For example, if we used the existing
  "Brainstorming" issue type, and people put their design doc in the
  description of the ticket, it would be relatively easy to figure out
 what
  designs are in progress.
 
  Given the push-back against design docs in Git or on the wiki and the
 strong
  preference for keeping docs on ASF property, I'm a bit surprised that
 all
  the existing design docs are on Google Docs. Perhaps Apache should
 consider
  opening up parts of the wiki to a larger group, to better serve this
 use
  case.
 
  Punya
 
  On Fri, Apr 24, 2015 at 5:01 PM Patrick Wendell pwend...@gmail.com
 wrote:
 
  Using our ASF git repository as a working area for design docs seems
  potentially concerning to me. It's difficult process-wise
  because all commits need to go through committers and also, we'd
  pollute our git history a lot with random incremental design updates.
 
  The git history is used a lot by downstream packagers, us during our
  QA process, etc... we really try to keep it oriented around code
  patches:
 
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog
 
  Committing a polished design doc along with a feature, maybe that's
  something we could consider. But I still think JIRA is the best
  location for these docs, consistent with what most other ASF projects
  that I know of do.
 
  On Fri, Apr 24, 2015 at 1:19 PM, Cody Koeninger c...@koeninger.org
  wrote:
  Why can't pull requests be used for design docs in Git if people who
  aren't
  committers want to contribute changes (as opposed to just comments)?
 
  On Fri, Apr 24, 2015 at 2:57 PM, Sean Owen so...@cloudera.com
 wrote:
 
 

Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
sure, i'll kill all of the current spark prb builds...

On Mon, Apr 27, 2015 at 11:34 AM, Reynold Xin r...@databricks.com wrote:

 Shane - can we purge all the outstanding builds so we are not running
 stuff against stale PRs?


 On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 And unfortunately, many Jenkins executor slots are being taken by stale
 Spark PRs...

 On Mon, Apr 27, 2015 at 2:25 PM shane knapp skn...@berkeley.edu wrote:

  anyways, the build queue is SLAMMED...  we're going to need at least a
 day
  to catch up w/this.  i'll be keeping an eye on system loads and whatnot
 all
  day today.
 
  whee!
 
  On Mon, Apr 27, 2015 at 11:18 AM, shane knapp skn...@berkeley.edu
 wrote:
 
   somehow, the power outage on friday caused the pull request builder to
    lose its config entirely...  i'm not sure why, but after i added the
  oauth
   token back, we're now catching up on the weekend's pull request
 builds.
  
   have i mentioned how much i hate this plugin?  ;)
  
   sorry for the inconvenience...
  
   shane
  
 





Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
never mind, looks like you guys are already on it.  :)

On Mon, Apr 27, 2015 at 11:35 AM, shane knapp skn...@berkeley.edu wrote:

 sure, i'll kill all of the current spark prb builds...

 On Mon, Apr 27, 2015 at 11:34 AM, Reynold Xin r...@databricks.com wrote:

 Shane - can we purge all the outstanding builds so we are not running
 stuff against stale PRs?


 On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 And unfortunately, many Jenkins executor slots are being taken by stale
 Spark PRs...

 On Mon, Apr 27, 2015 at 2:25 PM shane knapp skn...@berkeley.edu wrote:

  anyways, the build queue is SLAMMED...  we're going to need at least a
 day
  to catch up w/this.  i'll be keeping an eye on system loads and
 whatnot all
  day today.
 
  whee!
 
  On Mon, Apr 27, 2015 at 11:18 AM, shane knapp skn...@berkeley.edu
 wrote:
 
   somehow, the power outage on friday caused the pull request builder
 to
    lose its config entirely...  i'm not sure why, but after i added the
  oauth
   token back, we're now catching up on the weekend's pull request
 builds.
  
   have i mentioned how much i hate this plugin?  ;)
  
   sorry for the inconvenience...
  
   shane
  
 






Plans for upgrading Hive dependency?

2015-04-27 Thread Punyashloka Biswal
Dear Spark devs,

Is there a plan for staying up-to-date with current (and future) versions
of Hive? Spark currently supports version 0.13 (June 2014), but the latest
version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets about
updating beyond 0.13, so I was wondering if this was intentional or it was
just that nobody had started work on this yet.

I'd be happy to work on a PR for the upgrade if one of the core developers
can tell me what pitfalls to watch out for.

Punya


Re: Design docs: consolidation and discoverability

2015-04-27 Thread Punyashloka Biswal
GitHub's wiki is just another Git repo. If we use a separate repo, it's
probably easiest to use the wiki git repo rather than the primary git
repo.

Punya

On Mon, Apr 27, 2015 at 1:50 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 Oh, a GitHub wiki (which is separate from having docs in a repo) is yet
 another approach we could take, though if we want to do that on the main
 Spark repo we'd need permission from Apache, which may be tough to get...

 On Mon, Apr 27, 2015 at 1:47 PM Punyashloka Biswal punya.bis...@gmail.com
 wrote:

 Nick, I like your idea of keeping it in a separate git repository. It
 seems to combine the advantages of the present Google Docs approach with
 the crisper history, discoverability, and text format simplicity of GitHub
 wikis.

 Punya
 On Mon, Apr 27, 2015 at 1:30 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I like the idea of having design docs be kept up to date and tracked in
 git.

 If the Apache repo isn't a good fit, perhaps we can have a separate repo
 just for design docs? Maybe something like
 github.com/spark-docs/spark-docs/
 ?

 If there's other stuff we want to track but haven't, perhaps we can
 generalize the purpose of the repo a bit and rename it accordingly (e.g.
 spark-misc/spark-misc).

 Nick

 On Mon, Apr 27, 2015 at 1:21 PM Sandy Ryza sandy.r...@cloudera.com
 wrote:

  My only issue with Google Docs is that they're mutable, so it's
 difficult
  to follow a design's history through its revisions and link up JIRA
  comments with the relevant version.
 
  -Sandy
 
  On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran 
 ste...@hortonworks.com
  wrote:
 
  
   One thing to consider is that while docs as PDFs in JIRAs do
 document the
   original proposal, that's not the place to keep living
 specifications.
  That
   stuff needs to live in SCM, in a format which can be easily
 maintained,
  can
   generate readable documents, and, in an unrealistically ideal world,
 even
   be used by machines to validate compliance with the design. Test
 suites
   tend to be the implicit machine-readable part of the specification,
  though
   they aren't usually viewed as such.
  
   PDFs of word docs in JIRAs are not the place for ongoing work, even
 if
  the
   early drafts can contain them. Given it's just as easy to point to
  markdown
   docs in github by commit ID, that could be an alternative way to
 publish
   docs, with the document itself being viewed as one of the
 deliverables.
    When the time comes to update a document, then it's there in the
 source
  tree
   to edit.
  
    If there's a flaw here, it's that design docs are that: the design.
 The
   implementation may not match, ongoing work will certainly diverge.
 If the
   design docs aren't kept in sync, then they can mislead people.
  Accordingly,
   once the design docs are incorporated into the source tree, keeping
 them
  in
    sync with changes has to be viewed as being as essential as keeping tests up to
 date
  
On 26 Apr 2015, at 22:34, Patrick Wendell pwend...@gmail.com
 wrote:
   
I actually don't totally see why we can't use Google Docs provided
 it
is clearly discoverable from the JIRA. It was my understanding that
many projects do this. Maybe not (?).
   
If it's a matter of maintaining public record on ASF
 infrastructure,
perhaps we can just automate that if an issue is closed we capture
 the
doc content and attach it to the JIRA as a PDF.
   
My sense is that in general the ASF infrastructure policy is
 becoming
more and more lenient with regards to using third party services,
provided they are broadly accessible (such as a public google doc)
 and
can be definitively archived on ASF controlled storage.
   
- Patrick
   
On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen so...@cloudera.com
 wrote:
I know I recently used Google Docs from a JIRA, so am guilty as
charged. I don't think there are a lot of design docs in general,
 but
the ones I've seen have simply pushed docs to a JIRA. (I did the
 same,
mirroring PDFs of the Google Doc.) I don't think this is hard to
follow.
   
I think you can do what you like: make a JIRA and attach files.
 Make a
WIP PR and attach your notes. Make a Google Doc if you're feeling
transgressive.
   
I don't see much of a problem to solve here. In practice there are
plenty of workable options, all of which are mainstream, and so I
 do
not see an argument that somehow this is solved by letting people
 make
wikis.
   
On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal
punya.bis...@gmail.com wrote:
Okay, I can understand wanting to keep Git history clean, and
 avoid
bottlenecking on committers. Is it reasonable to establish a
   convention of
having a label, component or (best of all) an issue type for
 issues
   that are
associated with design docs? For example, if we used the existing
"Brainstorming" issue type, and people put their design doc 

Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread Reynold Xin
Shane - can we purge all the outstanding builds so we are not running stuff
against stale PRs?


On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 And unfortunately, many Jenkins executor slots are being taken by stale
 Spark PRs...

 On Mon, Apr 27, 2015 at 2:25 PM shane knapp skn...@berkeley.edu wrote:

  anyways, the build queue is SLAMMED...  we're going to need at least a
 day
  to catch up w/this.  i'll be keeping an eye on system loads and whatnot
 all
  day today.
 
  whee!
 
  On Mon, Apr 27, 2015 at 11:18 AM, shane knapp skn...@berkeley.edu
 wrote:
 
   somehow, the power outage on friday caused the pull request builder to
    lose its config entirely...  i'm not sure why, but after i added the
  oauth
   token back, we're now catching up on the weekend's pull request builds.
  
   have i mentioned how much i hate this plugin?  ;)
  
   sorry for the inconvenience...
  
   shane
  
 



github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
somehow, the power outage on friday caused the pull request builder to lose
its config entirely...  i'm not sure why, but after i added the oauth
token back, we're now catching up on the weekend's pull request builds.

have i mentioned how much i hate this plugin?  ;)

sorry for the inconvenience...

shane


Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
anyways, the build queue is SLAMMED...  we're going to need at least a day
to catch up w/this.  i'll be keeping an eye on system loads and whatnot all
day today.

whee!

On Mon, Apr 27, 2015 at 11:18 AM, shane knapp skn...@berkeley.edu wrote:

 somehow, the power outage on friday caused the pull request builder to
 lose its config entirely...  i'm not sure why, but after i added the oauth
 token back, we're now catching up on the weekend's pull request builds.

 have i mentioned how much i hate this plugin?  ;)

 sorry for the inconvenience...

 shane



Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread Nicholas Chammas
And unfortunately, many Jenkins executor slots are being taken by stale
Spark PRs...

On Mon, Apr 27, 2015 at 2:25 PM shane knapp skn...@berkeley.edu wrote:

 anyways, the build queue is SLAMMED...  we're going to need at least a day
 to catch up w/this.  i'll be keeping an eye on system loads and whatnot all
 day today.

 whee!

 On Mon, Apr 27, 2015 at 11:18 AM, shane knapp skn...@berkeley.edu wrote:

  somehow, the power outage on friday caused the pull request builder to
  lose its config entirely...  i'm not sure why, but after i added the
 oauth
  token back, we're now catching up on the weekend's pull request builds.
 
  have i mentioned how much i hate this plugin?  ;)
 
  sorry for the inconvenience...
 
  shane