Spark streaming get RDD within the sliding window

2016-08-24 Thread Ulanov, Alexander
Dear Spark developers, I am working with Spark Streaming 1.6.1. The task is to get RDDs for some external analytics from each time window. This external function accepts an RDD, so I cannot use a DStream. I learned that DStream.window.compute(time) returns Option[RDD]. I am trying to use it in the
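
A minimal sketch of one way to do this, using foreachRDD on a windowed DStream instead of calling compute(time) by hand; the externalAnalytics function, the socket source and the durations below are illustrative assumptions, not from the original mail:

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WindowedAnalytics {
      // Hypothetical external function that only accepts an RDD.
      def externalAnalytics(rdd: RDD[String]): Unit = println(s"count = ${rdd.count()}")

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("windowed-analytics").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))      // batch interval
        val lines = ssc.socketTextStream("localhost", 9999)   // any input DStream works here

        // Every 10 seconds, pass the RDD covering the last 30 seconds to the external function.
        lines.window(Seconds(30), Seconds(10)).foreachRDD { rdd =>
          externalAnalytics(rdd)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }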

StateStore with DStreams

2016-08-24 Thread Matt Smith
Are there any examples of how to use StateStore with DStreams? It seems like the idea would be to create a new version with each mini-batch, but I don't quite know how to make that happen. My lame attempt is below. def run(ss: SparkSession): Unit = { val c = new
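
If this refers to the internal org.apache.spark.sql.execution.streaming.state.StateStore, there is no public DStream hook for it; the documented way to keep per-key state that is updated on every mini-batch with DStreams is mapWithState. A minimal sketch under that assumption (the socket source and word-count state are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    object StatefulCounts {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("stateful-counts").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("/tmp/stateful-counts")   // mapWithState requires a checkpoint directory

        val pairs = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))

        // Update the running count for a key on every mini-batch it appears in.
        val updateCount = (word: String, one: Option[Int], state: State[Int]) => {
          val newCount = state.getOption.getOrElse(0) + one.getOrElse(0)
          state.update(newCount)
          (word, newCount)
        }

        pairs.mapWithState(StateSpec.function(updateCount)).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }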

Re: Tree for SQL Query

2016-08-24 Thread Reynold Xin
It's basically the output of the explain command. On Wed, Aug 24, 2016 at 12:31 PM, Maciej Bryński wrote: > Hi, > I read this article: > https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html > > And I have a question. Is it possible to
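
For anyone following along, the trees described in that article can be printed from any Dataset via explain or its queryExecution field; a small sketch with a throwaway query:

    import org.apache.spark.sql.SparkSession

    object InspectPlans {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("inspect-plans").master("local[*]").getOrCreate()
        import spark.implicits._

        val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name").filter($"id" > 1)

        df.explain(true)                          // parsed, analyzed, optimized and physical plans
        println(df.queryExecution.logical)        // the tree before analysis
        println(df.queryExecution.optimizedPlan)  // the tree after the Catalyst optimizer rules

        spark.stop()
      }
    }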

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Sean Owen
If you're just varying versions (or things that can be controlled by a profile, which is most everything including dependencies), you don't need and probably don't want multiple POM files. Even that wouldn't mean you can't use classifiers. I have seen it used for HBase, core Hadoop. I am not sure

Re: GraphFrames 0.2.0 released

2016-08-24 Thread Maciej Bryński
Hi, Do you plan to add a tag for this release on GitHub? https://github.com/graphframes/graphframes/releases Regards, Maciek 2016-08-17 3:18 GMT+02:00 Jacek Laskowski : > Hi Tim, > > AWESOME. Thanks a lot for releasing it. That makes me even more eager > to see it in Spark's

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Michael Heuer
Have you seen any successful applications of this for Spark 1.x/2.x? From the doc "The classifier allows to distinguish artifacts that were built from the same POM but differ in their content." We'd be building from different POMs, since we'd be modifying the Spark dependency version (and

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Sean Owen
This is also what "classifiers" are for in Maven, to have variations on one artifact and version. https://maven.apache.org/pom.html It has been used to ship code for Hadoop 1 vs 2 APIs. In a way it's the same idea as Scala's "_2.xx" naming convention, with a less unfortunate implementation. On
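
On the consuming side, a classifier shows up as an extra coordinate; in an sbt build it looks like the following (the group, artifact, version and classifier name here are made up for illustration):

    // build.sbt -- coordinates and classifier are illustrative only
    libraryDependencies += "org.example" %% "some-library" % "1.0.0" classifier "spark2"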

Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-08-24 Thread Reynold Xin
Looks like in general people like it. Next step is for somebody to take the lead and implement it. Tom, do you have cycles to do this? On Wednesday, August 24, 2016, Tom Graves wrote: > ping, did this discussion conclude or did we decide what we are doing? > > Tom > > >

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Michael Heuer
Ah yes, thank you for the clarification. On Wed, Aug 24, 2016 at 11:44 AM, Ted Yu wrote: > 'Spark 1.x and Scala 2.10 & 2.11' was repeated. > > I guess your second line should read: > > org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1] for Spark 2.x > and Scala 2.10

Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Ted Yu
'Spark 1.x and Scala 2.10 & 2.11' was repeated. I guess your second line should read: org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1] for Spark 2.x and Scala 2.10 & 2.11 On Wed, Aug 24, 2016 at 9:41 AM, Michael Heuer wrote: > Hello, > > We're a project downstream

Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Michael Heuer
Hello, We're a project downstream of Spark and need to provide separate artifacts for Spark 1.x and Spark 2.x. Has any convention been established or even proposed for artifact names and/or qualifiers? We are currently thinking org.bdgenomics.adam:adam-{core,apis,cli}_2.1[0,1] for Spark 1.x
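
Put concretely, the proposal (together with the -spark2 suffix Ted suggests in his reply above) would look like this to an sbt user; the versions are placeholders:

    // build.sbt -- versions are placeholders
    libraryDependencies += "org.bdgenomics.adam" % "adam-core_2.10"        % "x.y.z"  // Spark 1.x, Scala 2.10
    libraryDependencies += "org.bdgenomics.adam" % "adam-core-spark2_2.11" % "x.y.z"  // Spark 2.x, Scala 2.11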

Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-24 Thread Michael Allman
FYI, I've updated the issue's description to include a very simple program which reproduces the issue for me. Cheers, Michael > On Aug 23, 2016, at 4:54 PM, Michael Allman wrote: > > I've replied on the issue's page, but in a word, "yes". See >
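
The reproduction itself is attached to the issue; as a rough idea of the shape such a program takes (this is a guess, not the program from the issue), replicated off-heap persistence is requested with a custom StorageLevel and needs a cluster with at least two executors for the replication to do anything:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object OffHeapReplicatedPersist {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("offheap-replicated")
          .config("spark.memory.offHeap.enabled", "true")
          .config("spark.memory.offHeap.size", "256m")
          .getOrCreate()

        // useDisk = false, useMemory = true, useOffHeap = true, deserialized = false, replication = 2
        val level = StorageLevel(false, true, true, false, 2)

        val rdd = spark.sparkContext.parallelize(1 to 1000000).persist(level)
        println(rdd.count())

        spark.stop()
      }
    }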

Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-08-24 Thread Tom Graves
Ping, did this discussion conclude or did we decide what we are doing? Tom On Friday, May 13, 2016 3:19 PM, Michael Armbrust wrote: +1 to the general structure of Reynold's proposal. I've found what we do currently a little confusing. In particular, it

Re: Spark dev-setup

2016-08-24 Thread Jacek Laskowski
On Wed, Aug 24, 2016 at 2:32 PM, Steve Loughran wrote: > no reason; the key thing is: not in cluster mode, as there your work happens > elsewhere Right! Anything but cluster mode should make it easy (that leaves us with local). Jacek
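
For the original question (stepping through Spark SQL's query execution in IntelliJ), something like the following, run in local mode straight from the IDE with breakpoints set in the Catalyst/execution classes, is usually enough; a minimal sketch:

    import org.apache.spark.sql.SparkSession

    object DebugQueryExecution {
      def main(args: Array[String]): Unit = {
        // local[*] keeps the driver and executors in one JVM, so IDE breakpoints hit everywhere.
        val spark = SparkSession.builder()
          .appName("debug-query-execution")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
        df.createOrReplaceTempView("t")

        // Set breakpoints in e.g. org.apache.spark.sql.execution.QueryExecution
        // before the action below triggers planning and execution.
        spark.sql("SELECT name FROM t WHERE id > 1").show()

        spark.stop()
      }
    }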

Re: is the Lineage of RDD stored as a byte code in memory or a file?

2016-08-24 Thread Daniel Darabos
You are saying the RDD lineage must be serialized, otherwise we could not recreate it after a node failure. This is false. The RDD lineage is not serialized. It is only relevant to the driver application and as such it is just kept in memory in the driver application. If the driver application
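
One way to see that the lineage is just driver-side metadata is RDD.toDebugString, which walks the dependency chain in the driver and prints it without serializing anything; a small sketch:

    import org.apache.spark.sql.SparkSession

    object LineageDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("lineage-demo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        val rdd = sc.parallelize(1 to 100)
          .map(_ * 2)
          .filter(_ % 3 == 0)

        // The chain parallelize -> map -> filter is held as plain objects in the driver;
        // toDebugString just walks those objects.
        println(rdd.toDebugString)

        spark.stop()
      }
    }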

Re: Spark dev-setup

2016-08-24 Thread Steve Loughran
> On 24 Aug 2016, at 11:38, Jacek Laskowski wrote: > > On Wed, Aug 24, 2016 at 11:13 AM, Steve Loughran > wrote: > >> I'd recommend > > ...which I mostly agree to with some exceptions :) > >> -start spark standalone from there > > Why spark

Re: Spark dev-setup

2016-08-24 Thread Jacek Laskowski
On Wed, Aug 24, 2016 at 11:13 AM, Steve Loughran wrote: > I'd recommend ...which I mostly agree to with some exceptions :) > -start spark standalone from there Why spark standalone since the OP asked about "learning how query execution flow occurs in Spark SQL"? How

Re: Spark dev-setup

2016-08-24 Thread Steve Loughran
On 24 Aug 2016, at 07:10, Nishadi Kirielle wrote: Hi, I'm engaged in learning how query execution flow occurs in Spark SQL. In order to understand the query execution flow, I'm attempting to run an example in debug mode with IntelliJ IDEA. It

Re: is the Lineage of RDD stored as a byte code in memory or a file?

2016-08-24 Thread kant kodali
Can you please elaborate a bit more? On Wed, Aug 24, 2016 12:41 AM, Sean Owen so...@cloudera.com wrote: Byte code, no. It's sufficient to store the information that the RDD represents, which can include serialized function closures, but that's not quite storing byte code. On Wed, Aug 24,

Re: is the Lineage of RDD stored as a byte code in memory or a file?

2016-08-24 Thread Sean Owen
Byte code, no. It's sufficient to store the information that the RDD represents, which can include serialized function closures, but that's not quite storing byte code. On Wed, Aug 24, 2016 at 2:00 AM, kant kodali wrote: > Hi Guys, > > I have this question for a very long

Re: Spark dev-setup

2016-08-24 Thread Nishadi Kirielle
Hi, I'm engaged in learning how query execution flow occurs in Spark SQL. In order to understand the query execution flow, I'm attempting to run an example in debug mode with IntelliJ IDEA. It would be great if anyone can help me with debug configurations. Thanks & Regards Nishadi On Tue, Jun