Help required in validating an architecture using Structured Streaming

2016-09-27 Thread Aravindh
Hi, We are building an internal analytics application. Kind of an event store. We have all the basic analytics use cases like filtering, aggregation, segmentation etc. So far our architecture used ElasticSearch extensively but that is not scaling anymore. One unique requirement we have is an event

Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Felix Cheung
+1 on longer release cycle at schedule and more maintenance releases. From: Mark Hamstra Sent: Tuesday, September 27, 2016 2:01 PM Subject: Re: [discuss] Spark 2.x release cadence To: Reynold Xin

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Reynold Xin
So technically the vote has passed, but IMHO it does not make sense to release this and then immediately release 2.0.2. I will work on a new RC once SPARK-17666 and SPARK-17673 are fixed. Please shout if you disagree. On Tue, Sep 27, 2016 at 2:05 PM, Mark Hamstra

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Mark Hamstra
If we're going to cut another RC, then it would be good to get this in as well (assuming that it is merged shortly): https://github.com/apache/spark/pull/15213 It's not a regression, and it shouldn't happen too often, but when failed stages don't get resubmitted it is a fairly significant issue.

Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Mark Hamstra
+1 And I'll dare say that for those with Spark in production, what is more important is that maintenance releases come out in a timely fashion than that new features are released one month sooner or later. On Tue, Sep 27, 2016 at 12:06 PM, Reynold Xin wrote: > We are 2

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Reynold Xin
Actually I'm going to have to -1 the release myself. Sorry for crashing the party, but I saw two super critical issues discovered in the last 2 days: https://issues.apache.org/jira/browse/SPARK-17666 -- this would eventually hang Spark when running against S3 (and many other storage systems)

Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Sean Owen
+1 -- I think the minor releases were taking more like 4 months than 3 months anyway, and it was good for the reasons you give. This reflects reality and is a good thing. All the better if we then can more comfortably really follow the timeline. On Tue, Sep 27, 2016 at 3:06 PM, Reynold Xin

Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Shivaram Venkataraman
+1 I think having a 4 month window instead of a 3 month window sounds good. However I think figuring out a timeline for maintenance releases would also be good. This is a common concern that comes up in many user threads and it'll be better to have some structure around this. It doesn't need to

[discuss] Spark 2.x release cadence

2016-09-27 Thread Reynold Xin
We are 2 months past releasing Spark 2.0.0, an important milestone for the project. Spark 2.0.0 deviated (took 6 months) from the regular release cadence we had for the 1.x line, and we never explicitly discussed what the release cadence should look like for 2.x. Thus this email. During Spark 1.x,

Re: https://issues.apache.org/jira/browse/SPARK-17691

2016-09-27 Thread Herman van Hövell tot Westerflier
Hi Asaf, The current collect_list/collect_set implementations have room for improvement. We did not implement partial aggregation for these, because the idea of a partial aggregation is that we can reduce network traffic (by shipping fewer partially aggregated buffers); this does not really apply
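
The trade-off mentioned above can be illustrated with a small plain-Python sketch (not Spark code). It assumes three hypothetical partitions and compares how many values a per-partition partial aggregate would ship for a sum versus for a collect_list-style aggregate. The helper names and data are made up for illustration.

```python
from typing import List, Union

Buffer = Union[int, List[int]]


def partial_sum_buffers(partitions: List[List[int]]) -> List[int]:
    # A partial sum collapses each partition into a single number.
    return [sum(p) for p in partitions]


def partial_collect_buffers(partitions: List[List[int]]) -> List[List[int]]:
    # A partial collect_list still has to carry every element of the partition.
    return [list(p) for p in partitions]


def shipped_values(buffers: List[Buffer]) -> int:
    # Count how many individual values would cross the network.
    return sum(len(b) if isinstance(b, list) else 1 for b in buffers)


# Hypothetical input: three partitions holding six values in total.
partitions = [[1, 2, 3], [4, 5], [6]]

sum_buffers = partial_sum_buffers(partitions)
list_buffers = partial_collect_buffers(partitions)

sum_shipped = shipped_values(sum_buffers)    # one value per partition
list_shipped = shipped_values(list_buffers)  # every input value
```

In this toy setup the sum ships one value per partition, while the collected lists ship as many values as the input contains, so partial aggregation saves nothing for the latter.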

https://issues.apache.org/jira/browse/SPARK-17691

2016-09-27 Thread assaf.mendelson
Hi, I wanted to try to implement https://issues.apache.org/jira/browse/SPARK-17691. So I started by looking at the implementation of collect_list. My idea was to do the same as it does, but when adding a new element, if there are already more than the threshold, remove one instead. The problem with
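
A minimal sketch of the bounded-collection idea described in this message, written in plain Python rather than against Spark's aggregate API. The function name `collect_list_bounded`, the `threshold` parameter, and the choice to evict the oldest element are assumptions for illustration; the message does not say which element should be removed.

```python
from typing import Any, Iterable, List


def collect_list_bounded(values: Iterable[Any], threshold: int) -> List[Any]:
    """Collect values into a list that never keeps more than `threshold` items."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    buffer: List[Any] = []
    for value in values:
        buffer.append(value)
        if len(buffer) > threshold:
            # Assumption: evict the oldest element. Other policies
            # (drop the newest, drop the smallest) are equally possible.
            buffer.pop(0)
    return buffer


# Example: keep at most the last three of five inputs.
last_three = collect_list_bounded(range(5), threshold=3)
```

With fewer inputs than the threshold the function behaves like an ordinary collect, so the bound only matters once the buffer fills up.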

Re: Should LeafExpression have children final override (like Nondeterministic)?

2016-09-27 Thread Reynold Xin
Yes - same thing with children in UnaryExpression, BinaryExpression. Although I have to say the utility isn't that big here. On Tue, Sep 27, 2016 at 12:53 AM, Jacek Laskowski wrote: > Hi, > > Perhaps nitpicking...you've been warned. > > While reviewing expressions in Catalyst

Should LeafExpression have children final override (like Nondeterministic)?

2016-09-27 Thread Jacek Laskowski
Hi, Perhaps nitpicking...you've been warned. While reviewing expressions in Catalyst I've noticed some inconsistency, i.e. Nondeterministic trait has two methods deterministic and foldable final override while LeafExpression does not have children final (at the very least). My thinking is that

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Suresh Thalamati
+1 (non-binding) -suresh > On Sep 26, 2016, at 11:11 PM, Jagadeesan As wrote: > > +1 (non binding) > > Cheers, > Jagadeesan A S > > From: Jean-Baptiste Onofré > To: dev@spark.apache.org > Date: 27-09-16 11:27 AM > Subject: