Re: [SPARK-24865] Remove AnalysisBarrier

2018-07-19 Thread Reynold Xin
A work-in-progress PR: https://github.com/apache/spark/pull/21822 The PR also adds the infrastructure to throw exceptions in test mode when the various transform methods are used as part of analysis. Unfortunately there are a couple of edge cases that do need that, and as a result there is this ugly

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Saisai Shao
Sure, I can wait for this and create another RC then. Thanks, Saisai. Xiao Li wrote on Fri, Jul 20, 2018 at 9:11 AM: > Yes. https://issues.apache.org/jira/browse/SPARK-24867 is the one I > created. The PR has been created. Since this is not rare, let us merge it > to 2.3.2? > > Reynold's PR is to get rid of

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Xiao Li
Yes. https://issues.apache.org/jira/browse/SPARK-24867 is the one I created. The PR has been created. Since this is not rare, let us merge it to 2.3.2? Reynold's PR is to get rid of AnalysisBarrier. That is better than the multiple patches we added for AnalysisBarrier after the 2.3.0 release. We can

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Saisai Shao
I see, thanks Reynold. Reynold Xin wrote on Fri, Jul 20, 2018 at 8:46 AM: > Looking at the list of pull requests, it looks like this is the ticket: > https://issues.apache.org/jira/browse/SPARK-24867 > > On Thu, Jul 19, 2018 at 5:25 PM Reynold Xin wrote: > >> I don't think my ticket should block this

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Reynold Xin
Looking at the list of pull requests, it looks like this is the ticket: https://issues.apache.org/jira/browse/SPARK-24867 On Thu, Jul 19, 2018 at 5:25 PM Reynold Xin wrote: > I don't think my ticket should block this release. It's a big general > refactoring. > > Xiao, do you have a ticket for

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Reynold Xin
I don't think my ticket should block this release. It's a big general refactoring. Xiao, do you have a ticket for the bug you found? On Thu, Jul 19, 2018 at 5:24 PM Saisai Shao wrote: > Hi Xiao, > > Are you referring to this JIRA ( > https://issues.apache.org/jira/browse/SPARK-24865)? > > Xiao

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Saisai Shao
Hi Xiao, Are you referring to this JIRA ( https://issues.apache.org/jira/browse/SPARK-24865)? Xiao Li wrote on Fri, Jul 20, 2018 at 2:41 AM: > dfWithUDF.cache() > dfWithUDF.write.saveAsTable("t") > dfWithUDF.write.saveAsTable("t1") > > > Cached data is not being used. It causes a big performance regression.

[SPARK-24865] Remove AnalysisBarrier

2018-07-19 Thread Reynold Xin
We have had multiple bugs introduced by AnalysisBarrier. In hindsight I think the original design before the analysis barrier was much simpler and required less developer knowledge of the infrastructure. As long as the analysis barrier is there, developers writing various code in the analyzer will have to be

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Mark Hamstra
Yeah, I was mostly thinking that, if the normal Spark PR tests were set up to check the sigs (every time? some of the time?), then this could serve as an automatic check that nothing funny has been done to the archives. There shouldn't be any difference between the cache and the archive; but if

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Sean Owen
Yeah, if the test code keeps around the archive and/or a digest of what it unpacked. A release should never be modified, though, so this would be highly rare. If the worry is hacked mirrors, then we might have bigger problems, but there the issue is verifying the download sigs in the first place. Those would have

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Mark Hamstra
Is there, or should there be, some checking of digests just to make sure that we are really testing against the same thing in /tmp/test-spark that we are distributing from the archive? On Thu, Jul 19, 2018 at 11:15 AM Sean Owen wrote: > Ideally, that list is updated with each release, yes.
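The digest check being discussed could look roughly like the following: a minimal, self-contained Python sketch under assumed paths and file names (the actual suite's cache layout in /tmp/test-spark may differ, and the file names here are hypothetical).

```python
# Sketch: before reusing a locally cached Spark release archive, compare its
# SHA-512 against the digest recorded when it was first downloaded from
# archive.apache.org. Re-download on any mismatch.
import hashlib
import pathlib
import tempfile

def sha512_of(path: pathlib.Path) -> str:
    # Compute the hex SHA-512 digest of a file's contents.
    return hashlib.sha512(path.read_bytes()).hexdigest()

def cache_is_intact(archive: pathlib.Path, digest_file: pathlib.Path) -> bool:
    # Treat a missing archive or missing digest as "must re-download".
    if not archive.exists() or not digest_file.exists():
        return False
    return digest_file.read_text().strip() == sha512_of(archive)

# Self-contained demo with a stand-in archive (names are illustrative):
cache = pathlib.Path(tempfile.mkdtemp())
archive = cache / "spark-2.3.1-bin-hadoop2.7.tgz"
digest = cache / "spark-2.3.1-bin-hadoop2.7.tgz.sha512"
archive.write_bytes(b"pretend-release-bytes")
digest.write_text(sha512_of(archive))

print(cache_is_intact(archive, digest))   # True: safe to reuse the cache
archive.write_bytes(b"tampered")
print(cache_is_intact(archive, digest))   # False: re-fetch from the archive
```

Note this only guards the local cache's integrity; verifying that the original download matches the signed release (the KEYS/signature check Sean mentions) is a separate step.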

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Xiao Li
dfWithUDF.cache() dfWithUDF.write.saveAsTable("t") dfWithUDF.write.saveAsTable("t1") Cached data is not being used. It causes a big performance regression. 2018-07-19 11:32 GMT-07:00 Sean Owen: > What regression are you referring to here? A -1 vote really needs a > rationale. > > On Thu,

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Sean Owen
What regression are you referring to here? A -1 vote really needs a rationale. On Thu, Jul 19, 2018 at 1:27 PM Xiao Li wrote: > I would first vote -1. > > I might find another regression caused by the analysis barrier. Will keep > you posted.

Re: [VOTE] SPARK 2.3.2 (RC3)

2018-07-19 Thread Xiao Li
I would first vote -1. I might find another regression caused by the analysis barrier. Will keep you posted. Xiao 2018-07-18 18:05 GMT-07:00 Takeshi Yamamuro: > +1 (non-binding) > > I ran tests on an EC2 m4.2xlarge instance; > [ec2-user]$ java -version > openjdk version "1.8.0_171" > OpenJDK

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Sean Owen
Ideally, that list is updated with each release, yes. Non-current releases will now always download from archive.apache.org, though. But we run into rate-limiting problems if that gets pinged too much. So yes, it's good to keep the list only to current branches. It looks like the download is cached in

Compute /Storage Calculation

2018-07-19 Thread Deepu Raj
Hi Team - Is there any good calculator/Excel sheet to estimate compute and storage requirements for new Spark jobs to be developed? Capacity planning based on: job, data type, etc. Thanks, Deepu Raj

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Felix Cheung
+1, this has been problematic. Also, this list needs to be updated every time we make a new release? Plus, can we cache them on Jenkins? Maybe we can avoid downloading the same thing from the Apache archive every test run. From: Marco Gaido Sent: Monday, July 16,

[STRUCTURED STREAM] Join static dataset in state function (flatMapGroupsWithState)

2018-07-19 Thread Christiaan Ras
I use the state function flatMapGroupsWithState to track state of a Kafka stream. To further customize the state function, I'd like to use a static datasource (JDBC) in the state function. This datasource contains data I'd like to join with the stream (as an Iterator) within flatMapGroupsWithState.
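One common pattern for this kind of enrichment is to load the static JDBC data once up front into an in-memory lookup, then consult it inside the state function instead of joining per record. Below is a Spark-free Python sketch of that pattern only; the real job would use Dataset.groupByKey plus flatMapGroupsWithState in Scala, and every name here (the lookup contents, event shape, update_state) is illustrative.

```python
# Sketch: enrich per-group streaming events with static reference data while
# maintaining running state per key, mimicking flatMapGroupsWithState's shape
# (key, iterator of events, mutable state) without any Spark dependency.
from typing import Dict, Iterator, List, Tuple

# Stand-in for the static JDBC source, loaded once before processing starts.
STATIC_LOOKUP: Dict[str, str] = {"sensor-1": "building-A", "sensor-2": "building-B"}

def update_state(key: str,
                 events: Iterator[dict],
                 state: Dict[str, int]) -> List[Tuple[str, str, int]]:
    # For each event: bump the per-key count, then "join" against the static
    # lookup by key and emit an enriched record.
    out = []
    for _event in events:
        state[key] = state.get(key, 0) + 1
        location = STATIC_LOOKUP.get(key, "unknown")
        out.append((key, location, state[key]))
    return out

state: Dict[str, int] = {}
batch = [("sensor-1", {"v": 1}), ("sensor-1", {"v": 2}), ("sensor-3", {"v": 9})]
results = []
for k, ev in batch:
    results.extend(update_state(k, iter([ev]), state))
print(results)
# [('sensor-1', 'building-A', 1), ('sensor-1', 'building-A', 2), ('sensor-3', 'unknown', 1)]
```

In an actual Spark job the analogous choice is whether to collect/broadcast the JDBC table so the state function can read it locally, or to join the stream with the static DataFrame before grouping; doing JDBC reads inside the state function itself is usually the costliest option.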