Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
In addition to logical plans, we need SQL support. That requires resolving v2 tables from a catalog and a few other changes like separating v1 plans from SQL parsing (see the earlier dev list thread). I’d also like to add DDL operations for v2. I think it also makes sense to add a new DF write

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Matt Cheah
To evaluate the amount of work required to get Data Source V2 into Spark 3.0, we should have a list of all the specific SPIPs and patches that are pending that would constitute a successful and usable revamp of that API. Here are the ones I could find and know off the top of my head: Table

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
I'm all for making releases more often if we want. But this work could really use a target release to motivate getting it done. If we agree that it will block a release, then everyone is motivated to review and get the PRs in. If this work doesn't make it in the 3.0 release, I'm not confident

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Matei Zaharia
How large would the delay be? My 2 cents are that there’s nothing stopping us from making feature releases more often if we want to, so we shouldn’t see this as an “either delay 3.0 or release in >6 months” decision. If the work is likely to get in with a small delay and simplifies our work

[DISCUSS] Spark 3.0 and DataSourceV2

2019-02-21 Thread Ryan Blue
Hi everyone, In the DSv2 sync last night, we had a discussion about roadmap and what the goal should be for getting the main features into Spark. We all agreed that 3.0 should be that goal, even if it means delaying the 3.0 release. The possibility of delaying the 3.0 release may be

DataSourceV2 sync notes - 20 Feb 2019

2019-02-21 Thread Ryan Blue
Here are my notes from the DSv2 sync last night. As always, if you have corrections, please reply with them. And if you’d like to be included on the invite to participate in the next sync (6 March), send me an email. Here’s a quick summary of the topics where we had consensus last night: -

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-21 Thread DB Tsai
I am cutting a new rc4 with the fix from Felix. Thanks. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0359BC9965359766 On Thu, Feb 21, 2019 at 8:57 AM Felix Cheung wrote: > I merged the fix to 2.4.

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-21 Thread Felix Cheung
I merged the fix to 2.4. From: Felix Cheung Sent: Wednesday, February 20, 2019 9:34 PM To: DB Tsai; Spark dev list Cc: Cesar Delgado Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2) Could you hold for a bit - I have one more fix to get in

Re: Thoughts on dataframe cogroup?

2019-02-21 Thread Li Jin
I am wondering whether other people have opinions or use cases for cogroup. On Wed, Feb 20, 2019 at 5:03 PM Li Jin wrote: > Alessandro, > > Thanks for the reply. I assume by "equi-join", you mean "equality full > outer join". > > Two issues I see with equi outer join are: > (1) equi outer join will

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-21 Thread Sean Owen
That looks like a change to restore some behavior that was removed in 2.2. It's not directly relevant to a release vote on 2.4.1. See the existing discussion at https://github.com/apache/spark/pull/22144#issuecomment-432258536 It may indeed be a good thing to change but just continue the

Re: [VOTE] Release Apache Spark 2.4.1 (RC2)

2019-02-21 Thread Parth Gandhi
Hello, In https://issues.apache.org/jira/browse/SPARK-24935, I am getting requests from people who were hoping the fix would be merged into Spark 2.4.1. The relevant PR is here: https://github.com/apache/spark/pull/23778. I do not mind if we do not merge it for 2.4.1 and I do not

Re: [VOTE] SPIP: Identifiers for multi-catalog Spark

2019-02-21 Thread Xiao Li
+1. This is a step in the right direction. The resolution rules and catalog APIs need more discussion when we implement it. At the current stage, we can disallow runtime creation of catalogs, since allowing it would complicate name resolution in a multi-session environment. For example, when one user