To evaluate the amount of work required to get Data Source V2 into Spark 3.0, 
we should have a list of all the specific SPIPs and patches that are pending 
that would constitute a successful and usable revamp of that API. Here are the 
ones I could find and know off the top of my head:
1. Table Catalog API: https://issues.apache.org/jira/browse/SPARK-24252
   In my opinion this is by far the most important API to get in, but it’s also 
   the one that most needs thorough thought and evaluation.
2. Remaining logical plans for CTAS, RTAS, DROP / DELETE, OVERWRITE: 
   https://issues.apache.org/jira/browse/SPARK-24923 + 
   https://issues.apache.org/jira/browse/SPARK-24253
3. Catalogs for other entities, such as functions. Pluggable system for loading 
   these.
4. Multi-Catalog support: https://issues.apache.org/jira/browse/SPARK-25006
5. Migration of existing sources to V2, particularly file sources like Parquet 
   and ORC; requires #1 as discussed in yesterday’s meeting
 

Can someone add to this list if we’re missing anything? It might also make 
sense to either assign a JIRA label or update the JIRA umbrella issues, if any; 
whatever mechanism lets us find all of these outstanding issues in one place.

 

My understanding is that #1 is the most critical feature we need, and the 
feature that will go a long way towards allowing everything else to fall into 
place. #2 is also critical for external implementations of Data Source V2. I 
think we can afford to defer #3 through #5 to a future point release. But #1 
and #2 are also the features that have remained open the longest, and we really 
need to move forward on them. Setting 3.0 as the target release will help in 
that regard.

 

-Matt Cheah

 

From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Thursday, February 21, 2019 at 2:22 PM
To: Matei Zaharia <matei.zaha...@gmail.com>
Cc: Spark Dev List <dev@spark.apache.org>
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

 

I'm all for making releases more often if we want. But this work could really 
use a target release to motivate getting it done. If we agree that it will 
block a release, then everyone is motivated to review and get the PRs in. 

 

If this work doesn't make it in the 3.0 release, I'm not confident that it will 
get done. Maybe we can have a release shortly after, but the timeline for these 
features -- that many of us need -- is nearly creeping into years. That's when 
alternatives start looking more likely to deliver. I'd rather see this work get 
in so we don't have to consider those alternatives, which is why I think this 
commitment is a good idea.

 

I also would like to see multi-catalog support, but that is more reasonable to 
put off for a follow-up feature release, maybe 3.1.

 

On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

How large would the delay be? My 2 cents are that there’s nothing stopping us 
from making feature releases more often if we want to, so we shouldn’t see this 
as an “either delay 3.0 or release in >6 months” decision. If the work is 
likely to get in with a small delay and simplifies our work after 3.0 (e.g. we 
can get rid of older APIs), then the delay may be worth it. But if it would be 
a large delay, we should also weigh it against other things that are going to 
get delayed if 3.0 moves much later.

It might also be better to propose a specific date to delay until, so people 
can still plan around when the release branch will likely be cut.

Matei

> On Feb 21, 2019, at 1:03 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> Hi everyone,
> 
> In the DSv2 sync last night, we had a discussion about roadmap and what the 
> goal should be for getting the main features into Spark. We all agreed that 
> 3.0 should be that goal, even if it means delaying the 3.0 release.
> 
> The possibility of delaying the 3.0 release may be controversial, so I want 
> to bring it up to the dev list to build consensus around it. The rationale 
> for this is partly that much of this work has been outstanding for more than 
> a year now. If it doesn't make it into 3.0, then it would be another 6 months 
> before it would be in a release, and would be nearing 2 years to get the work 
> done.
> 
> Are there any objections to targeting 3.0 for this?
> 
> In addition, much of the planning for multi-catalog support has been done to 
> make v2 possible. Do we also want to include multi-catalog support?
> 
> 
> rb
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


 

-- 

Ryan Blue 

Software Engineer

Netflix
