To evaluate the amount of work required to get Data Source V2 into Spark 3.0, we should have a list of all the pending SPIPs and patches that would constitute a successful and usable revamp of that API. Here are the ones I could find and know off the top of my head:

1. Table Catalog API: https://issues.apache.org/jira/browse/SPARK-24252 - In my opinion this is by far the most important API to get in, but it is also the one that needs the most thorough thought and evaluation.
2. Remaining logical plans for CTAS, RTAS, DROP / DELETE, OVERWRITE: https://issues.apache.org/jira/browse/SPARK-24923 + https://issues.apache.org/jira/browse/SPARK-24253
3. Catalogs for other entities, such as functions, and a pluggable system for loading these.
4. Multi-catalog support: https://issues.apache.org/jira/browse/SPARK-25006
5. Migration of existing sources to V2, particularly file sources like Parquet and ORC (requires #1, as discussed in yesterday's meeting).
Can someone add to this list if we're missing anything? It might also make sense to either assign a JIRA label or to update the JIRA umbrella issues, if any exist. Whatever mechanism works for finding all of these outstanding issues in one place.

My understanding is that #1 is the most critical feature we need, and the one that will go a long way toward allowing everything else to fall into place. #2 is also critical for external implementations of Data Source V2. I think we can afford to defer 3-5 to a future point release. But #1 and #2 are also the features that have remained open the longest, and we really need to move forward on them. Setting a target release of 3.0 will help in that regard.

-Matt Cheah

From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: "rb...@netflix.com" <rb...@netflix.com>
Date: Thursday, February 21, 2019 at 2:22 PM
To: Matei Zaharia <matei.zaha...@gmail.com>
Cc: Spark Dev List <dev@spark.apache.org>
Subject: Re: [DISCUSS] Spark 3.0 and DataSourceV2

I'm all for making releases more often if we want. But this work could really use a target release to motivate getting it done. If we agree that it will block a release, then everyone is motivated to review and get the PRs in.

If this work doesn't make it into the 3.0 release, I'm not confident that it will get done. Maybe we can have a release shortly after, but the timeline for these features -- which many of us need -- is creeping toward years. That's when alternatives start looking more likely to deliver. I'd rather see this work get in so we don't have to consider those alternatives, which is why I think this commitment is a good idea.

I also would like to see multi-catalog support, but that is more reasonable to put off for a follow-up feature release, maybe 3.1.

On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

How large would the delay be?
My 2 cents are that there's nothing stopping us from making feature releases more often if we want to, so we shouldn't see this as an "either delay 3.0 or release in >6 months" decision. If the work is likely to get in with a small delay and simplifies our work after 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it. But if it would be a large delay, we should also weigh it against other things that are going to get delayed if 3.0 moves much later. It might also be better to propose a specific date to delay until, so people can still plan around when the release branch will likely be cut.

Matei

> On Feb 21, 2019, at 1:03 PM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> Hi everyone,
>
> In the DSv2 sync last night, we had a discussion about roadmap and what the
> goal should be for getting the main features into Spark. We all agreed that
> 3.0 should be that goal, even if it means delaying the 3.0 release.
>
> The possibility of delaying the 3.0 release may be controversial, so I want
> to bring it up to the dev list to build consensus around it. The rationale
> for this is partly that much of this work has been outstanding for more than
> a year now. If it doesn't make it into 3.0, then it would be another 6 months
> before it would be in a release, and would be nearing 2 years to get the work
> done.
>
> Are there any objections to targeting 3.0 for this?
>
> In addition, much of the planning for multi-catalog support has been done to
> make v2 possible. Do we also want to include multi-catalog support?
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix