In addition to logical plans, we need SQL support. That requires resolving
v2 tables from a catalog, along with a few other changes, like separating
v1 plans from SQL parsing (see the earlier dev list thread). I’d also like
to add DDL operations for v2.
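
To make the catalog piece concrete, here is a rough sketch of the shape
such a table catalog plugin might take. All names and signatures below are
illustrative, not the final SPARK-24252 proposal, and Identifier and Table
are stand-in types:

    import org.apache.spark.sql.types.StructType

    // Stand-in types so the sketch is self-contained; the real proposal
    // defines richer versions of these.
    case class Identifier(namespace: Array[String], name: String)
    trait Table {
      def name: String
      def schema: StructType
      def properties: java.util.Map[String, String]
    }

    // Rough shape of a catalog plugin that Spark would load by name and
    // use both to resolve v2 tables during analysis and to run DDL.
    trait TableCatalog {
      def loadTable(ident: Identifier): Table
      def createTable(
          ident: Identifier,
          schema: StructType,
          properties: java.util.Map[String, String]): Table
      def dropTable(ident: Identifier): Boolean
    }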

I think it also makes sense to add a new DataFrame write API, as we
discussed in the sync. That way, users have an API to migrate to that
always uses the v2 plans and behavior.
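
For illustration, such an API might look something like this. The writeTo
entry point, the method names, and the "testcat" catalog are all
hypothetical sketches, not a settled design:

    import org.apache.spark.sql.SparkSession

    // A minimal sketch, assuming a v2 catalog plugin registered as
    // "testcat"; the implementation class name is a placeholder.
    val spark = SparkSession.builder()
      .appName("v2-write-sketch")
      .config("spark.sql.catalog.testcat", "com.example.MyTableCatalog")
      .getOrCreate()

    val df = spark.range(100).withColumnRenamed("id", "event_id")

    // Each write intent is an explicit method instead of an ambiguous
    // SaveMode, and every call goes through the v2 plans:
    df.writeTo("testcat.db.events").create()   // like CTAS
    df.writeTo("testcat.db.events").append()   // like INSERT INTO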

Here are all the commands that we have implemented on top of the proposed
table catalog API; a short usage sketch follows the list. We should be able
to get these working in upstream Spark fairly quickly.

   - CREATE TABLE [IF NOT EXISTS] …
   - CREATE TABLE … PARTITIONED BY …
   - CREATE TABLE … AS SELECT …
   - CREATE TABLE LIKE
   - ALTER TABLE …
      - ADD COLUMNS …
      - DROP COLUMNS …
      - ALTER COLUMN … TYPE
      - ALTER COLUMN … COMMENT
      - RENAME COLUMN … TO …
      - SET TBLPROPERTIES …
      - UNSET TBLPROPERTIES …
   - ALTER TABLE … RENAME TO …
   - DROP TABLE [IF EXISTS] …
   - DESCRIBE [FORMATTED|EXTENDED] …
   - SHOW CREATE TABLE …
   - SHOW TBLPROPERTIES …
   - REFRESH TABLE …
   - INSERT INTO …
   - INSERT OVERWRITE …
   - DELETE FROM … WHERE …
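
For example, with a v2 catalog configured as in the sketch above, a few of
these would look like the following. The catalog, table, and source names
are made up for illustration:

    // Create a partitioned v2 table in the configured catalog.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS testcat.db.events (
        event_id BIGINT,
        event_time TIMESTAMP,
        level STRING)
      USING parquet
      PARTITIONED BY (level)
    """)

    // DDL and DML that resolve through the same catalog plugin.
    spark.sql("ALTER TABLE testcat.db.events ADD COLUMNS (message STRING)")
    spark.sql("INSERT INTO testcat.db.events SELECT * FROM staging_events")
    spark.sql("DELETE FROM testcat.db.events WHERE level = 'DEBUG'")
    spark.sql("DROP TABLE IF EXISTS testcat.db.events")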


On Thu, Feb 21, 2019 at 3:57 PM Matt Cheah <mch...@palantir.com> wrote:

> To evaluate the amount of work required to get Data Source V2 into Spark
> 3.0, we should have a list of all the specific SPIPs and patches that are
> pending that would constitute a successful and usable revamp of that API.
> Here are the ones I could find and know off the top of my head:
>
>    1. Table Catalog API: https://issues.apache.org/jira/browse/SPARK-24252
>       1. In my opinion this is by far the most important API to get in,
>       but it’s also the API that most needs thorough thought and
>       evaluation.
>    2. Remaining logical plans for CTAS, RTAS, DROP / DELETE, OVERWRITE:
>    https://issues.apache.org/jira/browse/SPARK-24923 +
>    https://issues.apache.org/jira/browse/SPARK-24253
>    3. Catalogs for other entities, such as functions. Pluggable system
>    for loading these.
>    4. Multi-Catalog support -
>    https://issues.apache.org/jira/browse/SPARK-25006
>    5. Migration of existing sources to V2, particularly file sources like
>    Parquet and ORC – requires #1 as discussed in yesterday’s meeting
>
>
>
> Can someone add to this list if we’re missing anything? It might also make
> sense to either assign a JIRA label or to update the JIRA umbrella issues,
> if any. Whatever mechanism works for finding all of these outstanding
> issues in one place is fine.
>
>
>
> My understanding is that #1 is the most critical feature we need, and the
> feature that will go a long way towards allowing everything else to fall
> into place. #2 is also critical for external implementations of Data Source
> V2. I think we can afford to defer 3-5 to a future point release. But #1
> and #2 are also the features that have remained open for the longest time
> and we really need to move forward on these. Setting 3.0 as the target
> release will help in that regard.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Ryan Blue <rb...@netflix.com.INVALID>
> *Reply-To: *"rb...@netflix.com" <rb...@netflix.com>
> *Date: *Thursday, February 21, 2019 at 2:22 PM
> *To: *Matei Zaharia <matei.zaha...@gmail.com>
> *Cc: *Spark Dev List <dev@spark.apache.org>
> *Subject: *Re: [DISCUSS] Spark 3.0 and DataSourceV2
>
>
>
> I'm all for making releases more often if we want. But this work could
> really use a target release to motivate getting it done. If we agree that
> it will block a release, then everyone is motivated to review and get the
> PRs in.
>
>
>
> If this work doesn't make it in the 3.0 release, I'm not confident that it
> will get done. Maybe we can have a release shortly after, but the timeline
> for these features -- which many of us need -- is creeping toward years.
> That's when alternatives start looking more likely to deliver. I'd
> rather see this work get in so we don't have to consider those
> alternatives, which is why I think this commitment is a good idea.
>
>
>
> I also would like to see multi-catalog support, but that is more
> reasonable to put off for a follow-up feature release, maybe 3.1.
>
>
>
> On Thu, Feb 21, 2019 at 1:45 PM Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
> How large would the delay be? My 2 cents are that there’s nothing stopping
> us from making feature releases more often if we want to, so we shouldn’t
> see this as an “either delay 3.0 or release in >6 months” decision. If the
> work is likely to get in with a small delay and simplifies our work after
> 3.0 (e.g. we can get rid of older APIs), then the delay may be worth it.
> But if it would be a large delay, we should also weigh it against other
> things that are going to get delayed if 3.0 moves much later.
>
> It might also be better to propose a specific date to delay until, so
> people can still plan around when the release branch will likely be cut.
>
> Matei
>
> > On Feb 21, 2019, at 1:03 PM, Ryan Blue <rb...@netflix.com.INVALID>
> > wrote:
> >
> > Hi everyone,
> >
> > In the DSv2 sync last night, we had a discussion about the roadmap and
> > what the goal should be for getting the main features into Spark. We all
> > agreed that 3.0 should be that goal, even if it means delaying the 3.0
> > release.
> >
> > The possibility of delaying the 3.0 release may be controversial, so I
> > want to bring it up to the dev list to build consensus around it. The
> > rationale for this is partly that much of this work has been outstanding
> > for more than a year now. If it doesn't make it into 3.0, then it would
> > be another 6 months before it would be in a release, and it would be
> > nearing 2 years to get the work done.
> >
> > Are there any objections to targeting 3.0 for this?
> >
> > In addition, much of the planning for multi-catalog support has been
> > done to make v2 possible. Do we also want to include multi-catalog
> > support?
> >
> >
> > rb
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix
