That is mostly correct. V2 standardizes the behavior of logical operations like CTAS across data sources, so it is not compatible with v1 behavior. Consequently, we can't simply swap v1 out for v2; we have to maintain both in parallel and eventually deprecate v1.
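To make the CTAS point concrete, here is a rough sketch of the difference. The v2 write calls shown follow the SPIP proposal; the method and catalog names are illustrative, not a finalized API.

    // Sketch only: contrasts v1 write semantics (source-dependent) with the
    // explicit create/replace semantics proposed for v2. The v2 method names
    // (writeTo, using, createOrReplace) follow the SPIP and are illustrative.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("dsv2-ctas-sketch").getOrCreate()
    val df = spark.range(10).toDF("id")

    // v1: what Overwrite means for an existing table is up to the source
    // (drop and recreate, truncate, or error), so CTAS behavior is not portable.
    df.write.format("parquet").mode(SaveMode.Overwrite).saveAsTable("db.events")

    // Proposed v2: the logical operation (create vs. replace vs. append) is
    // explicit, so every source must implement the same CTAS/RTAS semantics.
    df.writeTo("catalog.db.events").using("parquet").createOrReplace()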
We are aiming to have a working v2 in Spark 3.0, but the community has not committed to this goal, and support may be incomplete.

rb

On Sat, Apr 27, 2019 at 7:13 AM Jean Georges Perrin <j...@jgp.net> wrote:

> This may be completely inappropriate, and I apologize if it is;
> nevertheless, I am trying to get some clarification about the current
> status of DS.
>
> Please tell me where I am wrong:
>
> Currently, the stable API is v1.
> There is a v2 DS API, but it is not widely used.
> The group is working on a “new” v2 API that will be available after the
> release of Spark v3.
>
> jg
>
> --
> Jean Georges Perrin
> j...@jgp.net
>
>
> On Apr 19, 2019, at 10:10, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> Here are my notes from the last DSv2 sync. As always:
>
> - If you’d like to attend the sync, send me an email and I’ll add you
>   to the invite. Everyone is welcome.
> - These notes are what I wrote down and remember. If you have
>   corrections or comments, please reply.
>
> *Topics*:
>
> - TableCatalog PR #24246: https://github.com/apache/spark/pull/24246
> - Remove SaveMode PR #24233: https://github.com/apache/spark/pull/24233
> - Streaming capabilities PR #24129: https://github.com/apache/spark/pull/24129
>
> *Attendees*:
>
> Ryan Blue
> John Zhuge
> Matt Cheah
> Yifei Huang
> Bruce Robbins
> Jamison Bennett
> Russell Spitzer
> Wenchen Fan
> Yuanjian Li
>
> (and others who arrived after the start)
>
> *Discussion*:
>
> - TableCatalog PR: https://github.com/apache/spark/pull/24246
>   - Wenchen and Matt had just reviewed the PR. It is mostly what was in
>     the SPIP, so there was not much discussion of the content.
>   - Wenchen: It would be easier to review if the changes that move Table
>     and TableCapability were in a separate PR (mostly import changes).
>   - Ryan will open a separate PR for the move. [Ed: #24410]
>   - Russell: How should caching work? He has hit lots of problems with
>     Spark caching data and getting out of date.
>   - Ryan: Spark should always call into the catalog and not cache, to
>     avoid those problems. However, Spark should ensure that it uses the
>     same instance of a Table for all scans in the same query, for
>     consistent self-joins.
>   - Some discussion of self-joins. The conclusion was that we don’t need
>     to worry about this yet because it is unlikely.
>   - Wenchen: Should this include the namespace methods?
>   - Ryan: No, those are a separate concern and can be added in a
>     parallel PR.
> - Remove SaveMode PR: https://github.com/apache/spark/pull/24233
>   - Wenchen: The PR is on hold waiting for the streaming capabilities
>     PR, #24129, because the Noop sink doesn’t validate schema.
>   - Wenchen will open a PR to add a capability to opt out of schema
>     validation, then come back to this PR.
> - Streaming capabilities PR: https://github.com/apache/spark/pull/24129
>   - Ryan: This PR needs validation in the analyzer. The analyzer is
>     where validations should exist; otherwise validations must be copied
>     into every code path that produces a streaming plan.
>   - Wenchen: The write check can’t be written yet because the write node
>     is never passed to the analyzer. Fixing that is a larger problem.
>   - Ryan: Agreed that refactoring to pass the write node to the analyzer
>     should be a separate change.
>   - Wenchen: A check that either microbatch or continuous can be used is
>     hard to write because some sources may fall back to v1.
>   - Ryan: By the time this check runs, fallback has already happened. Do
>     v1 sources support continuous mode?
>   - Wenchen: No, v1 doesn’t support continuous.
>   - Ryan: Then the check can assume that v1 sources only support
>     microbatch mode.
>   - Wenchen will add this check.
>   - Wenchen: The check that tables in a v2 streaming relation support
>     either microbatch or continuous won’t catch anything and is
>     unnecessary.
>   - Ryan: These checks still need to be in the analyzer so that future
>     uses do not break. We had the same problem moving to v2: because
>     schema checks were specific to DataSource code paths, they were
>     overlooked when adding v2. Running validations in the analyzer
>     avoids problems like this.
>   - Wenchen will add the validation.
>   - Matt: Will v2 be ready in time for the 3.0 release?
>   - Ryan: Once #24246 is in, we can work on PRs in parallel, but it is
>     not looking good.
>
> --
> Ryan Blue
> Software Engineer
> Netflix


--
Ryan Blue
Software Engineer
Netflix