This may be completely inappropriate, and I apologize if it is; nevertheless, I 
am trying to get some clarification about the current status of the Data Source (DS) API.

Please tell me where I am wrong:

Currently, the stable API is v1.
There is a v2 DS API, but it is not widely used.
The group is working on a “new” v2 API that will be available after the release 
of Spark v3.

jg

--
Jean Georges Perrin
j...@jgp.net



> On Apr 19, 2019, at 10:10, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> Here are my notes from the last DSv2 sync. As always:
> 
> If you’d like to attend the sync, send me an email and I’ll add you to the 
> invite. Everyone is welcome.
> These notes are what I wrote down and remember. If you have corrections or 
> comments, please reply.
> Topics:
> 
> TableCatalog PR #24246: https://github.com/apache/spark/pull/24246
> Remove SaveMode PR #24233: https://github.com/apache/spark/pull/24233
> Streaming capabilities PR #24129: https://github.com/apache/spark/pull/24129
> Attendees:
> 
> Ryan Blue
> John Zhuge
> Matt Cheah
> Yifei Huang
> Bruce Robbins
> Jamison Bennett
> Russell Spitzer
> Wenchen Fan
> Yuanjian Li
> 
> (and others who arrived after the start)
> 
> Discussion:
> 
> TableCatalog PR: https://github.com/apache/spark/pull/24246
> Wenchen and Matt had just reviewed the PR. It was mostly what was in the SPIP, 
> so there was not much discussion of the content.
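> For reference, the catalog interface proposed in the SPIP looks roughly like 
> the sketch below (import paths follow where these classes landed in Spark 3.x; 
> the exact names and signatures in the PR under review may differ):
>
>     import java.util
>     import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableChange}
>     import org.apache.spark.sql.connector.expressions.Transform
>     import org.apache.spark.sql.types.StructType
>
>     // Core metadata operations for a table catalog. This is a sketch of
>     // the proposed surface, not the actual interface under review.
>     trait TableCatalogSketch {
>       def loadTable(ident: Identifier): Table
>       def createTable(
>           ident: Identifier,
>           schema: StructType,
>           partitions: Array[Transform],
>           properties: util.Map[String, String]): Table
>       def alterTable(ident: Identifier, changes: TableChange*): Table
>       def dropTable(ident: Identifier): Boolean
>     }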
> Wenchen: It would be easier to review if the changes that move Table and 
> TableCapability were in a separate PR (they are mostly import changes)
> Ryan will open a separate PR for the move [Ed: #24410]
> Russell: How should caching work? He has hit lots of problems with Spark 
> caching data that then gets out of date
> Ryan: Spark should always call into the catalog rather than cache, to avoid 
> those problems. However, Spark should ensure that it uses the same instance of 
> a Table for all scans in the same query, for consistent self-joins (see the 
> sketch below).
> There was some discussion of self-joins; the conclusion was that we don’t need 
> to worry about this yet because the case is unlikely.
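> A minimal sketch of that resolution rule (illustrative only; resolveTables is 
> a hypothetical helper, and import paths follow Spark 3.x):
>
>     import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}
>
>     object TableResolution {
>       // Always ask the catalog for metadata (no cross-query cache, so
>       // nothing goes stale), but reuse one Table instance for every scan
>       // in a single query so a self-join reads one consistent snapshot.
>       def resolveTables(catalog: TableCatalog, idents: Seq[Identifier]): Map[Identifier, Table] =
>         idents.distinct.map(ident => ident -> catalog.loadTable(ident)).toMap
>     }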
> Wenchen: should this include the namespace methods?
> Ryan: No, those are a separate concern and can be added in a parallel PR.
> Remove SaveMode PR: https://github.com/apache/spark/pull/24233
> Wenchen: The PR is on hold waiting for streaming capabilities, #24129, because 
> the Noop sink doesn’t validate schema
> Wenchen will open a PR to add a capability to opt out of schema validation 
> (sketched below), then come back to this PR.
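> A sketch of how such an opt-out capability could look (ACCEPT_ANY_SCHEMA is an 
> assumed name here, matching the capability Spark eventually added; import 
> paths follow Spark 3.x):
>
>     import java.util
>     import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
>     import org.apache.spark.sql.types.StructType
>
>     // A discard-everything sink has no fixed schema, so it advertises
>     // that it accepts any input schema; the analyzer can then skip
>     // schema validation for writes to it.
>     class NoopTable extends Table {
>       override def name(): String = "noop"
>       override def schema(): StructType = new StructType()  // no fixed schema
>       override def capabilities(): util.Set[TableCapability] =
>         util.EnumSet.of(TableCapability.STREAMING_WRITE, TableCapability.ACCEPT_ANY_SCHEMA)
>     }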
> Streaming capabilities PR: https://github.com/apache/spark/pull/24129
> Ryan: This PR needs validation in the analyzer. The analyzer is where 
> validations should exist, or else validations must be copied into every code 
> path that produces a streaming plan.
> Wenchen: the write check can’t be written because the write node is never 
> passed to the analyzer. Fixing that is a larger problem.
> Ryan: Agree that refactoring to pass the write node to the analyzer should be 
> separate.
> Wenchen: a check to ensure that either microbatch or continuous can be used is 
> hard to write because some sources may fall back to v1
> Ryan: By the time this check runs, fallback has happened. Do v1 sources 
> support continuous mode?
> Wenchen: No, v1 doesn’t support continuous
> Ryan: Then this can be written to assume that v1 sources only support 
> microbatch mode.
> Wenchen will add this check (a sketch follows below).
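> A sketch of that check (checkStreamingRead is a hypothetical helper; the 
> capability names follow Spark 3.x’s TableCapability):
>
>     import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
>
>     object StreamingChecks {
>       // Runs after any v1 fallback has already happened: every remaining
>       // v2 streaming table must support at least one streaming mode.
>       def checkStreamingRead(table: Table): Unit = {
>         val caps = table.capabilities()
>         require(
>           caps.contains(TableCapability.MICRO_BATCH_READ) ||
>             caps.contains(TableCapability.CONTINUOUS_READ),
>           s"Table ${table.name()} supports neither microbatch nor continuous reads")
>       }
>     }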
> Wenchen: the check that tables in a v2 streaming relation support either 
> microbatch or continuous won’t catch anything and is unnecessary
> Ryan: These checks still need to be in the analyzer so future uses do not 
> break. We had the same problem moving to v2: because schema checks were 
> specific to DataSource code paths, they were overlooked when adding v2. 
> Running validations in the analyzer avoids problems like this.
> Wenchen will add the validation.
> Matt: Will v2 be ready in time for the 3.0 release?
> Ryan: Once #24246 is in, we can work on PRs in parallel, but it is not 
> looking good.
> -- 
> Ryan Blue
> Software Engineer
> Netflix
