This may be completely inappropriate, and I apologize if it is; nevertheless, I am trying to get some clarification about the current status of the DS API.
Please tell me where I am wrong: Currently, the stable API is v1. There is a v2 DS API, but it is not widely used. The group is working on a “new” v2 API that will be available after the release of Spark v3.

jg
--
Jean Georges Perrin
j...@jgp.net

> On Apr 19, 2019, at 10:10, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> Here are my notes from the last DSv2 sync. As always:
>
> If you’d like to attend the sync, send me an email and I’ll add you to the invite. Everyone is welcome.
> These notes are what I wrote down and remember. If you have corrections or comments, please reply.
>
> Topics:
>
> TableCatalog PR #24246: https://github.com/apache/spark/pull/24246
> Remove SaveMode PR #24233: https://github.com/apache/spark/pull/24233
> Streaming capabilities PR #24129: https://github.com/apache/spark/pull/24129
>
> Attendees:
>
> Ryan Blue
> John Zhuge
> Matt Cheah
> Yifei Huang
> Bruce Robbins
> Jamison Bennett
> Russell Spitzer
> Wenchen Fan
> Yuanjian Li
> (and others who arrived after the start)
>
> Discussion:
>
> TableCatalog PR: https://github.com/apache/spark/pull/24246
> Wenchen and Matt had just reviewed the PR. Mostly what was in the SPIP, so not much discussion of content.
> Wenchen: Easier to review if the changes to move Table and TableCapability were in a separate PR (mostly import changes)
> Ryan will open a separate PR for the move [Ed: #24410]
> Russell: How should caching work? Has hit lots of problems with Spark caching data and getting out of date
> Ryan: Spark should always call into the catalog and not cache, to avoid those problems. However, Spark should ensure that it uses the same instance of a Table for all scans in the same query, for consistent self-joins.
> Some discussion of self-joins. Conclusion was that we don’t need to worry about this yet because it is unlikely.
> Wenchen: Should this include the namespace methods?
> Ryan: No, those are a separate concern and can be added in a parallel PR.
>
> Remove SaveMode PR: https://github.com/apache/spark/pull/24233
> Wenchen: PR is on hold waiting for streaming capabilities, #24129, because the Noop sink doesn’t validate schema
> Wenchen will open a PR to add a capability to opt out of schema validation, then come back to this PR.
>
> Streaming capabilities PR: https://github.com/apache/spark/pull/24129
> Ryan: This PR needs validation in the analyzer. The analyzer is where validations should exist, or else validations must be copied into every code path that produces a streaming plan.
> Wenchen: The write check can’t be written because the write node is never passed to the analyzer. Fixing that is a larger problem.
> Ryan: Agree that refactoring to pass the write node to the analyzer should be separate.
> Wenchen: A check to ensure that either microbatch or continuous can be used is hard because some sources may fall back
> Ryan: By the time this check runs, fallback has happened. Do v1 sources support continuous mode?
> Wenchen: No, v1 doesn’t support continuous
> Ryan: Then this can be written to assume that v1 sources only support microbatch mode.
> Wenchen will add this check
> Wenchen: The check that tables in a v2 streaming relation support either microbatch or continuous won’t catch anything and is unnecessary
> Ryan: These checks still need to be in the analyzer so future uses do not break. We had the same problem moving to v2: because schema checks were specific to DataSource code paths, they were overlooked when adding v2. Running validations in the analyzer avoids problems like this.
> Wenchen will add the validation.
>
> Matt: Will v2 be ready in time for the 3.0 release?
> Ryan: Once #24246 is in, we can work on PRs in parallel, but it is not looking good.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
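
As a side note on the caching discussion in the notes above, here is a rough Scala sketch of the behaviour Ryan describes: Spark goes back to the catalog rather than caching table metadata across queries, but reuses one Table instance for every scan of the same table within a single query so self-joins stay consistent. The TableCatalog, Table, and Identifier names follow the SPIP's vocabulary, but the signatures and the per-query resolver helper are my own illustrative assumptions, not the code in PR #24246.

// Rough sketch of the "no caching, but one Table instance per query" idea
// from the caching discussion -- not the API in PR #24246.
import scala.collection.mutable

case class Identifier(namespace: Seq[String], name: String)

trait Table { def name: String }

trait TableCatalog {
  // Spark always resolves through the catalog; nothing is cached across queries.
  def loadTable(ident: Identifier): Table
}

// Hypothetical per-query helper: every scan of the same identifier inside one
// query gets the same Table instance, so a self-join reads a consistent
// snapshot, while a later query will load the table again.
class QueryTableResolver(catalog: TableCatalog) {
  private val resolved = mutable.Map.empty[Identifier, Table]

  def resolve(ident: Identifier): Table =
    resolved.getOrElseUpdate(ident, catalog.loadTable(ident))
}

A fresh QueryTableResolver per query would avoid the stale-cache problems Russell mentioned while still keeping self-joins consistent within a query.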
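
And on the streaming capability check: as I read it, the analyzer would reject a v2 streaming table that advertises neither micro-batch nor continuous reads, while v1 sources are treated as micro-batch only. This is only a sketch of the idea; the relation classes and the rule name below are made up for illustration, and the booleans stand in for TableCapability flags such as MICRO_BATCH_READ and CONTINUOUS_READ.

// Rough sketch of the analyzer-side check discussed above -- not the actual
// rule from PR #24129. Class and method names here are illustrative.

sealed trait StreamRelation { def name: String }

// Booleans stand in for capability flags on a v2 table.
case class V2StreamRelation(name: String, microBatch: Boolean, continuous: Boolean)
    extends StreamRelation

// v1 sources are assumed to support micro-batch only, per the discussion.
case class V1StreamRelation(name: String) extends StreamRelation

object StreamingCapabilityCheck {
  // Run once in the analyzer so every code path that builds a streaming plan
  // gets the same validation.
  def validate(relations: Seq[StreamRelation]): Unit = relations.foreach {
    case V2StreamRelation(name, false, false) =>
      throw new UnsupportedOperationException(
        s"Table $name supports neither micro-batch nor continuous streaming reads")
    case _ =>
      // v2 tables with at least one read mode, and v1 sources (micro-batch), pass.
  }
}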