Here are my notes from the last DSv2 sync. As always:

- If you’d like to attend the sync, send me an email and I’ll add you to the invite. Everyone is welcome.
- These notes are what I wrote down and remember. If you have corrections or comments, please reply.
*Topics*:

- TableCatalog PR #24246: https://github.com/apache/spark/pull/24246
- Remove SaveMode PR #24233: https://github.com/apache/spark/pull/24233
- Streaming capabilities PR #24129: https://github.com/apache/spark/pull/24129

*Attendees*:

Ryan Blue
John Zhuge
Matt Cheah
Yifei Huang
Bruce Robbins
Jamison Bennett
Russell Spitzer
Wenchen Fan
Yuanjian Li
(and others who arrived after the start)

*Discussion*:

- TableCatalog PR: https://github.com/apache/spark/pull/24246
  - Wenchen and Matt had just reviewed the PR. It mostly matched the SPIP, so there was not much discussion of the content.
  - Wenchen: This would be easier to review if the changes that move Table and TableCapability were in a separate PR (mostly import changes).
  - Ryan will open a separate PR for the move. [Ed: #24410]
  - Russell: How should caching work? He has hit lots of problems with Spark caching data that then gets out of date.
  - Ryan: Spark should always call into the catalog rather than cache tables, to avoid those problems. However, Spark should ensure that it uses the same Table instance for all scans in the same query, so that self-joins are consistent.
  - Some discussion of self-joins. The conclusion was that we don’t need to worry about this yet because it is unlikely.
  - Wenchen: Should this include the namespace methods?
  - Ryan: No, those are a separate concern and can be added in a parallel PR.
- Remove SaveMode PR: https://github.com/apache/spark/pull/24233
  - Wenchen: The PR is on hold waiting for the streaming capabilities PR, #24129, because the Noop sink doesn’t validate its input schema.
  - Wenchen will open a PR to add a capability for opting out of schema validation, then come back to this PR. [Ed: Appendix 1 below sketches how such a capability might be used.]
- Streaming capabilities PR: https://github.com/apache/spark/pull/24129
  - Ryan: This PR needs validation in the analyzer. The analyzer is where validations should live; otherwise, each validation must be copied into every code path that produces a streaming plan.
  - Wenchen: The write-side check can’t be written yet because the write node is never passed to the analyzer. Fixing that is a larger problem.
  - Ryan: Agreed that refactoring to pass the write node to the analyzer should be a separate change.
  - Wenchen: A check that either micro-batch or continuous can be used is hard to write because some sources may fall back to v1.
  - Ryan: By the time this check runs, fallback has already happened. Do v1 sources support continuous mode?
  - Wenchen: No, v1 doesn’t support continuous.
  - Ryan: Then the check can be written to assume that v1 sources only support micro-batch mode.
  - Wenchen will add this check.
  - Wenchen: The check that tables in a v2 streaming relation support either micro-batch or continuous won’t catch anything today, so it is unnecessary.
  - Ryan: These checks still need to be in the analyzer so that future uses do not break. We had the same problem moving to v2: because the schema checks were specific to the DataSource code paths, they were overlooked when adding v2. Running validations in the analyzer avoids problems like this. [Ed: Appendix 2 below sketches such a check.]
  - Wenchen will add the validation.
- Matt: Will v2 be ready in time for the 3.0 release?
- Ryan: Once #24246 is in, we can work on PRs in parallel, but it is not looking good.

--
Ryan Blue
Software Engineer
Netflix
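
[Ed: Appendix 1. A minimal, untested sketch of how a schema-validation opt-out capability could be used in a write-side check, for illustration only. This is not the code from Wenchen's PR; it assumes the ACCEPT_ANY_SCHEMA capability and the package names that Spark 3.x eventually shipped (org.apache.spark.sql.connector.catalog), and the equality test is a stand-in for Spark's real, more permissive compatibility rules.]

    import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
    import org.apache.spark.sql.types.StructType

    object WriteSchemaCheck {
      // Validate a write's input schema against the target table, unless the
      // table opts out via ACCEPT_ANY_SCHEMA (e.g. a no-op sink that discards
      // every row and so has no meaningful schema to validate against).
      def validate(table: Table, querySchema: StructType): Unit = {
        if (!table.capabilities.contains(TableCapability.ACCEPT_ANY_SCHEMA)) {
          // Simplified stand-in check: Spark's real validation allows
          // compatible casts rather than requiring exact schema equality.
          require(querySchema == table.schema,
            s"Cannot write to table ${table.name}: incompatible schema")
        }
      }
    }

A sink like Noop would declare ACCEPT_ANY_SCHEMA and skip the validation entirely, which is what unblocks the SaveMode removal PR.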
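
[Ed: Appendix 2. A rough sketch of the analyzer-side capability check discussed above, for illustration only; the names and structure are assumptions, not the code from #24129. It uses the MICRO_BATCH_READ and CONTINUOUS_READ capabilities from Spark 3.x, and per the conclusion above, a source that fell back to v1 can be treated as micro-batch only, so only v2 tables need this inspection.]

    import org.apache.spark.sql.connector.catalog.{Table, TableCapability}

    object StreamingReadCheck {
      // Intended to run as an analyzer check so that every code path that
      // produces a streaming plan gets the same validation. A real analyzer
      // check would raise AnalysisException; a plain exception keeps this
      // sketch self-contained.
      def validate(table: Table): Unit = {
        val caps = table.capabilities
        val readable = caps.contains(TableCapability.MICRO_BATCH_READ) ||
          caps.contains(TableCapability.CONTINUOUS_READ)
        if (!readable) {
          throw new UnsupportedOperationException(
            s"Table ${table.name} supports neither micro-batch nor continuous reads")
        }
      }
    }

Keeping this in the analyzer, rather than in each streaming code path, is the point Ryan made above: a future code path that builds a streaming plan gets the check for free.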