Here are my notes from the last DSv2 sync. As always:

- If you’d like to attend the sync, send me an email and I’ll add you to the invite. Everyone is welcome.
- These notes are what I wrote down and remember. If you have corrections or comments, please reply.
*Topics*:

- TableCatalog PR #24246: https://github.com/apache/spark/pull/24246
- Remove SaveMode PR #24233: https://github.com/apache/spark/pull/24233
- Streaming capabilities PR #24129: https://github.com/apache/spark/pull/24129

*Attendees*:

Ryan Blue
John Zhuge
Matt Cheah
Yifei Huang
Bruce Robbins
Jamison Bennett
Russell Spitzer
Wenchen Fan
Yuanjian Li
(and others who arrived after the start)

*Discussion*:

- TableCatalog PR: https://github.com/apache/spark/pull/24246
  - Wenchen and Matt had just reviewed the PR. It mostly matched the SPIP, so there was not much discussion of the content.
  - Wenchen: This would be easier to review if the changes that move Table and TableCapability were in a separate PR (mostly import changes).
  - Ryan will open a separate PR for the move. [Ed: #24410]
  - Russell: How should caching work? He has hit lots of problems with Spark caching data that then gets out of date.
  - Ryan: Spark should always call into the catalog rather than cache tables, to avoid those problems. However, Spark should ensure that it uses the same Table instance for all scans in the same query, so that self-joins are consistent.
  - Some discussion of self-joins. The conclusion was that we don’t need to worry about this yet because it is unlikely.
  - Wenchen: Should this include the namespace methods?
  - Ryan: No, those are a separate concern and can be added in a parallel PR.
- Remove SaveMode PR: https://github.com/apache/spark/pull/24233
  - Wenchen: The PR is on hold waiting for the streaming capabilities PR, #24129, because the Noop sink doesn’t validate its input schema.
  - Wenchen will open a PR to add a capability for opting out of schema validation, then come back to this PR. [Ed: Appendix 1 below sketches how such a capability might be used.]
- Streaming capabilities PR: https://github.com/apache/spark/pull/24129
  - Ryan: This PR needs validation in the analyzer. The analyzer is where validations should live; otherwise, each validation must be copied into every code path that produces a streaming plan.
  - Wenchen: The write-side check can’t be written yet because the write node is never passed to the analyzer. Fixing that is a larger problem.
  - Ryan: Agreed that refactoring to pass the write node to the analyzer should be a separate change.
  - Wenchen: A check that either micro-batch or continuous can be used is hard to write because some sources may fall back to v1.
  - Ryan: By the time this check runs, fallback has already happened. Do v1 sources support continuous mode?
  - Wenchen: No, v1 doesn’t support continuous.
  - Ryan: Then the check can be written to assume that v1 sources only support micro-batch mode.
  - Wenchen will add this check.
  - Wenchen: The check that tables in a v2 streaming relation support either micro-batch or continuous won’t catch anything today, so it is unnecessary.
  - Ryan: These checks still need to be in the analyzer so that future uses do not break. We had the same problem moving to v2: because the schema checks were specific to the DataSource code paths, they were overlooked when adding v2. Running validations in the analyzer avoids problems like this. [Ed: Appendix 2 below sketches such a check.]
  - Wenchen will add the validation.
- Matt: Will v2 be ready in time for the 3.0 release?
- Ryan: Once #24246 is in, we can work on PRs in parallel, but it is not looking good.

--
Ryan Blue
Software Engineer
Netflix
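
[Ed: Appendix 1. A minimal, untested sketch of how a schema-validation opt-out capability could be used in a write-side check, for illustration only. This is not the code from Wenchen's PR; it assumes the ACCEPT_ANY_SCHEMA capability and the package names that Spark 3.x eventually shipped (org.apache.spark.sql.connector.catalog), and the equality test is a stand-in for Spark's real, more permissive compatibility rules.]

    import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
    import org.apache.spark.sql.types.StructType

    object WriteSchemaCheck {
      // Validate a write's input schema against the target table, unless the
      // table opts out via ACCEPT_ANY_SCHEMA (e.g. a no-op sink that discards
      // every row and so has no meaningful schema to validate against).
      def validate(table: Table, querySchema: StructType): Unit = {
        if (!table.capabilities.contains(TableCapability.ACCEPT_ANY_SCHEMA)) {
          // Simplified stand-in check: Spark's real validation allows
          // compatible casts rather than requiring exact schema equality.
          require(querySchema == table.schema,
            s"Cannot write to table ${table.name}: incompatible schema")
        }
      }
    }

A sink like Noop would declare ACCEPT_ANY_SCHEMA and skip the validation entirely, which is what unblocks the SaveMode removal PR.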
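
[Ed: Appendix 2. A rough sketch of the analyzer-side capability check discussed above, for illustration only; the names and structure are assumptions, not the code from #24129. It uses the MICRO_BATCH_READ and CONTINUOUS_READ capabilities from Spark 3.x, and per the conclusion above, a source that fell back to v1 can be treated as micro-batch only, so only v2 tables need this inspection.]

    import org.apache.spark.sql.connector.catalog.{Table, TableCapability}

    object StreamingReadCheck {
      // Intended to run as an analyzer check so that every code path that
      // produces a streaming plan gets the same validation. A real analyzer
      // check would raise AnalysisException; a plain exception keeps this
      // sketch self-contained.
      def validate(table: Table): Unit = {
        val caps = table.capabilities
        val readable = caps.contains(TableCapability.MICRO_BATCH_READ) ||
          caps.contains(TableCapability.CONTINUOUS_READ)
        if (!readable) {
          throw new UnsupportedOperationException(
            s"Table ${table.name} supports neither micro-batch nor continuous reads")
        }
      }
    }

Keeping this in the analyzer, rather than in each streaming code path, is the point Ryan made above: a future code path that builds a streaming plan gets the check for free.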