Here are my notes from the last DSv2 sync. Sorry it's a bit late!
*Attendees*:
Ryan Blue
John Zhuge
Raymond McCollum
Terry Kim
Gengliang Wang
Jose Torres
Wenchen Fan
Priyanka Gomatam
Matt Cheah
Russell Spitzer
Burak Yavuz
*Topics*:
- Check in on blockers
- Remove SaveMode
- Reorganize code - waiting for INSERT INTO?
- Write docs - should be done after 3.0 branching
- Open PRs
- V2 session catalog config:
https://github.com/apache/spark/pull/25104
- DESCRIBE TABLE: https://github.com/apache/spark/pull/25040
- INSERT INTO: https://github.com/apache/spark/pull/24832
- SupportsNamespaces: https://github.com/apache/spark/pull/24560
- SHOW TABLES: https://github.com/apache/spark/pull/25247
- DELETE FROM: https://github.com/apache/spark/pull/21308 and
https://github.com/apache/spark/pull/25115
- DELETE FROM approach
- Filter push-down and stats - move to optimizer?
- Use v2 ALTER TABLE implementations for v1 tables
- CatalogPlugin changes
- Reuse the existing Parquet readers?
*Discussion*:
- Blockers
- Remove SaveMode from file sources: Blocked by
TableProvider/CatalogPlugin changes. Doesn’t work with all of the USING
clauses from v1, like JDBC. Working on a CatalogPlugin fix.
- Reorganize packages: Blocked by outstanding INSERT INTO PRs
- Docs: Ryan: docs can be written after branching, so focus should be
on stability right now
- Any other blockers? Please send them to Ryan to track
- V2 session catalog config PR:
- Wenchen: this will be included in CatalogPlugin changes
- DESCRIBE TABLE PR:
- Matt: waiting for review
- Burak: partitioning is strange, uses “Part 0” instead of names
- Ryan: there are no names for transform partitions (identity
partitions use column names)
- Conclusion: not a big problem since there is no required schema; we
can update later if better ideas come up
- INSERT INTO PR:
- Ryan: ready for another review, DataFrameWriter.insertInto PR will
follow
- SupportsNamespaces PR:
- Ryan: ready for another review
- SHOW TABLES PR:
- Terry: there are open questions: what is the current database for
v2?
- Ryan: there should be a current namespace in the SessionState. This
could be per catalog?
- Conclusion: do not track current namespace per catalog. Reset to a
catalog default when current catalog changes
- Ryan: will add a SupportsNamespaces method for the default namespace,
used to initialize the current namespace.
- Burak: USE foo.bar could set both
- What if SupportsNamespaces is not implemented? Default to Seq.empty
- Terry: should listing methods support search patterns?
- Ryan: this adds complexity that should be handled by Spark instead
of complicating the API. There isn’t a performance need to push this down
because we don’t expect high cardinality for a namespace level.
- Conclusion: implement in SHOW TABLES exec
- Terry: how should temporary tables be handled?
- Wenchen: a temporary table is an alias for a temporary view. SHOW TABLES
does list temporary views, so v2 should implement the same behavior.
- Terry: support EXTENDED?
- Ryan: This can be done later.
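The current-namespace conclusion above can be sketched roughly as follows. The method name defaultNamespace() follows the discussion, but the exact API shape here is an assumption, not a merged interface:

```java
// Sketch of the current-namespace handling concluded above. Names
// (SupportsNamespaces, defaultNamespace) follow the discussion; the
// exact signatures are assumptions.
interface CatalogPlugin {
    String name();
}

interface SupportsNamespaces extends CatalogPlugin {
    // Proposed: a catalog exposes a default namespace that Spark uses
    // to initialize the current namespace when the catalog is selected.
    String[] defaultNamespace();
}

class SessionState {
    private String[] currentNamespace = new String[0];

    // Conclusion from the sync: do not track a current namespace per
    // catalog; reset to the new catalog's default when the current
    // catalog changes, or to empty if SupportsNamespaces is missing.
    void setCurrentCatalog(CatalogPlugin catalog) {
        if (catalog instanceof SupportsNamespaces) {
            currentNamespace = ((SupportsNamespaces) catalog).defaultNamespace();
        } else {
            currentNamespace = new String[0]; // Seq.empty analogue
        }
    }

    String[] currentNamespace() {
        return currentNamespace;
    }
}
```

A statement like USE foo.bar would then set the catalog first and the namespace second, matching Burak's point that it could set both.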
- DELETE FROM PR:
- Wenchen: DELETE FROM just passes filters to the data source to
delete
- Ryan: Instead of a complicated builder, let’s solve just the simple
case (filters) and not the row-level delete case. If we do that, then we
can use a simple SupportsDelete interface and put off row-level delete
design
- Consensus was to add a SupportsDelete interface for Table and not a
new builder
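The consensus above — a simple mixin on Table that accepts pushed-down filters, rather than a new builder — could look roughly like this. The EqualTo class is a stand-in for Spark's filter expressions, and the in-memory table is purely illustrative:

```java
// Minimal sketch of the SupportsDelete consensus: a mixin on Table
// taking pushed-down filters. EqualTo stands in for Spark's filter
// classes; InMemoryTable is a toy implementation for illustration.
import java.util.ArrayList;
import java.util.List;

class EqualTo {
    final String column;
    final Object value;
    EqualTo(String column, Object value) { this.column = column; this.value = value; }
}

interface Table {
    String name();
}

interface SupportsDelete extends Table {
    // Delete every row matching the conjunction of the given filters.
    // Row-level delete design is deliberately deferred.
    void deleteWhere(EqualTo[] filters);
}

// Toy implementation: two-column rows addressed as "key" and "value".
class InMemoryTable implements SupportsDelete {
    final List<Object[]> rows = new ArrayList<>();

    public String name() { return "kv"; }

    public void deleteWhere(EqualTo[] filters) {
        rows.removeIf(row -> {
            for (EqualTo f : filters) {
                int idx = "key".equals(f.column) ? 0 : 1;
                if (!row[idx].equals(f.value)) {
                    return false; // row does not match all filters
                }
            }
            return true; // matches every filter: delete it
        });
    }
}
```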
- Stats push-down fix:
- Ryan: briefly looked into it and this can probably be done earlier,
in the optimizer by creating a scan early and a special logical plan to
wrap a scan. This isn’t a good long-term solution but would fix stats for
the release. Write side would not change.
- Ryan will submit a PR with the implementation
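The interim fix Ryan described — building the Scan early during optimization and wrapping it in a special logical node so stats reflect pushed filters — can be sketched as below. All names here are illustrative, not the eventual plan nodes:

```java
// Rough sketch of the interim stats fix: create the Scan early (after
// filter push-down, in the optimizer) and wrap it in a dedicated
// logical node so planning sees post-push-down row counts.
interface Scan {
    // Statistics reflecting filters already pushed to the source.
    long estimatedRowCount();
}

class EarlyScanRelation {
    private final Scan scan;

    EarlyScanRelation(Scan scan) { this.scan = scan; }

    // The optimizer asks this node for stats instead of the original
    // relation, so e.g. join planning sees the filtered row count.
    long rowCount() {
        return scan.estimatedRowCount();
    }
}
```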
- Using ALTER TABLE implementations for v1
- Burak: Took a stab at this, but ran into problems. Would be nice if
all DDL for v1 were supported through v2 API
- DDL doesn’t work with v1 for custom data sources whose source of
truth is not Hive
- Matt: v2 should be used to change the source of truth. v1 behavior
is to only change the session catalog (e.g., Hive).
- Matt: is v1 deprecated?
- Wenchen: not until stable
- Burak: can’t deprecate yet
- Burak: CTAS and RTAS could also call v1
- Ryan: We could build a v2 implementation that calls v1, but only
append and read could be supported because v1 overwrite behavior is
unreliable across sources.
- Ran out of time
- Wenchen’s CatalogPlugin changes can be discussed next time
- Ryan will follow up with Raymond about reusing Parquet read path in
other v2 sources
--
Ryan Blue
Software Engineer
Netflix