Here are my notes from the last DSv2 sync. Sorry it's a bit late!

*Attendees*:

Ryan Blue
John Zhuge
Raymond McCollum
Terry Kim
Gengliang Wang
Jose Torres
Wenchen Fan
Priyanka Gomatam
Matt Cheah
Russel Spitzer
Burak Yavuz

*Topics*:

- Check in on blockers
  - Remove SaveMode
  - Reorganize code - waiting for INSERT INTO?
  - Write docs - should be done after 3.0 branching
- Open PRs
  - V2 session catalog config: https://github.com/apache/spark/pull/25104
  - DESCRIBE TABLE: https://github.com/apache/spark/pull/25040
  - INSERT INTO: https://github.com/apache/spark/pull/24832
  - SupportsNamespaces: https://github.com/apache/spark/pull/24560
  - SHOW TABLES: https://github.com/apache/spark/pull/25247
  - DELETE FROM: https://github.com/apache/spark/pull/21308 and https://github.com/apache/spark/pull/25115
- DELETE FROM approach
- Filter push-down and stats - move to optimizer?
- Use v2 ALTER TABLE implementations for v1 tables
- CatalogPlugin changes
- Reuse the existing Parquet readers?

*Discussion*:

- Blockers
  - Remove SaveMode from file sources: Blocked by TableProvider/CatalogPlugin changes. Doesn’t work with all of the USING clauses from v1, like JDBC. Working on a CatalogPlugin fix.
  - Reorganize packages: Blocked by outstanding INSERT INTO PRs
  - Docs: Ryan: docs can be written after branching, so focus should be on stability right now
  - Any other blockers? Please send them to Ryan to track
- V2 session catalog config PR:
  - Wenchen: this will be included in CatalogPlugin changes
- DESCRIBE TABLE PR:
  - Matt: waiting for review
  - Burak: partitioning is strange, uses “Part 0” instead of names
  - Ryan: there are no names for transform partitions (identity partitions use column names)
  - Conclusion: not a big problem since there is no required schema, we can update later if better ideas come up
- INSERT INTO PR:
  - Ryan: ready for another review, DataFrameWriter.insertInto PR will follow
- SupportsNamespaces PR:
  - Ryan: ready for another review
- SHOW TABLES PR:
  - Terry: there are open questions: what is the current database for v2?
    - Ryan: there should be a current namespace in the SessionState. This could be per catalog?
    - Conclusion: do not track current namespace per catalog. Reset to a catalog default when current catalog changes
    - Ryan: will add a SupportsNamespaces method for the default namespace, used to initialize the current namespace.
    - Burak: USE foo.bar could set both
    - What if SupportsNamespaces is not implemented? Default to Seq.empty
  - Terry: should listing methods support search patterns?
    - Ryan: this adds complexity that should be handled by Spark instead of complicating the API. There isn’t a performance need to push this down because we don’t expect high cardinality for a namespace level.
    - Conclusion: implement in SHOW TABLES exec
  - Terry: how should temporary tables be handled?
    - Wenchen: temporary table is an alias for temporary view. SHOW TABLES does list temporary views, v2 should implement the same behavior.
  - Terry: support EXTENDED?
    - Ryan: This can be done later.
- DELETE FROM PR:
  - Wenchen: DELETE FROM just passes filters to the data source to delete
  - Ryan: Instead of a complicated builder, let’s solve just the simple case (filters) and not the row-level delete case. If we do that, then we can use a simple SupportsDelete interface and put off row-level delete design
  - Consensus was to add a SupportsDelete interface for Table and not a new builder
- Stats push-down fix:
  - Ryan: briefly looked into it and this can probably be done earlier, in the optimizer, by creating a scan early and a special logical plan to wrap a scan. This isn’t a good long-term solution but would fix stats for the release. Write side would not change.
  - Ryan will submit a PR with the implementation
- Using ALTER TABLE implementations for v1
  - Burak: Took a stab at this, but ran into problems. Would be nice if all DDL for v1 were supported through v2 API
    - DDL doesn’t work with v1 for custom data sources - if the source of truth is not Hive
  - Matt: v2 should be used to change the source of truth. v1 behavior is to only change the session catalog (e.g., Hive).
  - Matt: is v1 deprecated?
    - Wenchen: not until stable
    - Burak: can’t deprecate yet
  - Burak: CTAS and RTAS could also call v1
  - Ryan: We could build a v2 implementation that calls v1, but only append and read could be supported because v1 overwrite behavior is unreliable across sources.
- Ran out of time
  - Wenchen’s CatalogPlugin changes can be discussed next time
  - Ryan will follow up with Raymond about reusing Parquet read path in other v2 sources

-- 
Ryan Blue
Software Engineer
Netflix
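
Addendum on the DELETE FROM consensus: to make the "filters only, no builder" idea concrete, here is a rough sketch of what a simple SupportsDelete mix-in for Table might look like. Everything in it is an assumption for illustration: the Table and Filter types are hypothetical stand-ins (not the real Spark interfaces), and the deleteWhere method name, InMemoryTable, and DeleteFromExec are made up here. The actual signature is whatever comes out of the PR review.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the real Spark interfaces.
interface Table {
    String name();
}

// A pushed-down predicate over a row (hypothetical shape).
interface Filter {
    boolean matches(Map<String, Object> row);
}

// Mix-in for tables that can delete by filter. No builder, no row-level delete.
interface SupportsDelete {
    // Deletes every row that matches ALL of the given filters.
    void deleteWhere(List<Filter> filters);
}

// Toy in-memory table used only to exercise the interface.
class InMemoryTable implements Table, SupportsDelete {
    private final List<Map<String, Object>> rows = new ArrayList<>();

    static InMemoryTable ofIds(int... ids) {
        InMemoryTable table = new InMemoryTable();
        for (int id : ids) {
            Map<String, Object> row = new HashMap<>();
            row.put("id", id);
            table.rows.add(row);
        }
        return table;
    }

    @Override
    public String name() { return "in_memory"; }

    @Override
    public void deleteWhere(List<Filter> filters) {
        // An empty filter list matches every row, like DELETE FROM with no WHERE clause.
        rows.removeIf(row -> filters.stream().allMatch(f -> f.matches(row)));
    }

    List<Integer> ids() {
        List<Integer> result = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            result.add((Integer) row.get("id"));
        }
        return result;
    }
}

// What a DELETE FROM exec node might do: check the capability, then pass the filters through.
class DeleteFromExec {
    static void run(Table table, List<Filter> filters) {
        if (!(table instanceof SupportsDelete)) {
            throw new UnsupportedOperationException(
                "Table does not support DELETE FROM: " + table.name());
        }
        ((SupportsDelete) table).deleteWhere(filters);
    }
}
```

The point of the sketch is that the source only ever sees a list of filters, so row-level delete design can be deferred without changing this interface. A table that does not implement the mix-in simply fails at the capability check.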