Here are my notes from the last DSv2 sync. Sorry it's a bit late!
*Attendees*:
Ryan Blue
John Zhuge
Raymond McCollum
Terry Kim
Gengliang Wang
Jose Torres
Wenchen Fan
Priyanka Gomatam
Matt Cheah
Russell Spitzer
Burak Yavuz
*Topics*:
- Check in on blockers
- Remove SaveMode
- Reorganize code - waiting for INSERT INTO?
- Write docs - should be done after 3.0 branching
- Open PRs
- V2 session catalog config:
https://github.com/apache/spark/pull/25104
- DESCRIBE TABLE: https://github.com/apache/spark/pull/25040
- INSERT INTO: https://github.com/apache/spark/pull/24832
- SupportsNamespaces: https://github.com/apache/spark/pull/24560
- SHOW TABLES: https://github.com/apache/spark/pull/25247
- DELETE FROM: https://github.com/apache/spark/pull/21308 and
https://github.com/apache/spark/pull/25115
- DELETE FROM approach
- Filter push-down and stats - move to optimizer?
- Use v2 ALTER TABLE implementations for v1 tables
- CatalogPlugin changes
- Reuse the existing Parquet readers?
*Discussion*:
- Blockers
- Remove SaveMode from file sources: Blocked by
TableProvider/CatalogPlugin changes. Doesn’t work with all of the USING
clauses from v1, like JDBC. Working on a CatalogPlugin fix.
- Reorganize packages: Blocked by outstanding INSERT INTO PRs
- Docs: Ryan: docs can be written after branching, so focus should be
on stability right now
- Any other blockers? Please send them to Ryan to track
- V2 session catalog config PR:
- Wenchen: this will be included in CatalogPlugin changes
- DESCRIBE TABLE PR:
- Matt: waiting for review
- Burak: partitioning is strange, uses “Part 0” instead of names
- Ryan: there are no names for transform partitions (identity
partitions use column names)
- Conclusion: not a big problem since there is no required schema; we
can update later if better ideas come up
- INSERT INTO PR:
- Ryan: ready for another review, DataFrameWriter.insertInto PR will
follow
- SupportsNamespaces PR:
- Ryan: ready for another review
- SHOW TABLES PR:
- Terry: there are open questions: what is the current database for
v2?
- Ryan: there should be a current namespace in the SessionState. This
could be per catalog?
- Conclusion: do not track current namespace per catalog. Reset to a
catalog default when current catalog changes
- Ryan: will add a SupportsNamespaces method for the default namespace,
used to initialize the current namespace.
- Burak: USE foo.bar could set both
- What if SupportsNamespaces is not implemented? Default to Seq.empty
- Terry: should listing methods support search patterns?
- Ryan: this adds complexity that should be handled by Spark instead
of complicating the API. There isn’t a performance need to push this down
because we don’t expect high cardinality for a namespace level.
- Conclusion: implement in SHOW TABLES exec
- Terry: how should temporary tables be handled?
- Wenchen: a temporary table is an alias for a temporary view. SHOW TABLES
does list temporary views, so v2 should implement the same behavior.
- Terry: support EXTENDED?
- Ryan: This can be done later.
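The current-namespace conclusion above can be sketched roughly as follows. The method name defaultNamespace() follows the discussion, but the exact API shape here is an assumption, not a merged interface:

```java
// Sketch of the current-namespace handling concluded above. Names
// (SupportsNamespaces, defaultNamespace) follow the discussion; the
// exact signatures are assumptions.
interface CatalogPlugin {
    String name();
}

interface SupportsNamespaces extends CatalogPlugin {
    // Proposed: a catalog exposes a default namespace that Spark uses
    // to initialize the current namespace when the catalog is selected.
    String[] defaultNamespace();
}

class SessionState {
    private String[] currentNamespace = new String[0];

    // Conclusion from the sync: do not track a current namespace per
    // catalog; reset to the new catalog's default when the current
    // catalog changes, or to empty if SupportsNamespaces is missing.
    void setCurrentCatalog(CatalogPlugin catalog) {
        if (catalog instanceof SupportsNamespaces) {
            currentNamespace = ((SupportsNamespaces) catalog).defaultNamespace();
        } else {
            currentNamespace = new String[0]; // Seq.empty analogue
        }
    }

    String[] currentNamespace() {
        return currentNamespace;
    }
}
```

A statement like USE foo.bar would then set the catalog first and the namespace second, matching Burak's point that it could set both.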
- DELETE FROM PR:
- Wenchen: DELETE FROM just passes filters to the data source to
delete
- Ryan: Instead of a complicated builder, let’s solve just the simple
case (filters) and not the row-level delete case. If we do that, then we
can use a simple SupportsDelete interface and put off row-level delete
design
- Consensus was to add a SupportsDelete interface for Table and not a
new builder
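The consensus above — a simple mixin on Table that accepts pushed-down filters, rather than a new builder — could look roughly like this. The EqualTo class is a stand-in for Spark's filter expressions, and the in-memory table is purely illustrative:

```java
// Minimal sketch of the SupportsDelete consensus: a mixin on Table
// taking pushed-down filters. EqualTo stands in for Spark's filter
// classes; InMemoryTable is a toy implementation for illustration.
import java.util.ArrayList;
import java.util.List;

class EqualTo {
    final String column;
    final Object value;
    EqualTo(String column, Object value) { this.column = column; this.value = value; }
}

interface Table {
    String name();
}

interface SupportsDelete extends Table {
    // Delete every row matching the conjunction of the given filters.
    // Row-level delete design is deliberately deferred.
    void deleteWhere(EqualTo[] filters);
}

// Toy implementation: two-column rows addressed as "key" and "value".
class InMemoryTable implements SupportsDelete {
    final List<Object[]> rows = new ArrayList<>();

    public String name() { return "kv"; }

    public void deleteWhere(EqualTo[] filters) {
        rows.removeIf(row -> {
            for (EqualTo f : filters) {
                int idx = "key".equals(f.column) ? 0 : 1;
                if (!row[idx].equals(f.value)) {
                    return false; // row does not match all filters
                }
            }
            return true; // matches every filter: delete it
        });
    }
}
```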
- Stats push-down fix:
- Ryan: briefly looked into it and this can probably be done earlier,
in the optimizer by creating a scan early and a special logical plan to
wrap a scan. This isn’t a good long-term solution but would fix stats for
the release. Write side would not change.
- Ryan will submit a PR with the implementation
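The interim fix Ryan described — building the Scan early during optimization and wrapping it in a special logical node so stats reflect pushed filters — can be sketched as below. All names here are illustrative, not the eventual plan nodes:

```java
// Rough sketch of the interim stats fix: create the Scan early (after
// filter push-down, in the optimizer) and wrap it in a dedicated
// logical node so planning sees post-push-down row counts.
interface Scan {
    // Statistics reflecting filters already pushed to the source.
    long estimatedRowCount();
}

class EarlyScanRelation {
    private final Scan scan;

    EarlyScanRelation(Scan scan) { this.scan = scan; }

    // The optimizer asks this node for stats instead of the original
    // relation, so e.g. join planning sees the filtered row count.
    long rowCount() {
        return scan.estimatedRowCount();
    }
}
```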
- Using ALTER TABLE implementations for v1
- Burak: Took a stab at this, but ran into problems. Would be nice if
all DDL for v1 were supported through v2 API
- DDL doesn’t work with v1 for custom data sources whose source of
truth is not Hive
- Matt: v2 should be used to change the source of truth. v1 behavior
is to only change the session catalog (e.g., Hive).
- Matt: is v1 deprecated?
- Wenchen: not until stable
- Burak: can’t deprecate yet
- Burak: CTAS and RTAS could also call v1
- Ryan: We could build a v2 implementation that calls v1, but only
append and read could be supported because v1 overwrite behavior is
unreliable across sources.
- Ran out of time
- Wenchen’s CatalogPlugin changes can be discussed next time
- Ryan will follow up with Raymond about reusing Parquet read path in
other v2 sources
--
Ryan Blue
Software Engineer
Netflix