Here are my notes from last night’s sync. I had to leave early, so there
may have been more discussion after I left. Others can fill in the details
for those topics.
*Attendees*:
John Zhuge
Ryan Blue
Yifei Huang
Matt Cheah
Yuanjian Li
Russell Spitzer
Kevin Yu
*Topics*:
- Atomic extensions for the TableCatalog API
- Moving DSv2 to Catalyst - should this include package renames?
- Catalogs and table resolution: proposal to prefer default v2 catalog
when defined
*Notes*:
- Skipping discussion of open PRs
- Atomic table catalogs:
- Matt: the proposal in the SPIP makes sense. When should Spark use
the atomic API? Is there a way for a user to signal that Spark should use
the staging calls? Spark could use SQL transaction statements for this.
- Ryan: the atomic operations that we are currently targeting with
the TableCatalog extensions are single statements, like CREATE TABLE AS
SELECT. Transaction statements (e.g., BEGIN) are for multi-statement
transactions and are out of scope.
- Ryan: Because the expected behavior of the commands (CTAS, RTAS) is
that they are atomic, Spark should always use atomic implementations if
they are available. No need for a user to opt in.
- Matt: What should REPLACE TABLE do if transactions are not
supported? If the write fails, the table would be deleted.
- Ryan: REPLACE is a combination of DROP TABLE and CREATE TABLE AS
SELECT. By using it, the user is signaling that if a combined operation is
possible, Spark should use it. So REPLACE TABLE signals intent to drop, and
it is the right thing to drop the table if an atomic replace is not
supported.
- There was also some confusion about whether IF EXISTS should be
supported. The consensus was that REPLACE TABLE AS SELECT is expected to be
idempotent and should not fail if the target table does not exist.
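
The REPLACE TABLE behavior described in these notes can be sketched as a
toy model. This is only an illustration under stated assumptions: the
InMemoryCatalog class, its supports_staging flag, and the
replace_table_as_select function are hypothetical names and are not part
of Spark's TableCatalog API.

```python
class InMemoryCatalog:
    """Toy catalog mapping table names to lists of rows (hypothetical, not Spark's API)."""

    def __init__(self, supports_staging):
        # supports_staging is an assumed flag standing in for "the catalog
        # implements the atomic extensions".
        self.supports_staging = supports_staging
        self.tables = {}


def replace_table_as_select(catalog, name, produce_rows):
    """Replace table `name` with the rows returned by `produce_rows()`.

    Succeeds whether or not the table already exists.
    """
    if catalog.supports_staging:
        # Atomic path: build the new contents first and publish them only
        # if the write succeeds. A failed write leaves the old table intact.
        staged = list(produce_rows())
        catalog.tables[name] = staged
    else:
        # Non-atomic fallback: drop, then create and write.
        # A failed write leaves the table dropped.
        catalog.tables.pop(name, None)
        catalog.tables[name] = list(produce_rows())
```

In this sketch the caller never opts in: the atomic path is taken whenever
the catalog supports it, and the fallback drops the table first because
REPLACE already signals intent to drop.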
- Moving DSv2 to Catalyst - skipped because Wenchen did not attend
- Catalogs and table resolution:
- Ryan: Table resolution with catalogs is getting complicated when
namespaces overlap. If an identifier has a catalog, then it is easy to use
a v2 catalog. But when the identifier does not have a catalog, there is a
namespace overlap between session catalog tables and the default v2 catalog
tables. It would be much easier to understand and document if we used a
simple rule for precedence. We suggest using the session catalog unless a
default v2 catalog is defined, in which case the v2 catalog is used by
default.
- This makes the behavior easy to document and reason about, with few
special cases. To guarantee compatibility, we will need a v2 implementation
that delegates to the session catalog.
- Ryan: If there aren’t objections, I’ll raise this on the dev list.
We should make a decision there.
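
For reference, the proposed precedence rule can be written as a small
function. This is a minimal sketch, not Spark's resolution code: the
resolve_catalog name, its parameters, and the "session" label are all
assumptions made for illustration.

```python
def resolve_catalog(name_parts, catalogs, default_v2_catalog=None):
    """Return (catalog_name, remaining_identifier) for a multi-part table name.

    name_parts         -- identifier split on dots, e.g. ["prod", "db", "t"]
    catalogs           -- set of registered v2 catalog names (assumed input)
    default_v2_catalog -- name of the default v2 catalog, or None if undefined
    """
    # 1. The identifier names a catalog explicitly: use that v2 catalog.
    if len(name_parts) > 1 and name_parts[0] in catalogs:
        return name_parts[0], name_parts[1:]
    # 2. No catalog in the identifier, but a default v2 catalog is defined.
    if default_v2_catalog is not None:
        return default_v2_catalog, name_parts
    # 3. Otherwise fall back to the session catalog.
    return "session", name_parts
```

With catalogs = {"prod"}, the name prod.db.t resolves to the prod catalog,
while db.t resolves to the default v2 catalog when one is defined and to
the session catalog otherwise.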
--
Ryan Blue
Software Engineer
Netflix