Hi everyone, here are my notes from the last DSv2 sync on 23 January 2019. The highlights:
- Agreed that using v2 should not change behavior for file sources. (We cannot make this guarantee for all v1 sources.)
- Consensus for the approach proposed on the dev list for identifying tables when multiple catalogs are in use. Ryan will write an SPIP.
- There seemed to be agreement on using mix-in traits to build the new catalog API (TableCatalog, FunctionCatalog, etc.).
- Agreement on using small, well-defined changes in the alter table API call.

As usual, this is my summary. Feel free to send corrections or clarifications.

*Topics*:

- Ongoing pull requests:
  - #23606 <https://github.com/apache/spark/pull/23606>: Add overwrite plans
- CatalogIdentifier proposal, continuing the last discussion and the thread on the dev list
- Table changes proposed by the TableCatalog SPIP <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit?ts=5c4e0b8c#heading=h.2fs3bpu4be7q>
- Table metadata from the TableCatalog proposal

*Notes*:

- Hyukjin asked about compatibility; he has had problems moving to v2.
  - The file API for DSv2 may have a compatibility problem (he didn't include specifics and will post details to dev@).
  - From previous discussions: *the goal for compatibility is to not change behavior.*
  - Ryan: How to avoid changing behavior is not very clear. I think that requires maintaining v1 and allowing users to switch back to v1 when they need to. V2 standardizes behavior that is not consistent across sources in v1, so there is no way to avoid some behavior change to bring sources in line. The agreement was to standardize on the behavior of existing file sources.
- Overwrite pull request, #23606 <https://github.com/apache/spark/pull/23606>:
  - Xiao noted that the PR was posted before tests were added and should have included a "WIP" note. [Ed: Fixed by adding tests]
  - Xiao asked for a link to the logical plans SPIP <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d> in the description. [Ed: Done]
  - Xiao: The PR uses dynamic partition replacement for DataFrameWriter.mode("overwrite").save(), but the current behavior of save is to truncate the table.
  - Ryan: Will test the current behavior and update the PR to match what the file source currently does. [Ed: Fixed. Xiao was right that the correct behavior is truncate.]
  - Ryan: Anyone who wants to look at a complete implementation of DSv2 in Spark can look at Netflix's branch; email me for details. Eric asked to be added to the repository.
- CatalogIdentifier proposal:
  - Ryan summarized the CatalogIdentifier proposal posted to the dev@ list. The proposal allows multi-level namespaces, but avoids behavior where Spark would need to make multiple calls to traverse complex namespaces in external catalogs.
  - Consensus was that this proposal will satisfy concerns from the previous discussion. Ryan will write a doc for a vote to make this an SPIP.
  - Xiao: The SPIP should also cover column identifiers.
  - Ryan: Not sure that column identifiers are needed. Table identification provides a search space, and column identifiers are resolved in that space. That makes these separate concerns, though column identifier resolution is important.
  - Xiao: Will send more information to the dev list.
  - Ryan: Will take a look at the docs Xiao sends to determine whether this needs to be included. Felix also suggested this.
  - Maryann: Would the API include the ability to create a database/namespace? How would this handle namespaces that do not exist?
  - Ryan: We could add an API to create namespaces.
  - Ryan: The same approach should be taken here to avoid repeated calls. Spark should not be responsible for creating a chain of namespaces or for checking a parent namespace's existence; this should be left to plugin implementations.
  - Maryann: Agreed that the implementation should determine how to implement create and drop.
  - Ryan: We might want to isolate namespace create/load/drop in a mix-in trait for catalogs that support namespaces. Not all catalog types will support namespace creation, e.g., UDF catalogs. We could also have a trait for namespace support, so Spark automatically rejects non-empty namespaces for catalogs.
- Why split the catalog interface into TableCatalog, FunctionCatalog, etc.?
  - Maryann brought this topic up after the discussion of a possible NamespaceCatalog mix-in.
  - Ryan: Not all catalogs will support all of these uses, and grouping support into mix-ins provides a simple check.
  - Maryann: Why not use a single interface that throws UnsupportedOperationException (UOE) for unimplemented methods?
  - Ryan: There are two problems with throwing UOE. First, if a catalog is to support CTAS, it must implement three methods: exists, create, and drop. A catalog mix-in can make that base set of methods required, and Spark can check for the whole set just by checking for the trait.
  - Ryan: Second, if a catalog doesn't support, for example, drop, then Spark doesn't find out until it attempts to drop a table. That is too late, because Spark calls drop to roll back a CTAS operation when a write fails after the table is created. This means Spark requires a lot of code paths that handle UOE, which is hard to maintain.
  - Matt: In other words, this is like a compile-time vs. runtime problem: Spark wouldn't be able to catch problems before a query runs.
  - Ryan: Agree with the runtime analogy, but it also forces Spark to carry more code that handles a catalog method throwing UOE at runtime.
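To illustrate the mix-in idea under discussion, here is a minimal Java sketch. The interface and method names below are illustrative placeholders, not the actual Spark API: it only shows how a trait check up front replaces scattered UOE handling.

```java
// Illustrative sketch only; names are hypothetical, not Spark's final API.

// Base interface that every catalog plugin implements.
interface CatalogPlugin {
    String name();
}

// Mix-in: implementing this interface signals that exists/create/drop are
// all supported, so Spark can plan CTAS (including its rollback) up front.
interface TableCatalog extends CatalogPlugin {
    boolean tableExists(String ident);
    void createTable(String ident);
    void dropTable(String ident);
}

class CtasPlanner {
    // Spark checks the trait once, before running the query, instead of
    // handling UnsupportedOperationException at each call site.
    static TableCatalog requireTableCatalog(CatalogPlugin catalog) {
        if (catalog instanceof TableCatalog) {
            return (TableCatalog) catalog;
        }
        throw new IllegalArgumentException(
            "Catalog " + catalog.name() + " does not support tables");
    }
}
```

With this shape, a catalog that cannot drop tables simply does not implement the mix-in, and a CTAS against it fails at analysis time rather than during rollback.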
  - Ryan: Spark should be able to assume a TableCatalog implements drop without throwing UOE [Ed: even if it is a noop].
  - Wenchen: Spark could use capabilities to avoid that drop problem by checking support for each required method before running a query.
  - Ryan: The capability approach requires Spark to maintain a list of capabilities required for each operation, and requires the implementation to provide a list of supported methods. Both are difficult to maintain. If a plugin forgets to add a drop capability, Spark will not allow CTAS. Similarly, if Spark forgets to check an operation for a capability, that's a problem. Maintaining compatibility in multiple places is going to lead to problems.
  - There seemed to be a *tentative consensus around using mix-in traits for catalogs.*
- Follow-up discussion on table changes for TableCatalog#alterTable:
  - Ryan: Considered Maryann's suggestion to support a change that replaces all table columns, which may be easier for some sources. But this is ambiguous, so it would lead to unreliable behavior.
  - Maryann: Agreed that replacing all columns is not needed.
  - Ryan: Glad we agree. Here's an example to demonstrate for the group: consider replacing (a:int, b:int) with (x:int, y:int). That could be interpreted as two renames, or as two column deletes and two column adds. Using simple, well-defined operations would be better.
  - Xiao: Please update the SPIP to include all changes required to implement SQL at Netflix. [Ed: Done]
  - Xiao: Please also include a change to reorder columns. [Ed: Done]
- Table metadata fields: schema, partitioning, sorting, properties.
  - Ryan: Next time we should talk about the set of metadata proposed for TableCatalog, but we're out of time.

*Attendees*:

Ryan Blue
John Zhuge
Reynold Xin
Xiao Li
Dongjoon Hyun
Eric Wohlstadter
Hyukjin Kwon
Jacky Lee
Jamison Bennett
Kevin Yu
Yuanjian Li
Maryann Xue
Matt Cheah
Dale Richardson

-- 
Ryan Blue
Software Engineer
Netflix