Hi everyone, here are my notes from the last DSv2 sync on 23 January 2019. The highlights:
- Agreed that using v2 should not change behavior for file sources. (We cannot make this guarantee for all v1 sources.)
- Consensus for the approach proposed on the dev list for identifying tables when multiple catalogs are in use. Ryan will write an SPIP.
- There seemed to be agreement on using mix-in traits to build the new catalog API (TableCatalog, FunctionCatalog, etc.).
- Agreement on using small, well-defined changes in the alter table API call.

As usual, this is my summary. Feel free to send corrections or clarifications.

*Topics*:

- Ongoing pull requests:
  - #23606 <https://github.com/apache/spark/pull/23606>: Add overwrite plans
- CatalogIdentifier proposal, continuing the last discussion and the thread on the dev list
- Table changes proposed by the TableCatalog SPIP <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit?ts=5c4e0b8c#heading=h.2fs3bpu4be7q>
- Table metadata from the TableCatalog proposal

*Notes*:

- Hyukjin asked about compatibility; he has had problems moving to v2.
  - The file API for DSv2 may have a compatibility problem (he didn't include specifics and will post details to dev@).
  - From previous discussions: *the goal for compatibility is to not change behavior.*
  - Ryan: How to avoid changing behavior is not very clear. I think that requires maintaining v1 and allowing users to switch back to v1 when they need to. V2 standardizes behavior that is not consistent across sources in v1, so there is no way to avoid some behavior change to bring sources in line. The agreement was to standardize on the behavior of existing file sources.
- Overwrite pull request, #23606 <https://github.com/apache/spark/pull/23606>:
  - Xiao noted that the PR was posted before tests were added and should have included a "WIP" note. [Ed: Fixed by adding tests]
  - Xiao asked for a link to the logical plans SPIP <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d> in the description. [Ed: Done]
  - Xiao: The PR uses dynamic partition replacement for DataFrameWriter.mode("overwrite").save(), but the current behavior of save is to truncate the table.
  - Ryan: Will test the current behavior and update the PR to match what the file source currently does. [Ed: Fixed. Xiao was right that the correct behavior is truncate.]
  - Ryan: Anyone who wants to look at a complete implementation of DSv2 in Spark can look at Netflix's branch; email me for details. Eric asked to be added to the repository.
- CatalogIdentifier proposal:
  - Ryan summarized the CatalogIdentifier proposal posted to the dev@ list. The proposal allows multi-level namespaces, but avoids behavior where Spark would need to make multiple calls to traverse complex namespaces in external catalogs.
  - Consensus was that this proposal will satisfy concerns from the previous discussion. Ryan will write a doc for a vote to make this an SPIP.
  - Xiao: The SPIP should also cover column identifiers.
  - Ryan: Not sure that column identifiers are needed. Table identification provides a search space, and column identifiers are resolved in that space. That makes these separate concerns, though column identifier resolution is important.
  - Xiao: Will send more information to the dev list.
  - Ryan: Will take a look at the docs Xiao sends to determine whether this needs to be included. Felix also suggested this.
  - Maryann: Would the API include the ability to create a database/namespace? How would this handle namespaces that do not exist?
  - Ryan: We could add an API to create namespaces.
  - Ryan: The same approach should be taken here to avoid repeated calls. Spark should not be responsible for creating a chain of namespaces or for checking a parent namespace's existence; this should be left to plugin implementations.
  - Maryann: Agreed that the implementation should determine how to implement create and drop.
  - Ryan: We might want to isolate namespace create/load/drop in a mix-in trait for catalogs that support namespaces. Not all catalog types will support namespace creation, e.g., UDF catalogs. We could also have a trait for namespace support, so Spark automatically rejects non-empty namespaces for catalogs.
- Why split the catalog interface into TableCatalog, FunctionCatalog, etc.?
  - Maryann brought this topic up after the discussion of a possible NamespaceCatalog mix-in.
  - Ryan: Not all catalogs will support all of these uses, and grouping support into mix-ins provides a simple check.
  - Maryann: Why not use a single interface that throws UnsupportedOperationException (UOE) for unimplemented methods?
  - Ryan: There are two problems with throwing UOE. First, if a catalog is to support CTAS, it must implement three methods: exists, create, and drop. A catalog mix-in can make that base set of methods required, and Spark can check for the whole set just by checking for the trait.
  - Ryan: Second, if a catalog doesn't support, for example, drop, then Spark doesn't find out until it attempts to drop a table. That is too late, because Spark calls drop to roll back a CTAS operation when a write fails after the table is created. This means Spark requires a lot of code paths that handle UOE, which is hard to maintain.
  - Matt: In other words, this is like a compile-time vs. runtime problem: Spark wouldn't be able to catch problems before a query runs.
  - Ryan: Agree with the runtime analogy, but it also forces Spark to carry more code that handles a catalog method throwing UOE at runtime.
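To illustrate the mix-in idea under discussion, here is a minimal Java sketch. The interface and method names below are illustrative placeholders, not the actual Spark API: it only shows how a trait check up front replaces scattered UOE handling.

```java
// Illustrative sketch only; names are hypothetical, not Spark's final API.

// Base interface that every catalog plugin implements.
interface CatalogPlugin {
    String name();
}

// Mix-in: implementing this interface signals that exists/create/drop are
// all supported, so Spark can plan CTAS (including its rollback) up front.
interface TableCatalog extends CatalogPlugin {
    boolean tableExists(String ident);
    void createTable(String ident);
    void dropTable(String ident);
}

class CtasPlanner {
    // Spark checks the trait once, before running the query, instead of
    // handling UnsupportedOperationException at each call site.
    static TableCatalog requireTableCatalog(CatalogPlugin catalog) {
        if (catalog instanceof TableCatalog) {
            return (TableCatalog) catalog;
        }
        throw new IllegalArgumentException(
            "Catalog " + catalog.name() + " does not support tables");
    }
}
```

With this shape, a catalog that cannot drop tables simply does not implement the mix-in, and a CTAS against it fails at analysis time rather than during rollback.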
  - Ryan: Spark should be able to assume a TableCatalog implements drop without throwing UOE [Ed: even if it is a noop].
  - Wenchen: Spark could use capabilities to avoid that drop problem by checking support for each required method before running a query.
  - Ryan: The capability approach requires Spark to maintain a list of capabilities required for each operation, and requires the implementation to provide a list of supported methods. Both are difficult to maintain. If a plugin forgets to add a drop capability, Spark will not allow CTAS. Similarly, if Spark forgets to check an operation for a capability, that's a problem. Maintaining compatibility in multiple places is going to lead to problems.
  - There seemed to be a *tentative consensus around using mix-in traits for catalogs.*
- Follow-up discussion on table changes for TableCatalog#alterTable:
  - Ryan: Considered Maryann's suggestion to support a change that replaces all table columns, which may be easier for some sources. But this is ambiguous, so it would lead to unreliable behavior.
  - Maryann: Agreed that replacing all columns is not needed.
  - Ryan: Glad we agree. Here's an example to demonstrate for the group: consider replacing (a:int, b:int) with (x:int, y:int). That could be interpreted as two renames, or as two column deletes and two column adds. Using simple, well-defined operations would be better.
  - Xiao: Please update the SPIP to include all changes required to implement SQL at Netflix. [Ed: Done]
  - Xiao: Please also include a change to reorder columns. [Ed: Done]
- Table metadata fields: schema, partitioning, sorting, properties.
  - Ryan: Next time we should talk about the set of metadata proposed for TableCatalog, but we're out of time.

*Attendees*:

Ryan Blue
John Zhuge
Reynold Xin
Xiao Li
Dongjoon Hyun
Eric Wohlstadter
Hyukjin Kwon
Jacky Lee
Jamison Bennett
Kevin Yu
Yuanjian Li
Maryann Xue
Matt Cheah
Dale Richardson

-- 
Ryan Blue
Software Engineer
Netflix