Hi everyone, here are notes from the last DSv2 sync on 23 January 2019.

Here are the highlights:

   - Agreed that using v2 should not change behavior for file sources.
   (We cannot make this guarantee for all v1 sources.)
   - Reached consensus on the approach proposed on the dev list for
   identifying tables across multiple catalogs. Ryan will write an SPIP.
   - There seemed to be agreement on using mix-in traits to build the new
   catalog API (TableCatalog, FunctionCatalog, etc.).
   - Agreed to use small, well-defined changes in the alter table API
   call.

As usual, these notes are my summary. Feel free to send corrections or
clarifications.

*Topics*:

   - Ongoing pull requests:
      - #23606 <https://github.com/apache/spark/pull/23606>: Add overwrite
      plans
   - CatalogIdentifier proposal, continuing the discussion from the last sync
   and the thread on the dev list
   - Table changes proposed by the TableCatalog SPIP
   <https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit?ts=5c4e0b8c#heading=h.2fs3bpu4be7q>
   - Table metadata from the TableCatalog proposal

*Notes*:

   - Hyukjin asked about compatibility. He has had problems moving to v2.
      - The file API for DSv2 may have a compatibility problem (he didn’t
      include specifics and will post details to dev@)
      - From previous discussions: *goal for compatibility is to not change
      behavior.*
      - Ryan: How to avoid changing behavior is not very clear. I think
      that requires maintaining v1 and allowing users to switch to v1 when
      they need to. V2 standardizes behavior that is not consistent across
      sources in v1, so there is no way to avoid some behavior change to bring
      sources in line. The agreement was to standardize on the behavior of
      existing file sources.
   - Overwrite pull request, #23606
   <https://github.com/apache/spark/pull/23606>
      - Xiao noted that the PR was posted before tests were added and
      should include a “WIP” note [Ed: Fixed by adding tests]
      - Xiao asked for a link to the logical plans SPIP
      <https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d>
      in the description [Ed: Done]
      - Xiao: The PR uses dynamic partition replacement for
      DataFrameWriter.mode("overwrite").save(), but the current behavior of
      save is to truncate the table. [Ed: see the first sketch after the
      notes]
      - Ryan: Will test the current behavior and update the PR to match what
      the file source currently does. [Ed: Fixed. Xiao was right that the
      correct behavior is truncate.]
      - Ryan: anyone who wants to look at a complete implementation of
      DSv2 in Spark can look at Netflix’s branch. Email me for details. Eric
      asked to be added to the repository.
   - CatalogIdentifier proposal
      - Ryan: summarized the CatalogIdentifier proposal posted to the dev@
      list. The proposal allows multi-level namespaces, but avoids behavior
      where Spark would need to make multiple calls to traverse complex
      namespaces in external catalogs. [Ed: see the second sketch after the
      notes]
      - Consensus was that this proposal will satisfy concerns from the
      previous discussion. Ryan will write a doc for a vote to make this an
      SPIP.
      - Xiao: the SPIP should also cover column identifiers.
      - Ryan: not sure that column identifiers are needed. Table
      identification provides a search space; column identifiers are resolved
      in that space. That makes these separate concerns, though column
      identifier resolution is important.
      - Xiao: will send more information to the dev list.
      - Ryan: will take a look at the docs Xiao sends to determine whether
      column identifiers need to be included. Felix also suggested this.
      - Maryann: would the API include the ability to create a
      database/namespace? How would this handle namespaces that do not exist?
      - Ryan: we could add an API to create namespaces. The same approach
      should be taken here to avoid repeated calls. Spark should not be
      responsible for creating a chain of namespaces or checking a parent
      namespace’s existence. This should be left to plugin implementations.
      - Maryann: agreed that the implementation should determine how to
      implement create and drop.
      - Ryan: we might want to isolate namespace create/load/drop in a
      mix-in trait for catalogs that support namespaces. Not all catalog types
      will support namespace creation; e.g., UDF catalogs. Could also have a
      trait for namespace support, so Spark automatically rejects non-empty
      namespaces for catalogs that don’t support them.
   - Why split the catalog interface into TableCatalog, FunctionCatalog,
   etc.?
      - Maryann brought up this topic after the possibility of a
      NamespaceCatalog mix-in was raised.
      - Ryan: Not all catalogs will support all of these uses, and grouping
      support into mix-ins provides a simple check.
      - Maryann: Why not use a single interface that throws
      UnsupportedOperationException for unimplemented methods?
      - Ryan: Two problems with throwing UOE. First, for a catalog to
      support CTAS, it must implement 3 methods: exists, create, and drop. A
      catalog mix-in can make a base set of methods required, and Spark can
      check for that set by checking for the trait. [Ed: see the third sketch
      after the notes]
      - Ryan: Second problem with throwing UOE: if a catalog doesn’t
      support, for example, drop, then Spark doesn’t find out until it
      attempts to drop a table. This is too late because Spark calls drop to
      roll back a CTAS operation when a write fails after the table has been
      created. This means that Spark requires a lot of code paths to handle
      UOE, which is hard to maintain.
      - Matt: In other words, this is like a compile time vs runtime
      problem. Spark wouldn’t be able to catch problems before a query runs.
      - Ryan: Agree with the runtime analogy, but it also causes Spark to
      need more code that handles a catalog method throwing UOE at runtime.
      Spark should be able to assume a TableCatalog implements drop without
      throwing UOE [Ed: even if it is a noop].
      - Wenchen: Spark could use capabilities to avoid that drop problem by
      checking support for each required method before running a query.
      - Ryan: The capability approach requires Spark to maintain a list of
      capabilities required for each operation and requires the implementation
      to provide a list of supported methods. Both are difficult to maintain.
      If a plugin forgets to add a drop capability, Spark will not allow CTAS.
      Similarly, if Spark forgets to check an operation for a capability,
      that’s a problem. Maintaining compatibility in multiple places is going
      to lead to problems.
      - There seemed to be a *tentative consensus around using mix-in
      traits for catalogs.*
   - Follow up discussion on table changes for TableCatalog#alterTable:
      - Ryan: considered Maryann’s suggestion to support a change to
      replace all table columns, which may be easier for some sources. But
      this is ambiguous, so it would lead to unreliable behavior.
      - Maryann: agreed that replacing all columns is not needed.
      - Ryan: Glad we agree. Here’s an example to demonstrate for the
      group: consider replacing (a:int, b:int) with (x:int, y:int). That
      could be interpreted as 2 renames, or as 2 column deletes and 2 column
      adds. Using simple, well-defined operations would be better. [Ed: see
      the fourth sketch after the notes]
      - Xiao: Please update the SPIP to include all changes required to
      implement SQL at Netflix [Ed: Done]
      - Xiao: Please include a change to reorder columns also. [Ed: Done]
   - Table metadata fields: schema, partitioning, sorting, properties
      - Ryan: next time, we should talk about the set of metadata proposed
      for TableCatalog, but we’re out of time.
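
*Sketches* [Ed: added for illustration; names and signatures are simplified
and are not the final APIs]:

First sketch (overwrite behavior, #23606): the DataFrameWriter call Xiao
referred to, shown with the existing public API. With today’s v1 file
sources, "overwrite" truncates everything under the target path by default;
dynamic partition replacement is the opt-in v1 behavior controlled by
spark.sql.sources.partitionOverwriteMode. The path and data here are
placeholders.

    import org.apache.spark.sql.SparkSession

    object OverwriteExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("overwrite-example")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Opt-in v1 behavior: replace only the partitions present in the
        // input instead of truncating everything under the path.
        spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

        val df = Seq(("2019-01-23", 1), ("2019-01-23", 2)).toDF("day", "value")

        // The call discussed in the PR: with v1 file sources, overwrite mode
        // truncates by default; with the setting above, only matching
        // partitions are replaced.
        df.write
          .mode("overwrite")
          .partitionBy("day")
          .format("parquet")
          .save("/tmp/dsv2-sync-example/events")

        spark.stop()
      }
    }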
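
Second sketch (CatalogIdentifier proposal): the core idea is that an
identifier carries its full namespace as an ordered array that Spark hands to
the catalog in a single call, so the catalog resolves something like
prod.sales.orders itself instead of Spark traversing one namespace level at a
time. The namespace trait reflects the discussion with Maryann; all names
here are illustrative, not the proposal’s final API.

    // Illustrative identifier: the namespace is passed whole, in one call.
    case class Identifier(namespace: Array[String], name: String) {
      override def toString: String = (namespace :+ name).mkString(".")
    }

    // Hypothetical mix-in for catalogs that support namespace management.
    // Catalogs that don’t (e.g. a UDF-only catalog) simply don’t extend it.
    trait SupportsNamespaces {
      def namespaceExists(namespace: Array[String]): Boolean
      def createNamespace(namespace: Array[String]): Unit
      def dropNamespace(namespace: Array[String]): Boolean
    }

    object IdentifierExample extends App {
      // e.g. catalog.prod.sales.orders, resolved by the catalog in one call
      val ident = Identifier(Array("prod", "sales"), "orders")
      println(ident) // prints: prod.sales.orders
    }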
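
Third sketch (mix-in catalog traits vs. a single interface throwing UOE):
grouping the required methods into a mix-in lets Spark verify support with a
type check before planning a query, instead of discovering an
UnsupportedOperationException at runtime, for example when drop is needed to
roll back a failed CTAS. Trait and method names are simplified stand-ins for
the proposal.

    trait CatalogPlugin {
      def name: String
    }

    // A catalog that supports tables must provide the full base set of
    // methods, so CTAS can rely on exists, create, and drop all existing.
    trait TableCatalog extends CatalogPlugin {
      def tableExists(ident: String): Boolean
      def createTable(ident: String, schema: Map[String, String]): Unit
      def dropTable(ident: String): Boolean
    }

    trait FunctionCatalog extends CatalogPlugin {
      def loadFunction(ident: String): AnyRef
    }

    object MixInCheckExample {
      // Spark-side check, done up front: if the plugin does not mix in
      // TableCatalog, CTAS is rejected before anything runs, not during
      // rollback.
      def requireTableCatalog(catalog: CatalogPlugin): TableCatalog =
        catalog match {
          case tables: TableCatalog => tables
          case other =>
            throw new UnsupportedOperationException(
              s"Catalog ${other.name} does not support tables; cannot plan CTAS")
        }
    }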
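
Fourth sketch (small, well-defined table changes): each change is unambiguous
on its own, so the (a:int, b:int) to (x:int, y:int) example has to be spelled
out either as renames or as deletes plus adds; the catalog never has to
guess. The case classes are illustrative stand-ins for the changes in the
TableCatalog SPIP.

    sealed trait TableChange
    case class AddColumn(name: String, dataType: String) extends TableChange
    case class RenameColumn(name: String, newName: String) extends TableChange
    case class DeleteColumn(name: String) extends TableChange
    case class UpdateColumnType(name: String, newType: String) extends TableChange

    object AlterTableExample extends App {
      // Interpretation 1: two renames, existing data and types are kept.
      val asRenames: Seq[TableChange] =
        Seq(RenameColumn("a", "x"), RenameColumn("b", "y"))

      // Interpretation 2: drop both columns and add two new ones (data lost).
      val asDropAndAdd: Seq[TableChange] = Seq(
        DeleteColumn("a"), DeleteColumn("b"),
        AddColumn("x", "int"), AddColumn("y", "int"))

      println(asRenames)
      println(asDropAndAdd)
    }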

*Attendees*:

Ryan Blue
John Zhuge
Reynold Xin
Xiao Li
Dongjoon Hyun
Eric Wohlstadter
Hyukjin Kwon
Jacky Lee
Jamison Bennett
Kevin Yu
Yuanjian Li
Maryann Xue
Matt Cheah
Dale Richardson
-- 
Ryan Blue
Software Engineer
Netflix
