DSv2 sync notes - 11 December 2019

Ryan Blue Thu, 19 Dec 2019 17:20:23 -0800

Hi everyone, here are my notes for the DSv2 sync last week. Sorry they’re
late! Feel free to add more details or corrections. Thanks!


rb

*Attendees*:

Ryan Blue
John Zhuge
Dongjoon Hyun
Joseph Torres
Kevin Yu
Russell Spitzer
Terry Kim
Wenchen Fan
Hyukjin Kwon
Jacky Lee

*Topics*:

   - Relation resolution behavior doc:
   https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit
   - Nested schema pruning for v2 (Dongjoon)
   - TableProvider changes
   - Tasks for Spark 3.0
   - Open PRs
      - Nested schema pruning: https://github.com/apache/spark/pull/26751
      - Support FIRST and AFTER in DDL:
      https://github.com/apache/spark/pull/26817
      - Add VACUUM
   - Spark 3.1 goals (if time)

*Discussion*:

   - User-specified schema handling
      - Burak: User-specified schema should
   - Relation resolution behavior (see doc link above)
      - Ryan: Thanks to Terry for fixing table resolution:
      https://github.com/apache/spark/pull/26684, next step is to clean up
      temp views
      - Terry: the idea is to always resolve identifiers the same way and
      not to resolve a temp view in some cases but not others. If an identifier
      is a temp view and is used in a context where you can’t use a view, it
      should fail instead of finding a table.
      - Ryan: does this need to be done by 3.0?
      - *Consensus was that it should be done for 3.0*
      - Ryan: not much activity on the dev list thread for this. Do we move
      forward anyway?
      - Wenchen: okay to fix because the scope is small
      - *Consensus was to go ahead and notify the dev list about changes*
      because this is a low-risk case that does not occur often (table and temp
      view conflict)
      - Burak: cached tables are similar: for insert you get the new
      results.
      - Ryan: is that the same issue or a similar problem to fix?
      - Burak: similar, it can be done separately
      - Ryan: does this also need to be fixed by 3.0?
      - Wenchen: Yes, it is a blocker. Spark should invalidate the cached
      table after a write.
      - Ryan: There’s another issue: how do we handle a permanent view with
      a name that resolves to a temp view? If this is handled incorrectly, it
      changes the results of a stored view.
      - Wenchen: This is currently broken, Spark will resolve the relation
      as a temp view. But Spark could use the analysis context to fix this.
      - Ryan: We should fix this when fixing temp views.
   - Nested schema pruning:
      - Dongjoon: Nested schema pruning was only done for Parquet and ORC
      instead of all v2 sources. Anton submitted a PR that fixes it.
      - At the time, the PR had +1s and was pending some minor discussion.
      It was merged the next day.
   - TableProvider changes:
      - Wenchen: Spark always calls 1 method to load a table. The
      implementation can do schema and partition inference in that method.
      Forcing this to be separated into other methods causes problems in the
      file source. FileIndex is used for all these tasks.
      - Ryan: I’m not sure that existing file source code is a good enough
      justification to change the proposed API. Seems too path dependent.
      - Ryan: It is also strange to have the source of truth for schema
      information differ between code paths. Some getTable uses would pass the
      schema to the source (from metastore) with TableProvider, but some would
      instead rely on the table from getTable to provide its own schema. This
      is confusing to implementers.
      - Burak: The default mode in DataFrameWriter is ErrorIfExists, which
      doesn’t currently work with v2 sources. Moving from Kafka to KafkaV2, for
      example, would probably break.
      - Ryan: So do we want to get extractCatalog and extractIdentifier
      into 3.0? Or is this blocked by the infer changes?
      - Burak: It would be good to have.
      - Wenchen: Schema may be inferred, or provided by Spark
      - Ryan: Sources should specify whether they accept a user-specified
      schema. But either way, the schema is still external and passed into the
      table. The main decision is whether all cases (inference included) should
      pass the schema into the table.
   - Tasks for 3.0
      - Decided to get temp view resolution fixed
      - Decided to get TableProvider changes in
      - extractCatalog/extractIdentifier are nice-to-have (but small)
      - Burak: Upgrading to v2 saveAsTable from DataFrameWriter v1 creates
      RTAS, but in Delta v1 would only overwrite the schema if requested. It
      would be nice to be able to select
      - Ryan: Standardizing behavior (replace vs truncate vs dynamic
      overwrite) is a main point of v2. Allowing sources to choose their own
      behavior is not supported in v2 so that we can guarantee consistent
      semantics across tables. Making a way for Delta to change its semantics
      doesn’t seem like a good idea. To have identical semantics, use Delta as
      a v1 source.
   - Spark 3.1 goals?
      - Ryan: I suggest metadata columns and metadata tables (e.g.
      partitions) for metadata query optimization
      - Wenchen: ViewCatalog and FunctionCatalog
      - Ryan: will send out a doc on FunctionCatalog when I can write it up
      - Ryan: also MERGE INTO. This will require returning a required
      distribution and sort order.
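
As a rough illustration of the temp view resolution rule Terry described in
the discussion (always resolve the same way, and fail rather than fall through
to a table), here is a minimal sketch. It is not Spark code: Session, resolve,
views_allowed, and AnalysisError are hypothetical names used only for this
example, and real resolution also involves catalogs and namespaces.

```python
from dataclasses import dataclass, field


class AnalysisError(Exception):
    """Hypothetical error for an identifier that cannot be used by a command."""


@dataclass
class Session:
    # Hypothetical stand-ins for the session's temp views and catalog tables.
    temp_views: set = field(default_factory=set)
    tables: set = field(default_factory=set)


def resolve(session, name, views_allowed):
    """Resolve a name the same way for every command."""
    # Temp views are always checked first, regardless of the command.
    if name in session.temp_views:
        if not views_allowed:
            # Fail here instead of silently finding a table with the same name.
            raise AnalysisError(f"{name} is a temp view and cannot be used here")
        return ("temp_view", name)
    if name in session.tables:
        return ("table", name)
    raise AnalysisError(f"{name} not found")
```

With a temp view and a table both named t, resolve(session, "t", True) returns
the temp view, while resolve(session, "t", False) raises instead of returning
the table. That is the table and temp view conflict case mentioned above.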

-- 
Ryan Blue
Software Engineer
Netflix
