Here are my notes from this week’s sync.

*Attendees*:
Ryan Blue
John Zhuge
Dale Richardson
Gabor Somogyi
Matt Cheah
Yifei Huang
Xin Ren
Jose Torres
Gengliang Wang
Kevin Yu

*Topics*:
- Metadata columns or function push-down for Kafka v2 source
- Open PRs
  - REPLACE TABLE implementation: https://github.com/apache/spark/pull/24798
  - Add v2 AlterTable: https://github.com/apache/spark/pull/24937
  - Add v2 SessionCatalog: https://github.com/apache/spark/pull/24768
  - SupportsNamespaces PR: https://github.com/apache/spark/pull/24560
- Remaining SQL statements to implement
- IF NOT EXISTS for INSERT OVERWRITE
- V2 file source compatibility

*Discussion*:
- Metadata columns or function push-down for Kafka v2 source
  - Ryan: Kafka v1 source has more read columns than write columns. This is to expose metadata like partition, offset, and timestamp. Those are read columns, but not write columns, which is fine in v1. v2 requires a table schema.
  - Ryan: Two main options to fix this in v2: add metadata columns like Presto’s $file, or add function push-down similar to Spark’s input_file_name(). Metadata columns require less work (expose additional columns) but functions are more flexible (can call modified_time(col1))
  - Gabor: That’s a fair summary
  - Jose: the problem with input_file_name() is that it can be called anywhere, but is only valid in the context of a projection. After a group by, it returns empty string, which is odd. (See the sketch after this discussion item.)
  - Ryan: Couldn’t we handle that case using push-down? It is a function defined by a source that can only be run by pushing it down. It doesn’t exist after a group by, so analysis would fail if it were used in that context, just like columns don’t exist in the group by context unless they were in the grouping expression or created by aggregates.
  - Jose: That would work
  - Ryan: The metadata column approach takes less work, so I think we should do that unless someone has the time to drive the function push-down option.
  - Gabor: this is not required to move to v2. Currently working around this by not validating the schema. PR: https://github.com/apache/spark/pull/24738
  - Mostly consensus around using metadata column approach.
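To make the input_file_name() point concrete, here is a minimal sketch of the behavior Jose describes. The data path (/tmp/events), the id column, and the object name are made up for illustration; the point is only that the function returns real paths when evaluated in a projection over a file scan, but silently yields an empty string once the plan has gone through an aggregation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, input_file_name, lit}

object InputFileNameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("input-file-name-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical Parquet data with an `id` column; any file-based source behaves the same way.
    val events = spark.read.parquet("/tmp/events")

    // Evaluated in a projection over the scan, input_file_name() returns the
    // actual source path for each row.
    events.select($"id", input_file_name().as("source_file")).show(false)

    // After a group by there is no file scan in the running stage, so the same
    // call returns an empty string instead of failing analysis.
    events.groupBy($"id")
      .agg(count(lit(1)).as("n"))
      .select($"id", $"n", input_file_name().as("source_file"))
      .show(false)

    spark.stop()
  }
}
```

Under the push-down option Ryan describes, the second query would instead fail analysis, the same way referencing a column that is not in the grouping expressions does.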
- REPLACE TABLE PR:
  - Matt: this is mostly ready, just waiting for final reviews
- AlterTable PR:
  - Gengliang: should this handle complex updates, like replacing a struct with a different struct?
  - Ryan: You’re right, that doesn’t make sense. I’ll update the PR [Note: done]
- V2 session catalog PR:
  - Ryan: We talked about this last time. Any objections?
  - Jose: No, this is blocking us right now
- SupportsNamespaces PR:
  - Ryan: Please look at this, it blocks remaining SQL statements like SHOW NAMESPACES, etc.
- Remaining SQL statements:
  - Ryan: presented a list of remaining SQL statements that need to be implemented
  - Important statements (for Spark 3.0):
    - DESCRIBE [FORMATTED|EXTENDED] [TABLE] ...
    - REFRESH TABLE ...
    - SHOW TABLES [IN catalog] [LIKE ...]
    - USE CATALOG ... to set the default catalog
  - Other missing SQL statements; most depend on SupportsNamespaces PR:
    - DESCRIBE [EXTENDED] (NAMESPACE|DATABASE) ...
    - SHOW (NAMESPACES|DATABASES) [IN catalog] [LIKE ...]
    - CREATE (NAMESPACE|DATABASE) ... [PROPERTIES (...)]
    - DROP (NAMESPACE|DATABASE) ...
    - USE ... [IN catalog] to set current namespace in a catalog
  - Matt offered to implement DESCRIBE TABLE
- IF NOT EXISTS with INSERT INTO
  - John: This validates that overwrite does not overwrite partitions and is append only. Should this be supported?
  - Consensus was “why not?” Will add a mix-in trait in a follow-up for sources that choose to implement it
- File source compatibility
  - Ryan: I tried to use built-in sources like Parquet in SQL tests and hit problems. Not being able to pass a schema or table partitioning means that these tables won’t behave right. What is the plan to get these sources working with SQL?
  - No one has time to work on this
  - Ryan: I’ll write some tests to at least set out the contract so we know when the built-in sources are ready to be used.

--
Ryan Blue
Software Engineer
Netflix