Sorry these notes are so late; I didn't get to the write-up until now. As usual, if anyone has corrections or comments, please reply.
*Attendees*:
John Zhuge
Ryan Blue
Andrew Long
Wenchen Fan
Gengliang Wang
Russell Spitzer
Yuanjian Li
Yifei Huang
Matt Cheah
Amardeep Singh Dhilon
Zhilmil Dhion
Ryan Pifer

*Topics*:
- Should Spark require catalogs to report case sensitivity?
- Bucketing and sorting survey
- Add default v2 catalog: https://github.com/apache/spark/pull/24594
- SupportsNamespaces API: https://github.com/apache/spark/pull/24560
- FunctionCatalog API: https://github.com/apache/spark/pull/24559
- Skip output column resolution: https://github.com/apache/spark/pull/24469
- Move DSv2 into catalyst module: https://github.com/apache/spark/pull/24416
- Remove SupportsSaveMode: https://github.com/apache/spark/pull/24233

*Discussion*:
- Wenchen: When will we add select support?
  - John: working on resolution. DSv2 resolution is straightforward; the difficulty is ensuring a smooth transition from v1 to v2.
  - Ryan: table resolution will also be used for inserts. Once select is done, insert is next.
  - John: the PR may include insert as well
- Add default v2 catalog:
  - Ryan: A default catalog is needed for CTAS support when the source is v2
  - Ryan: A pass-through v2 catalog that uses SessionCatalog should be available as the default
- FunctionCatalog API:
  - Wenchen: this should have a design doc
  - Ryan: Agreed. The PR is for early discussion and prototyping.
- Bucketed joins: [Ed: I don’t remember much of this, feel free to expand on what was said]
  - Andrew: looks like there is lots of work to be done for bucketing. Sort removals aren’t done, and joining bucketed with non-bucketed tables still incurs hashing costs.
  - Ryan: work on support for Hive bucketing appears to have stopped, so it doesn’t look like this is an easy area to improve
  - Where should join optimization be done?
  - Andrew will create a prototype PR.
- Case sensitivity in catalogs: should catalogs report case sensitivity to Spark?
  - Ryan: catalogs connect to external systems, so Spark can’t impose case sensitivity requirements. A catalog is case sensitive or not, and could only be forced to violate Spark’s assumptions.
  - Ryan: requiring a catalog to report whether it is case sensitive doesn’t actually help Spark. If the catalog is case sensitive, then Spark should pass exactly what it received to avoid changing the meaning. If the catalog is case insensitive, then Spark can pass exactly what it received because case is handled in the catalog. Either way, Spark’s behavior doesn’t change.
  - Russell: not all catalogs are strictly case sensitive or case insensitive. Some are case insensitive unless an identifier is quoted; the quoted parts are case sensitive.
  - Ryan: So such a catalog would not be able to return true or false correctly.
  - Conclusion: Spark should pass identifiers exactly as it received them, without modification.

--
Ryan Blue
Software Engineer
Netflix
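To make the bucketed-join discussion concrete, here is a toy sketch in plain Java (an illustration only, not Spark's actual HashPartitioning or bucketing implementation) of why two tables bucketed identically on the join key can be joined bucket-by-bucket without a shuffle, while a non-bucketed side must first pay the hashing cost to be placed into matching buckets:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BucketedJoinSketch {
    // Rows with equal keys always hash to the same bucket number.
    static int bucketOf(String key, int numBuckets) {
        return Math.floorMod(key.hashCode(), numBuckets);
    }

    // The hashing step a non-bucketed table would have to pay at join time;
    // a pre-bucketed table has already done this at write time.
    static Map<Integer, List<String>> bucketRows(List<String> keys, int numBuckets) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String k : keys) {
            buckets.computeIfAbsent(bucketOf(k, numBuckets), b -> new ArrayList<>()).add(k);
        }
        return buckets;
    }

    // Join two identically bucketed tables one bucket at a time: matching keys
    // are guaranteed to be co-located, so no data movement is needed.
    static List<String> bucketedJoin(Map<Integer, List<String>> left,
                                     Map<Integer, List<String>> right) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : left.entrySet()) {
            List<String> rightRows = right.getOrDefault(e.getKey(), List.of());
            for (String key : e.getValue()) {
                if (rightRows.contains(key)) matches.add(key);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        Map<Integer, List<String>> left = bucketRows(List.of("a", "b"), numBuckets);
        Map<Integer, List<String>> right = bucketRows(List.of("a", "c"), numBuckets);
        System.out.println(bucketedJoin(left, right)); // [a]
    }
}
```

The sketch also shows why sort removal matters: if each bucket were additionally sorted on the key, the inner `contains` scan could become a merge, which is the further optimization that the notes say is not yet done.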
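The identifier pass-through conclusion can be sketched as follows (hypothetical lookup functions, not Spark's actual Identifier or TableCatalog API): whether a catalog is case sensitive or case insensitive, forwarding the identifier exactly as the user wrote it resolves correctly, while normalizing case on Spark's side breaks the case-sensitive catalog:

```java
import java.util.Map;
import java.util.Optional;

public class IdentifierPassThrough {
    // Case-sensitive catalog: exact-match lookup against stored names.
    static Optional<String> loadCaseSensitive(Map<String, String> tables, String ident) {
        return Optional.ofNullable(tables.get(ident));
    }

    // Case-insensitive catalog: normalizes internally, so any casing resolves.
    static Optional<String> loadCaseInsensitive(Map<String, String> tables, String ident) {
        return Optional.ofNullable(tables.get(ident.toLowerCase()));
    }

    public static void main(String[] args) {
        Map<String, String> cs = Map.of("Events", "v2-table"); // stores exact names
        Map<String, String> ci = Map.of("events", "v2-table"); // stores normalized names

        // Spark forwards exactly what the user typed: both catalogs resolve.
        System.out.println(loadCaseSensitive(cs, "Events").isPresent());   // true
        System.out.println(loadCaseInsensitive(ci, "EVENTS").isPresent()); // true

        // If Spark lower-cased identifiers first, the case-sensitive catalog
        // would fail to find the table, changing the meaning of the query.
        System.out.println(loadCaseSensitive(cs, "events").isPresent());   // false
    }
}
```

This is also why a reported true/false flag adds nothing: Spark's correct behavior (pass through unmodified) is the same in both branches, and as noted above, a quoting-sensitive catalog could not answer the question accurately anyway.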