Here are the latest DSv2 sync notes. Please reply with updates or
corrections.

*Attendees*:

Ryan Blue
Michael Armbrust
Gengliang Wang
Matt Cheah
John Zhuge

*Topics*:

   - Wenchen’s reorganization proposal
   - Problems with TableProvider - property map isn’t sufficient

*New PRs*:

   - ReplaceTable: https://github.com/apache/spark/pull/24798
   - V2 Table Resolution: https://github.com/apache/spark/pull/24741
   - V2 Session Catalog: https://github.com/apache/spark/pull/24768

*Discussion*:

   - Wenchen’s organization proposal
      - Ryan: Wenchen proposed using
      `org.apache.spark.sql.connector.{catalog, expressions, read, write,
      extensions}`
      - Ryan: I’m not sure we need extensions, but otherwise it looks good
      to me
      - Matt: This is in the catalyst module, right?
      - Ryan: Right. The API is in catalyst. The extensions package would
      be used for any parts that need to be in SQL, but hopefully there aren’t
      any.
      - Consensus was to go with the proposed organization (a sketch of the
      layout follows this item)
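For reference, a rough sketch of where the existing v2 interfaces would land
under the proposed packages, assuming they keep their current names (placement
here is illustrative, not final):

```scala
// Illustrative only: how the proposed org.apache.spark.sql.connector packages
// would be used from a source, assuming the interfaces keep their names.
import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.connector.write.WriteBuilder
// The extensions package would only hold pieces that must stay in the sql
// module; ideally it stays empty.
```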
   - Problems with TableProvider:
      - Gengliang: A table created by CREATE TABLE with the ORC v2 source
      can’t report its schema because there are no files yet
      - Ryan: We hit this when trying to use ORC in SQL unit tests for v2.
      The problem is that the source can’t be passed the schema and other
      information
      - Gengliang: Schema could be passed using the userSpecifiedSchema arg
      - Ryan: The user schema is for cases where the data lacks specific
      types and a user supplies them, like CSV. I don’t think it makes sense to
      reuse that to pass the schema from the catalog
      - Ryan: Other table metadata should be passed as well, like
      partitioning, so sources don’t infer it (a rough sketch of the idea is
      below). I think this requires some thought. Anyone want to volunteer?
      - No one volunteered to fix the problem
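To make the gap concrete, here is a hypothetical sketch (not the actual
TableProvider interface; names and signatures are made up for illustration) of
the kind of change discussed above, where the catalog passes schema and
partitioning to the source instead of asking the source to infer them:

```scala
// Hypothetical sketch only; stand-in types are declared here so the example
// is self-contained.
import org.apache.spark.sql.types.StructType

trait Table       // stand-in for the v2 Table interface
trait Transform   // stand-in for a partition transform expression

trait TableProviderSketch {
  // Today the source only receives a property map, so a table created by
  // CREATE TABLE with no data files has nothing to infer a schema from.
  def getTable(properties: java.util.Map[String, String]): Table

  // Sketch: the catalog hands over the metadata it already has, so the
  // source never needs to infer schema or partitioning from files.
  def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: java.util.Map[String, String]): Table
}
```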
   - ReplaceTable PR
      - Matt: Needs another update after comments, but about ready
      - Ryan: I agree it is almost ready to commit. I should point out that
      this includes a default implementation that will leave a table deleted if
      the write fails. I think this is expected behavior because REPLACE is a
      DROP combined with CTAS
      - Michael: Sources should be able to opt out of that behavior
      - Ryan: We want to ensure consistent behavior across sources
      - Resolution: sources can implement the staging API and throw an
      exception if they choose to opt out (sketched below)
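A minimal sketch of the opt-out path, assuming a staging-style hook along the
lines of the ReplaceTable PR (the interface and method names here are
illustrative, not the actual API):

```scala
// Illustrative sketch only; interface and method names are made up.
import org.apache.spark.sql.types.StructType

trait StagedTable  // stand-in for a staged, not-yet-committed table

trait StagingReplaceSketch {
  // Sources that implement staging control REPLACE TABLE atomicity
  // themselves instead of getting the default drop-then-create behavior.
  def stageReplace(
      name: String,
      schema: StructType,
      properties: java.util.Map[String, String]): StagedTable
}

class NoReplaceSource extends StagingReplaceSketch {
  // Opting out: refuse the operation rather than risk leaving the table
  // dropped if the replacement write fails.
  override def stageReplace(
      name: String,
      schema: StructType,
      properties: java.util.Map[String, String]): StagedTable =
    throw new UnsupportedOperationException("REPLACE TABLE is not supported")
}
```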
   - V2 table resolution:
      - John: this should be ready to go, only minor comments from Dongjoon
      left
      - This was merged the next day
   - V2 session catalog
      - Ryan: When testing, we realized that if a default catalog is used
      for v2 sources (like ORC v2), then you can run CREATE TABLE, which goes
      to some v2 catalog, but then you can’t load the same table using the
      same name because the session catalog doesn’t have it.
      - Ryan: To fix this, we need a v2 catalog that delegates to the
      session catalog (roughly sketched below). This should be used for all
      v2 operations when the session catalog can’t be used.
      - Ryan: Then the v2 default catalog should be used instead of the
      session catalog when it is set. This provides a smooth transition from
      the session catalog to v2 catalogs.
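A minimal sketch of the delegation idea, with made-up trait names standing in
for the real catalog interfaces:

```scala
// Illustrative sketch only; these traits stand in for the real v2 catalog
// and session catalog APIs.
trait Table

trait TableCatalogSketch {
  def loadTable(name: String): Table
}

// Stand-in for the existing (v1) session catalog.
trait SessionCatalogLike {
  def lookupTable(name: String): Table
}

// A v2 catalog that answers v2 calls from the session catalog's metadata, so
// a table created through the v2 path can be loaded again by the same name.
class SessionDelegatingCatalog(session: SessionCatalogLike)
    extends TableCatalogSketch {
  override def loadTable(name: String): Table = session.lookupTable(name)
}
```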
   - Gengliang: another topic: decimals
      - Gengliang: v2 doesn’t insert unsafe casts, so literals in SQL
      cannot be inserted into double/float columns (illustrated below)
      - Michael: Shouldn’t queries use decimal literals so that floating
      point literals can be floats? What do other databases do?
      - Matt: is this a v2 problem?
      - Ryan: this is not specific to v2 and was discovered when converting
      v1 to use the v2 output rules
      - Ryan: we could add a new decimal type that doesn’t lose data but is
      allowed to be cast because it can only be used for literals where the
      intended type is unknown. There is precedent for this in the parser with
      Hive char and varchar types.
      - Conclusion: This isn’t really a v2 problem
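A minimal illustration of the issue as described above (the table name and
format are made up; the point is that 1.1 parses as DECIMAL(2,1), and the v2
output rules only insert safe up-casts):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

spark.sql("CREATE TABLE t (x DOUBLE) USING orc")

// Rejected under the v2 output rules: decimal-to-double is not a
// guaranteed-safe up-cast, and v2 does not insert unsafe casts.
spark.sql("INSERT INTO t VALUES (1.1)")

// Making the literal's type explicit sidesteps the problem.
spark.sql("INSERT INTO t VALUES (CAST(1.1 AS DOUBLE))")
```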
   - Michael: Any work so far on MERGE INTO?
      - Ryan: Not yet, but feel free to make a proposal and start working
      - Ryan: Do you also need to pass extra metadata with each row?
      - Michael: No, this should be delegated to the source
      - Matt: That would be operator push-down
      - Ryan: I agree, that’s operator push-down. It would be great to hear
      how that would work, but I think MERGE INTO should have a default
      implementation. It should be supported across sources instead of in just
      one so we have a reference implementation.
      - Michael: Having only a reference implementation was the problem
      with v1. The behavior should be written down in a spec. Hive has a
      reasonable implementation to follow.
      - Ryan: Yes, but it is still valuable to have a reference
      implementation. And of course a spec is needed. (An example MERGE
      statement is shown below.)
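For context, the kind of statement under discussion (table and column names
are made up); the open question above is whether it gets a generic default
implementation plus a written spec, or per-source operator push-down:

```scala
// Example MERGE statement only; Spark does not run this yet, so it is kept
// as a string rather than passed to spark.sql.
val mergeSql =
  """MERGE INTO target t
    |USING updates s
    |ON t.id = s.id
    |WHEN MATCHED THEN UPDATE SET t.value = s.value
    |WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
    |""".stripMargin
```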
   - Matt: what does the roadmap look like for finishing in time for Spark
   3.0?
      - Ryan: Looking good if we get v2 table resolution soon. Then major
      features are ALTER TABLE, INSERT INTO, and the new v2 API
      - Ryan: There are also DESCRIBE TABLE, SHOW (TABLES/etc), and others
      that are low priority, but would be nice to get done. Those will require
      the namespace support PR. (Everyone, please contribute here!)

-- 
Ryan Blue
Software Engineer
Netflix
