Here are my notes from last week's DSv2 sync. *Attendees*:
Ryan Blue
Terry Kim
Wenchen Fan

*Topics*:

- SchemaPruning only supports Parquet and ORC?
- Out-of-order optimizer rules
- 3.0 work
  - Rename session catalog to spark_catalog
  - Finish TableProvider update to avoid another API change: pass all table config from metastore
  - Catalog behavior fix: https://issues.apache.org/jira/browse/SPARK-29014
  - Stats push-down optimization: https://github.com/apache/spark/pull/25955
  - DataFrameWriter v1/v2 compatibility progress
- Open PRs
  - Update identifier resolution and table resolution: https://github.com/apache/spark/pull/25747
  - Expose SerializableConfiguration: https://github.com/apache/spark/pull/26005
  - Early DSv2 pushdown: https://github.com/apache/spark/pull/25955

*Discussion*:

- Update identifier and table resolution
  - Wenchen: will not handle SPARK-29014; it is a pure refactor
  - Ryan: I think this should separate the v2 rules from the v1 fallback, to keep table and identifier resolution separate. The only time that table resolution needs to be done at the same time is for v1 fallback.
  - This was merged last week
- Update to use spark_catalog
  - Wenchen: this will be a separate PR.
  - Now open: https://github.com/apache/spark/pull/26071
- Early DSv2 pushdown
  - Ryan: this depends on fixing a few more tests. To validate that there are no calls to computeStats with the DSv2 relation, I’ve temporarily removed the method. Other than a few remaining test failures where the old relation was expected, it looks like there are no uses of computeStats before early pushdown in the optimizer.
  - Wenchen: agreed that the batch was in the correct place in the optimizer
  - Ryan: once tests are passing, I will add the computeStats implementation back with a Utils.isTesting check, to fail during testing when called before early pushdown, but not fail at runtime
  - Wenchen: when using v2, there is no way to configure custom options for a JDBC table.
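The computeStats guard Ryan describes for early pushdown could look roughly like this (a minimal sketch with hypothetical class and field names, not Spark's actual implementation; only the Utils.isTesting pattern is from the discussion):

```scala
// Sketch: fail fast in tests if stats are requested before early
// pushdown has run, but degrade to a conservative default at runtime.
object Utils {
  // Spark's real check is based on a system property; simplified here.
  def isTesting: Boolean = sys.props.contains("spark.testing")
}

case class Statistics(sizeInBytes: BigInt)

// Hypothetical stand-in for the DSv2 relation under discussion.
class DataSourceV2Relation(pushdownDone: Boolean) {
  def computeStats(): Statistics = {
    if (!pushdownDone && Utils.isTesting) {
      // Surfaces out-of-order optimizer rules during testing only.
      throw new IllegalStateException(
        "computeStats called before early pushdown")
    }
    // Outside of tests, fall back to unknown (max) size.
    Statistics(sizeInBytes = BigInt(Long.MaxValue))
  }
}
```

The point of the split behavior is that misordered rules are caught by CI without risking runtime failures for users.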
    For v1, the table was created and stored in the session catalog, at which point Spark-specific properties like parallelism could be stored. In v2, the catalog is the source of truth, so tables don’t get created in the same way; options are only passed in a create statement.
  - Ryan: this could be fixed by allowing users to pass options as table properties. We mix the two today, but if we used a prefix for table properties, “options.”, then you could use SET TBLPROPERTIES to get around this. That’s also better for compatibility. I’ll open a PR for this.
  - Ryan: this could also be solved by adding an OPTIONS clause or hint to SELECT
  - Wenchen: there are commands without v2 statements. We should add v2 statements to reject non-v1 uses.
  - Ryan: doesn’t the parser only parse up to 2-part identifiers for these? That would handle the majority of cases.
  - Wenchen: yes, but there is still a problem for 1-part identifiers in v2 catalogs, like catalog.table. Commands that don’t support v2 will resolve catalog.table in the v1 catalog instead.
  - Ryan: sounds like a good plan to update the parser and add statements for these. Do we have a list of commands to update?
  - Wenchen: REFRESH TABLE, ANALYZE TABLE, ALTER TABLE PARTITION, etc. Will open an umbrella JIRA with a list.

--
Ryan Blue
Software Engineer
Netflix
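The “options.” prefix Ryan proposes could be sketched as a small helper that splits a table’s property map into plain properties and options (hypothetical helper and names, not a Spark API; only the prefix convention is from the discussion):

```scala
// Sketch: table properties whose keys start with "options." are exposed
// as source options, so SET TBLPROPERTIES can update them after create.
object OptionsPrefix {
  val Prefix = "options."

  // Returns (plain table properties, options with the prefix stripped).
  def split(properties: Map[String, String])
      : (Map[String, String], Map[String, String]) = {
    val (prefixed, plain) =
      properties.partition { case (key, _) => key.startsWith(Prefix) }
    (plain, prefixed.map { case (key, value) =>
      (key.stripPrefix(Prefix), value)
    })
  }
}
```

Because the options live in the regular property map, existing property commands work unchanged, which is the compatibility benefit mentioned above.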