Here are my notes from last week's DSv2 sync. *Attendees*:
Ryan Blue
Terry Kim
Wenchen Fan

*Topics*:

- SchemaPruning only supports Parquet and ORC?
- Out-of-order optimizer rules
- 3.0 work
  - Rename session catalog to spark_catalog
  - Finish TableProvider update to avoid another API change: pass all table config from metastore
  - Catalog behavior fix: https://issues.apache.org/jira/browse/SPARK-29014
  - Stats push-down optimization: https://github.com/apache/spark/pull/25955
  - DataFrameWriter v1/v2 compatibility progress
- Open PRs
  - Update identifier resolution and table resolution: https://github.com/apache/spark/pull/25747
  - Expose SerializableConfiguration: https://github.com/apache/spark/pull/26005
  - Early DSv2 pushdown: https://github.com/apache/spark/pull/25955

*Discussion*:

- Update identifier and table resolution
  - Wenchen: will not handle SPARK-29014; it is a pure refactor
  - Ryan: I think this should separate the v2 rules from the v1 fallback, to keep table and identifier resolution separate. The only time that table resolution needs to be done at the same time is for v1 fallback.
  - This was merged last week
- Update to use spark_catalog
  - Wenchen: this will be a separate PR.
  - Now open: https://github.com/apache/spark/pull/26071
- Early DSv2 pushdown
  - Ryan: this depends on fixing a few more tests. To validate that there are no calls to computeStats with the DSv2 relation, I’ve temporarily removed the method. Other than a few remaining test failures where the old relation was expected, it looks like there are no uses of computeStats before early pushdown in the optimizer.
  - Wenchen: agreed that the batch was in the correct place in the optimizer
  - Ryan: once tests are passing, I will add the computeStats implementation back with a Utils.isTesting check, to fail during testing when called before early pushdown, but not fail at runtime
  - Wenchen: when using v2, there is no way to configure custom options for a JDBC table.
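The computeStats guard Ryan describes for early pushdown could look roughly like this (a minimal sketch with hypothetical class and field names, not Spark's actual implementation; only the Utils.isTesting pattern is from the discussion):

```scala
// Sketch: fail fast in tests if stats are requested before early
// pushdown has run, but degrade to a conservative default at runtime.
object Utils {
  // Spark's real check is based on a system property; simplified here.
  def isTesting: Boolean = sys.props.contains("spark.testing")
}

case class Statistics(sizeInBytes: BigInt)

// Hypothetical stand-in for the DSv2 relation under discussion.
class DataSourceV2Relation(pushdownDone: Boolean) {
  def computeStats(): Statistics = {
    if (!pushdownDone && Utils.isTesting) {
      // Surfaces out-of-order optimizer rules during testing only.
      throw new IllegalStateException(
        "computeStats called before early pushdown")
    }
    // Outside of tests, fall back to unknown (max) size.
    Statistics(sizeInBytes = BigInt(Long.MaxValue))
  }
}
```

The point of the split behavior is that misordered rules are caught by CI without risking runtime failures for users.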
    For v1, the table was created and stored in the session catalog, at which point Spark-specific properties like parallelism could be stored. In v2, the catalog is the source of truth, so tables don’t get created in the same way; options are only passed in a create statement.
  - Ryan: this could be fixed by allowing users to pass options as table properties. We mix the two today, but if we used a prefix for table properties, “options.”, then you could use SET TBLPROPERTIES to get around this. That’s also better for compatibility. I’ll open a PR for this.
  - Ryan: this could also be solved by adding an OPTIONS clause or hint to SELECT
  - Wenchen: there are commands without v2 statements. We should add v2 statements to reject non-v1 uses.
  - Ryan: doesn’t the parser only parse up to 2-part identifiers for these? That would handle the majority of cases.
  - Wenchen: yes, but there is still a problem for 1-part identifiers in v2 catalogs, like catalog.table. Commands that don’t support v2 will resolve catalog.table in the v1 catalog instead.
  - Ryan: sounds like a good plan to update the parser and add statements for these. Do we have a list of commands to update?
  - Wenchen: REFRESH TABLE, ANALYZE TABLE, ALTER TABLE PARTITION, etc. Will open an umbrella JIRA with a list.

--
Ryan Blue
Software Engineer
Netflix
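The “options.” prefix Ryan proposes could be sketched as a small helper that splits a table’s property map into plain properties and options (hypothetical helper and names, not a Spark API; only the prefix convention is from the discussion):

```scala
// Sketch: table properties whose keys start with "options." are exposed
// as source options, so SET TBLPROPERTIES can update them after create.
object OptionsPrefix {
  val Prefix = "options."

  // Returns (plain table properties, options with the prefix stripped).
  def split(properties: Map[String, String])
      : (Map[String, String], Map[String, String]) = {
    val (prefixed, plain) =
      properties.partition { case (key, _) => key.startsWith(Prefix) }
    (plain, prefixed.map { case (key, value) =>
      (key.stripPrefix(Prefix), value)
    })
  }
}
```

Because the options live in the regular property map, existing property commands work unchanged, which is the compatibility benefit mentioned above.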