Sounds like a good idea?

Would this be a step in the direction of supporting variation of the SQL 
dialect, too?


________________________________
From: Ryan Blue <rb...@netflix.com.invalid>
Sent: Thursday, October 4, 2018 8:56 AM
To: Spark Dev List
Subject: Spark SQL parser and DDL


Hi everyone,

I’ve been working on SQL DDL statements for v2 tables lately, including the 
proposed additions to drop, rename, and alter columns. The most recent update 
I’ve added is to allow transformation functions in the PARTITION BY clause to 
pass to v2 data sources. This allows sources like Iceberg to do partition 
pruning internally.

One of the difficulties has been that the SQL parser is coupled to the current 
logical plans and includes details that are specific to them. For example, data 
source table creation makes determinations like the EXTERNAL keyword is not 
allowed and instead the mode (external or managed) is set depending on whether 
a path is set. It also translates IF NOT EXISTS into a SaveMode and introduces 
a few other transformations.

The main problem with this is that converting the SQL plans produced by the 
parser to v2 plans requires interpreting these alterations and not the original 
SQL. Another consequence is that there are two parsers: AstBuilder in 
spark-catalyst and SparkSqlParser in spark-sql (core) because not all of the 
plans are available to the parser in the catalyst module.

I think it would be cleaner if we added a sql package with catalyst plans that 
carry the SQL options as they were parsed, and then convert those plans to 
specific implementations depending on the tables that are used. That makes 
support for v2 plans much cleaner by converting from a generic SQL plan instead 
of creating a v1 plan that assumes a data source table and then converting that 
to a v2 plan (playing telephone with logical plans).

This has simplified the work I’ve been doing to add PARTITION BY 
transformations. Instead of needing to add transformations to the CatalogTable 
metadata that’s used everywhere, this only required a change to the rule that 
converts from the parsed SQL plan to CatalogTable-based v1 plans. It is also 
cleaner to have the logic for converting to CatalogTable in DataSourceAnalysis 
instead of in the parser itself.

Are there objections to this approach for integrating v2 plans?

--
Ryan Blue
Software Engineer
Netflix

Reply via email to