[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15629368#comment-15629368
 ] 

Nattavut Sutyanyong commented on SPARK-18209:
---------------------------------------------

I can think of another alternative, but IMHO it is a more convoluted option.

Suppose that, to save compilation time, we keep the LogicalPlan of the view in 
the metastore (I use the singular "metastore" for simplicity; this is not meant 
to rule out having more than one). Expanding the view then merely means 
attaching the already compiled LogicalPlan to the plan of the SQL statement 
that references the view. The problem we then face is managing the object 
dependencies and recompiling the view whenever any object it depends on changes 
its definition. Even worse, if the definition of class LogicalPlan changes in 
the future, we will need to migrate the LogicalPlans stored in the metastore to 
the new definition.

As such, I think storing the original SQL statement of the view definition in 
the metastore is a cleaner solution. No one would disagree that the SQL 
language is certainly more stable than the definition of class LogicalPlan.
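
To make the bookkeeping cost concrete, here is a minimal sketch (in Python, not 
Spark code) of what caching a compiled plan would require. Every name in it 
(Catalog, StoredView, compile_view, is_stale, expand) is hypothetical, the 
"plan" is just a string standing in for a serialized LogicalPlan, and the 
sketch assumes each metastore object carries a version number that is bumped 
whenever its definition changes.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy metastore: maps object name -> definition version (hypothetical)."""
    versions: dict = field(default_factory=dict)

    def alter(self, name: str) -> None:
        # Any change to an object's definition bumps its version.
        self.versions[name] = self.versions.get(name, 0) + 1


@dataclass
class StoredView:
    sql: str            # original view text, kept so the view can be recompiled
    compiled_plan: str  # stand-in for a serialized LogicalPlan
    deps: dict          # object name -> version seen at compile time


def compile_view(sql: str, dep_names: list, catalog: Catalog) -> StoredView:
    # Record the version of every dependency at compilation time.
    snapshot = {name: catalog.versions.get(name, 0) for name in dep_names}
    return StoredView(sql, f"plan({sql})", snapshot)


def is_stale(view: StoredView, catalog: Catalog) -> bool:
    # The cached plan is stale if any dependency changed since compilation.
    return any(catalog.versions.get(n, 0) != v for n, v in view.deps.items())


def expand(view: StoredView, catalog: Catalog) -> StoredView:
    # Reuse the cached plan when possible, otherwise recompile from the SQL.
    if is_stale(view, catalog):
        return compile_view(view.sql, list(view.deps), catalog)
    return view
```

Note that even this toy version has to keep the SQL text around in order to 
recompile, and it does not address migrating stored plans when the plan class 
itself changes.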

> More robust view canonicalization without full SQL expansion
> ------------------------------------------------------------
>
>                 Key: SPARK-18209
>                 URL: https://issues.apache.org/jira/browse/SPARK-18209
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error-prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combinations of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason the 
> broadcast join hint has taken forever to be merged: it is very difficult to 
> guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the context for the database as well as star expansion, I think we can do 
> this through a simpler approach: take the user-given SQL, analyze it, and 
> just wrap the original SQL with an outer SELECT clause and store the 
> database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> At parsing time, we expand the view using the provided database context.
> (We don't need to follow exactly the same hint, as I'm merely illustrating 
> the high level approach here.)
> Note that there is a chance that the schema of the underlying base table(s) 
> changes and the stored schema of the view then differs from the actual SQL 
> schema. In that case, I think we should throw an exception at runtime to 
> warn users. This exception can be controlled by a flag.
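
The wrapping step and the schema check described in the quoted proposal can be 
sketched as follows. This is an illustration in Python, not Spark's 
implementation: the function names wrap_view_sql and check_schema, the strict 
flag, and the use of ValueError are assumptions, and the hint syntax is copied 
from the example above, which the description itself says is only 
illustrative.

```python
def wrap_view_sql(original_sql: str, current_db: str, columns: list) -> str:
    """Wrap the user's SQL in an outer SELECT that pins the database context
    and the star-expanded column list (hypothetical helper)."""
    body = original_sql.strip().rstrip(";")
    cols = ", ".join(columns)
    return f"SELECT /*+ current_db: `{current_db}` */ {cols} FROM ({body})"


def check_schema(stored_columns: list, actual_columns: list,
                 strict: bool = True) -> bool:
    """Compare the schema stored with the view against the schema obtained by
    re-analyzing the SQL. The strict flag plays the role of the flag that
    controls whether a mismatch raises (assumed behavior)."""
    if list(stored_columns) == list(actual_columns):
        return True
    if strict:
        raise ValueError(
            f"view schema changed: stored {stored_columns}, "
            f"actual {actual_columns}"
        )
    return False


stored_sql = wrap_view_sql(
    "SELECT * FROM my_table WHERE id > 10;", "my_db", ["id", "name"]
)
```

Because the original SQL is kept verbatim inside the subquery, no SQL 
generation from the logical plan is needed; only the column list and the 
database name come from analysis.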



