[ 
https://issues.apache.org/jira/browse/SPARK-40822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40822.
------------------------------
    Fix Version/s: 3.5.0
       Resolution: Fixed

Issue resolved by pull request 40126
[https://github.com/apache/spark/pull/40126]

> Use stable derived-column-alias algorithm, suitable for CREATE VIEW 
> --------------------------------------------------------------------
>
>                 Key: SPARK-40822
>                 URL: https://issues.apache.org/jira/browse/SPARK-40822
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: Serge Rielau
>            Assignee: Max Gekk
>            Priority: Major
>             Fix For: 3.5.0
>
>
> Spark has the ability to derive column aliases for expressions if no alias 
> was provided by the user.
> E.g.
> CREATE TABLE T(c1 INT, c2 INT);
> SELECT c1, `(c1 + 1)`, c3 FROM (SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T);
> This is a valuable feature. However, the current implementation works by 
> pretty printing the expression from the logical plan.  This has multiple 
> downsides:
>  * The derived names can be unintuitive, for example the brackets in 
> `(c1 + 1)`, or outright ugly, such as:
> SELECT `substr(hello, 1, 2147483647)` FROM (SELECT substr('hello', 1)) AS T;
>  * We cannot guarantee stability across versions since the logical plan of an 
> expression may change.
> The latter is a major reason why we cannot allow CREATE VIEW without a column 
> list except in "trivial" cases.
> CREATE VIEW v AS SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T;
> Not allowed to create a permanent view `spark_catalog`.`default`.`v` without 
> explicitly assigning an alias for expression (c1 + 1).
> There are two ways we can go about fixing this:
>  # Stop deriving column aliases from the expression. Instead generate unique 
> names such as `_col_1` based on their position in the select list. This is 
> ugly and takes away the "nice" headers on result sets.
>  # Move the derivation of the name upstream. That is, instead of 
> pretty-printing the logical plan, we pretty-print the lexer output, or a 
> sanitized version of the expression as typed.
> The statement as typed is stable by definition. The lexer is stable because 
> it has no reason to change. And if it ever did, we would have a better chance 
> to manage the change.
> In this feature we propose the following semantics:
>  # If the column alias can be trivially derived (some of these can stack), do 
> so:
>  ** A (qualified) column reference => the unqualified column identifier
> cat.sch.tab.col => col
>  ** A field reference => the fieldname
> struct.field1.field2 => field2
>  ** A cast(column AS type) => column
> cast(col1 AS INT) => col1
>  ** A map lookup with literal key => keyname
> map.key => key
> map['key'] => key
>  ** A parameterless function => unqualified function name
> current_schema() => current_schema
>  # Otherwise, take the lexer tokens of the expression, eliminate comments, 
> and concatenate them.
> foo(tab1.c1 + /* this is a plus*/
> 1) => `foo(tab1.c1+1)`
>  
> Of course we want this change under a config.
> If the config is set we can allow CREATE VIEW to exploit this and use the 
> derived expressions.
> PS: The exact mechanics of formatting the name are very much debatable 
> (e.g. spaces between tokens, squeezing out comments, upper-casing, preserving 
> quotes or double quotes...).
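
The semantics proposed in the quoted description can be illustrated with a
small Python sketch. This is not Spark's implementation: the tokenizer is a
deliberately simplified stand-in for the real SQL lexer, and every name in it
(tokenize, derive_alias, and the helpers) is hypothetical. It assumes that the
"trivial" rules are tried first and that token concatenation is the fallback.

```python
import re

# Hypothetical, simplified token pattern; the real Spark SQL lexer is far richer.
_TOKEN = re.compile(r"""
    (?P<comment>/\*.*?\*/|--[^\n]*)    # block and line comments
  | (?P<string>'(?:[^']|'')*')         # single-quoted string literal
  | (?P<ident>`[^`]+`|[A-Za-z_]\w*)    # quoted or plain identifier
  | (?P<number>\d+(?:\.\d+)?)          # integer or decimal literal
  | (?P<op><=|>=|<>|!=|\|\||[-+*/%(),.\[\]<>=])
  | (?P<ws>\s+)
""", re.VERBOSE | re.DOTALL)


def tokenize(expr):
    """Return (kind, text) pairs, dropping whitespace and comments."""
    pos, tokens = 0, []
    while pos < len(expr):
        m = _TOKEN.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {expr[pos]!r}")
        if m.lastgroup not in ("comment", "ws"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def _is_path(toks):
    """True for ident(.ident)*, e.g. cat.sch.tab.col or struct.f1.f2."""
    return (len(toks) % 2 == 1
            and all(kind == "ident" for kind, _ in toks[0::2])
            and all(text == "." for _, text in toks[1::2]))


def _matching_close(texts, open_idx):
    """Index of the ')' that closes the '(' at open_idx, or -1."""
    depth = 0
    for i in range(open_idx, len(texts)):
        if texts[i] == "(":
            depth += 1
        elif texts[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    return -1


def _trivial(toks):
    """Apply the 'trivial' rules; return None when none of them applies."""
    kinds = [kind for kind, _ in toks]
    texts = [text for _, text in toks]
    # cast(<expr> AS <type>) => alias of <expr> (rules can stack via recursion)
    if (len(toks) > 3 and texts[0].lower() == "cast" and texts[1] == "("
            and _matching_close(texts, 1) == len(toks) - 1):
        depth = 0
        for i in range(2, len(toks) - 1):
            if texts[i] == "(":
                depth += 1
            elif texts[i] == ")":
                depth -= 1
            elif depth == 0 and kinds[i] == "ident" and texts[i].lower() == "as":
                return _trivial(toks[2:i])
        return None
    # map['key'] => key, provided the part before the lookup is itself trivial
    if (len(toks) >= 4 and texts[-3] == "[" and kinds[-2] == "string"
            and texts[-1] == "]"):
        return texts[-2][1:-1] if _trivial(toks[:-3]) is not None else None
    # parameterless function => unqualified function name
    if len(toks) >= 3 and texts[-2:] == ["(", ")"] and _is_path(toks[:-2]):
        return texts[-3].strip("`")
    # (qualified) column, field reference, or map.key => last identifier
    if _is_path(toks):
        return texts[-1].strip("`")
    return None


def derive_alias(expr):
    """Trivial rules first; otherwise concatenate the comment-free tokens."""
    toks = tokenize(expr)
    name = _trivial(toks)
    return name if name is not None else "".join(text for _, text in toks)
```

With this sketch, derive_alias("cast(col1 AS INT)") gives col1, and the
commented foo(...) example from the description gives foo(tab1.c1+1). Because
the result depends only on the text as typed and not on the logical plan, it
would not change if the optimizer or analyzer rewrote the expression, which is
the stability property the proposal is after.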



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

