[ https://issues.apache.org/jira/browse/SPARK-40822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653428#comment-17653428 ]
Apache Spark commented on SPARK-40822: -------------------------------------- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/39332 > Use stable derived-column-alias algorithm, suitable for CREATE VIEW > -------------------------------------------------------------------- > > Key: SPARK-40822 > URL: https://issues.apache.org/jira/browse/SPARK-40822 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.4.0 > Reporter: Serge Rielau > Priority: Major > > Spark has the ability derive column aliases for expressions if no alias was > provided by the user. > E.g. > CREATE TABLE T(c1 INT, c2 INT); > SELECT c1, `(c1 + 1)`, c3 FROM (SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T); > This is a valuable feature. However, the current implementation works by > pretty printing the expression from the logical plan. This has multiple > downsides: > * The derived names can be unintuitive. For example the brackets in `(c1 + > 1)` or outright ugly, such as: > SELECT `substr(hello, 1, 2147483647)` FROM (SELECT substr('hello', 1)) AS T; > * We cannot guarantee stability across versions since the logical lan of an > expression may change. > The later is a major reason why we cannot allow CREATE VIEW without a column > list except in "trivial" cases. > CREATE VIEW v AS SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T; > Not allowed to create a permanent view `spark_catalog`.`default`.`v` without > explicitly assigning an alias for expression (c1 + 1). > There are two way we can go about fixing this: > # Stop deriving column aliases from the expression. Instead generate unique > names such as `_col_1` based on their position in the select list. This is > ugly and takes away the "nice" headers on result sets > # Move the derivation of the name upstream. That is instead of pretty > printing the logical plan we pretty print the lexer output, or a sanitized > version of the expression as typed. > The statement as typed is stable by definition. The lexer is stable because i > has no reason to change. And if it ever did we have a better chance to manage > the change. > In this feature we propose the following semantic: > # If the column alias can be trivially derived (some of these can stack), do > so: > ** a (qualified) column reference => the unqualified column identifier > cat.sch.tab.col => col > ** A field reference => the fieldname > struct.field1.field2 => field2 > ** A cast(column AS type) => column > cast(col1 AS INT) => col1 > ** A map lookup with literal key => keyname > map.key => key > map['key'] => key > ** A parameter less function => unqualified function name > current_schema() => current_schema > # Take the lexer tokens of the expression, eliminate comments, and append > them. > foo(tab1.c1 + /* this is a plus*/ > 1) => `foo(tab1.c1+1)` > > Of course we wan this change under a config. > If the config is set we can allow CREATE VIEW to exploit this and use the > derived expressions. > PS: The exact mechanics of formatting the name is very much debatable. > E.g.spaces between token, squeezing out comments - upper casing - preserving > quotes or double quotes...) -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org