[jira] [Updated] (SPARK-21538) Attribute resolution inconsistency in Dataset API
[ https://issues.apache.org/jira/browse/SPARK-21538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-21538:
    Affects Version/s: 2.3.0 (was: 3.0.0)

> Attribute resolution inconsistency in Dataset API
> -------------------------------------------------
>
>                 Key: SPARK-21538
>                 URL: https://issues.apache.org/jira/browse/SPARK-21538
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Adrian Ionescu
>
> {code}
> spark.range(1).withColumnRenamed("id", "x").sort(col("id")) // works
> spark.range(1).withColumnRenamed("id", "x").sort($"id")     // works
> spark.range(1).withColumnRenamed("id", "x").sort('id)       // works
> spark.range(1).withColumnRenamed("id", "x").sort("id")      // fails with:
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
> ...
> {code}
>
> It looks like the Dataset API functions taking a {{String}} use the basic
> resolver, which only looks at the columns at that level, whereas all the
> other ways of expressing an attribute are resolved lazily by the analyzer.
>
> The reason the first three calls work is explained in the docs for
> {{object ResolveMissingReferences}}:
> {code}
> /**
>  * In many dialects of SQL it is valid to sort by attributes that are not present in the SELECT
>  * clause. This rule detects such queries and adds the required attributes to the original
>  * projection, so that they will be available during sorting. Another projection is added to
>  * remove these attributes after sorting.
>  *
>  * The HAVING clause could also use grouping columns that are not present in the SELECT.
>  */
> {code}
>
> For consistency, it would be good to use the same attribute resolution
> mechanism everywhere.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
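The eager-versus-lazy distinction the report describes can be illustrated with a miniature, self-contained sketch. The `Plan`, `resolveEager`, and `resolveLazy` names below are hypothetical stand-ins, not Spark's actual internals: the top-level output of `withColumnRenamed` is `x`, but the child plan still produces `id`, so only a resolver that can reach into child plans (as the analyzer's `ResolveMissingReferences` rule does for `Column` expressions) finds it.

```scala
// Hypothetical miniature model of attribute resolution (not Spark's classes).
// A "plan" is just the list of output column names at each level.
case class Plan(output: Seq[String], child: Option[Plan] = None)

// Eager resolution, like the Dataset methods taking a String:
// only the top-level output is consulted.
def resolveEager(plan: Plan, name: String): Option[String] =
  plan.output.find(_ == name)

// Lazy resolution, like the analyzer for Column expressions:
// a missing attribute may be pulled up from a child plan.
def resolveLazy(plan: Plan, name: String): Option[String] =
  plan.output.find(_ == name)
    .orElse(plan.child.flatMap(resolveLazy(_, name)))

// range(1).withColumnRenamed("id", "x"): top-level output is "x",
// but the underlying plan still produces "id".
val renamed = Plan(Seq("x"), Some(Plan(Seq("id"))))

resolveEager(renamed, "id") // None       -> AnalysisException in Spark
resolveLazy(renamed, "id")  // Some("id") -> sort succeeds
```

Under this model, unifying the two code paths would mean routing the `String` overloads through the lazy resolver as well, which is what the report asks for.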
[jira] [Updated] (SPARK-21538) Attribute resolution inconsistency in Dataset API

[ https://issues.apache.org/jira/browse/SPARK-21538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-21538:
    Issue Type: Improvement (was: Story)