[ https://issues.apache.org/jira/browse/SPARK-41236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17638493#comment-17638493 ]
huldar chen edited comment on SPARK-41236 at 11/25/22 4:45 AM: --------------------------------------------------------------- If the aggregated column to be queried is renamed to the name of an existing column in the table, when parsing the aggregated column, it will be bound to the original column ID in the table, resulting in an exception. Should be parsed from aggregateExpressions first. like cases case 1: {code:java} select collect_set(testdata2.a) as a from testdata2 group by b having size(a) > 0 {code} analyzedPlan is: {code:java} 'Filter (size(tempresolvedcolumn(a#3, a), true) > 0) +- Aggregate [b#4], [collect_set(a#3, 0, 0) AS a#44] +- SubqueryAlias testdata2 +- View (`testData2`, [a#3,b#4]) +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4] +- ExternalRDD [obj#2] {code} tempresolvedcolumn(a#3, a) should bind a#44. case 2: {code:java} select collect_set(testdata2.a) as b from testdata2 group by b having size(b) > 0 {code} analyzedPlan is: {code:java} 'Project [b#44] +- 'Filter (size(tempresolvedcolumn(b#4, b), true) > 0) +- Aggregate [b#4], [collect_set(a#3, 0, 0) AS b#44, b#4] +- SubqueryAlias testdata2 +- View (`testData2`, [a#3,b#4]) +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4] +- ExternalRDD [obj#2] {code} tempresolvedcolumn(b#4, b) should bind b#44. The buggy code is in org.apache.spark.sql.catalyst.analysis.Analyzer.ResolveAggregateFunctions#resolveExprsWithAggregate# resolveCol was (Author: huldar): If the aggregated column to be queried is renamed to the name of an existing column in the table, when parsing the aggregated column, it will be bound to the original column ID in the table, resulting in an exception. Should be parsed from aggregateExpressions first. like cases case 1: {code:java} select collect_set(testdata2.a) as a from testdata2 group by b having size(a) > 0 {code} analyzedPlan is: {code:java} 'Filter (size(tempresolvedcolumn(a#3, a), true) > 0) +- Aggregate [b#4], [collect_set(a#3, 0, 0) AS a#44] +- SubqueryAlias testdata2 +- View (`testData2`, [a#3,b#4]) +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4] +- ExternalRDD [obj#2] {code} tempresolvedcolumn(a#3, a) should bind a#44. case 2: {code:java} select collect_set(testdata2.a) as b from testdata2 group by b having size(b) > 0 {code} analyzedPlan is: {code:java} 'Project [b#44] +- 'Filter (size(tempresolvedcolumn(b#4, b), true) > 0) +- Aggregate [b#4], [collect_set(a#3, 0, 0) AS b#44, b#4] +- SubqueryAlias testdata2 +- View (`testData2`, [a#3,b#4]) +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#4] +- ExternalRDD [obj#2] {code} tempresolvedcolumn(b#4, b) should bind b#44. The buggy code is inorg.apache.spark.sql.catalyst.analysis.Analyzer.ResolveAggregateFunctions#resolveExprsWithAggregate# resolveCol > The renamed field name cannot be recognized after group filtering > ----------------------------------------------------------------- > > Key: SPARK-41236 > URL: https://issues.apache.org/jira/browse/SPARK-41236 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.2.0 > Reporter: jingxiong zhong > Priority: Major > > {code:java} > select collect_set(age) as age > from db_table.table1 > group by name > having size(age) > 1 > {code} > a simple sql, it work well in spark2.4, but doesn't work in spark3.2.0 > Is it a bug or a new standard? > h3. *like this:* > {code:sql} > create db1.table1(age int, name string); > insert into db1.table1 values(1, 'a'); > insert into db1.table1 values(2, 'b'); > insert into db1.table1 values(3, 'c'); > --then run sql like this > select collect_set(age) as age from db1.table1 group by name having size(age) > > 1 ; > {code} > h3. Stack Information > org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input > columns: [age]; line 4 pos 12; > 'Filter (size('age, true) > 1) > +- Aggregate [name#2], [collect_set(age#1, 0, 0) AS age#0] > +- SubqueryAlias spark_catalog.db1.table1 > +- HiveTableRelation [`db1`.`table1`, > org.apache.hadoop.hive.ql.io.orc.OrcSerde, Data Cols: [age#1, name#2], > Partition Cols: []] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:179) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:175) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$2(TreeNode.scala:535) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:535) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1128) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1127) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:467) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:532) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1154) > at > org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1153) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.mapChildren(Expression.scala:555) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:532) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUpWithPruning$1(QueryPlan.scala:181) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:214) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:323) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:214) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUpWithPruning(QueryPlan.scala:181) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:161) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:175) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:263) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:172) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:196) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:192) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:88) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:88) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:78) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:98) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96) > at > org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613) -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org