[ https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753310#comment-16753310 ]
Kris Mok commented on SPARK-26741:
----------------------------------

Note that there's a similar issue with non-aggregate functions. Here's an example:
{code:none}
spark.sql("create table foo (id int, blob binary)")
val df = spark.sql("select length(blob) from foo where id = 1 order by length(blob) limit 10")
df.explain(true)
{code}
{code:none}
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Sort ['length('blob) ASC NULLS FIRST], true
      +- 'Project [unresolvedalias('length('blob), None)]
         +- 'Filter ('id = 1)
            +- 'UnresolvedRelation `foo`

== Analyzed Logical Plan ==
length(blob): int
GlobalLimit 10
+- LocalLimit 10
   +- Project [length(blob)#25]
      +- Sort [length(blob#24) ASC NULLS FIRST], true
         +- Project [length(blob#24) AS length(blob)#25, blob#24]
            +- Filter (id#23 = 1)
               +- SubqueryAlias `default`.`foo`
                  +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Project [length(blob)#25]
      +- Sort [length(blob#24) ASC NULLS FIRST], true
         +- Project [length(blob#24) AS length(blob)#25, blob#24]
            +- Filter (isnotnull(id#23) && (id#23 = 1))
               +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]

== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[length(blob#24) ASC NULLS FIRST], output=[length(blob)#25])
+- *(1) Project [length(blob#24) AS length(blob)#25, blob#24]
   +- *(1) Filter (isnotnull(id#23) && (id#23 = 1))
      +- Scan hive default.foo [blob#24, id#23], HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]
{code}
Note how the {{Sort}} operator evaluates {{length()}} again, even though the {{Project}} right below it already computes that same expression. The root cause of this problem in the Analyzer is the same as for the main example in this ticket, although this case is not as harmful (at least the query still runs...)
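As an aside, the redundant evaluation can be sidestepped from user code by projecting the expression under an explicit alias and sorting on that alias, so the {{Sort}} should resolve against the projected attribute instead of re-resolving the expression. A minimal sketch via the DataFrame API (the column name {{blob_len}} is made up for illustration):
{code:scala}
import org.apache.spark.sql.functions.{col, length}

// Compute length(blob) once under an explicit alias, then sort by the alias;
// the Sort then references the projected attribute blob_len rather than
// re-resolving length(blob) against the scan output.
val df2 = spark.table("foo")
  .where(col("id") === 1)
  .select(length(col("blob")).as("blob_len"))
  .orderBy("blob_len")
  .limit(10)
df2.explain(true)
{code}
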
> Analyzer incorrectly resolves aggregate function outside of Aggregate operators
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-26741
>                 URL: https://issues.apache.org/jira/browse/SPARK-26741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Kris Mok
>            Priority: Major
>
> The analyzer can sometimes hit issues with resolving functions. e.g.
> {code:sql}
> select max(id)
> from range(10)
> group by id
> having count(1) >= 1
> order by max(id)
> {code}
> The analyzed plan of this query is:
> {code:none}
> == Analyzed Logical Plan ==
> max(id): bigint
> Project [max(id)#91L]
> +- Sort [max(id#88L) ASC NULLS FIRST], true
>    +- Project [max(id)#91L, id#88L]
>       +- Filter (count(1)#93L >= cast(1 as bigint))
>          +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS count(1)#93L, id#88L]
>             +- Range (0, 10, step=1, splits=None)
> {code}
> Note how an aggregate function appears outside of any {{Aggregate}} operator in the fully analyzed plan:
> {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid.
> Trying to run this query will lead to weird issues in codegen, but the root cause is in the analyzer:
> {code:none}
> java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[1, bigint, false])
>   at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
>   at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
>   at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
>   at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:237)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152)
>   at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44)
>   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194)
>   at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.<init>(GenerateOrdering.scala:195)
>   at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.<init>(GenerateOrdering.scala:192)
>   at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291)
>   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2470)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2684)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:299)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:712)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:721)
> {code}
> The test case {{SPARK-23957 Remove redundant sort from subquery plan(scalar subquery)}} in {{SubquerySuite}} has been disabled because it hits this issue, which was caught by SPARK-26735. We should re-enable that test once this bug is fixed.
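
For reference, the invalid shape called out in the description ({{max(...)}} surviving in a {{Sort}} of the fully analyzed plan) can be detected mechanically with Catalyst's tree traversal. A minimal sketch, not Spark's actual plan validation; the helper name {{hasMisplacedAggregate}} is made up:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}

// Walk the plan tree and collect aggregate expressions from every operator
// that is not itself an Aggregate. Any hit means an aggregate function has
// leaked outside of Aggregate operators, as this ticket describes.
def hasMisplacedAggregate(plan: LogicalPlan): Boolean =
  plan.collect {
    case _: Aggregate => Seq.empty[AggregateExpression]
    case other =>
      other.expressions.flatMap(_.collect { case ae: AggregateExpression => ae })
  }.flatten.nonEmpty
{code}
Applied to {{df.queryExecution.analyzed}} for the query above, this should return {{true}}, since the {{Sort}}'s ordering still contains {{max(id#88L)}}.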