Simeon Simeonov created SPARK-26084:
---------------------------------------

             Summary: AggregateExpression.references fails on unresolved expression trees
                 Key: SPARK-26084
                 URL: https://issues.apache.org/jira/browse/SPARK-26084
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: Simeon Simeonov

[SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a stable ordering in {{AttributeSet.toSeq}} using expression IDs ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128]) without noticing that {{AggregateExpression.references}} used {{AttributeSet.toSeq}} as a shortcut ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]). The net result is that {{AggregateExpression.references}} fails for unresolved aggregate functions, because unresolved attributes have no expression ID to sort by.

{code:scala}
// Run in spark-shell, where spark.implicits._ is in scope and turns 'x into a Column.
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
  org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
  mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
  isDistinct = false
).references
{code}

fails with

{code:scala}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to exprId on unresolved object, tree: 'y
	at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
	at org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
	at org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
	at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
	at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
	at java.util.TimSort.sort(TimSort.java:220)
	at java.util.Arrays.sort(Arrays.java:1438)
	at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
	at scala.collection.AbstractSeq.sorted(Seq.scala:41)
	at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
	at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
	at org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
	at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
{code}

The fix is to avoid calling {{toSeq}}, since ordering does not matter in {{references}}, and to simplify (and speed up) the implementation to something like:

{code:scala}
mode match {
  case Partial | Complete => aggregateFunction.references
  case PartialMerge | Final => AttributeSet(aggregateFunction.aggBufferAttributes)
}
{code}
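
The failure mode and the proposed fix can be illustrated without a Spark dependency. The sketch below is a simplified model, not the Catalyst code: {{Attr}}, {{Resolved}}, {{Unresolved}}, {{AggFunction}}, {{stableToSeq}}, {{referencesViaToSeq}} and {{referencesDirect}} are hypothetical stand-ins, plain {{Set}} and {{Seq}} replace {{AttributeSet}}, and the sort key (name, then expression ID) is an assumption based on the linked PR. It only shows that sorting by expression ID throws as soon as two unresolved attributes are compared, and that returning the set directly never touches {{exprId}}.

```scala
object ReferencesSketch {
  // Hypothetical stand-ins for Catalyst attributes (NOT the Spark classes).
  sealed trait Attr { def name: String; def exprId: Long }
  final case class Resolved(name: String, exprId: Long) extends Attr
  final case class Unresolved(name: String) extends Attr {
    // Like an unresolved attribute, there is no expression ID to return.
    def exprId: Long = throw new UnsupportedOperationException(
      s"Invalid call to exprId on unresolved object, tree: '$name")
  }

  sealed trait Mode
  case object Partial extends Mode
  case object PartialMerge extends Mode
  case object Final extends Mode
  case object Complete extends Mode

  // Hypothetical aggregate function: input references plus buffer attributes.
  final case class AggFunction(references: Set[Attr], aggBufferAttributes: Seq[Attr])

  // Assumed model of a stable toSeq: the sort key reads exprId on every comparison.
  def stableToSeq(attrs: Set[Attr]): Seq[Attr] =
    attrs.toSeq.sortBy(a => (a.name, a.exprId))

  // Shape of the failing shortcut: goes through the sorted sequence.
  def referencesViaToSeq(fn: AggFunction, mode: Mode): Set[Attr] = {
    val child: Seq[Attr] = mode match {
      case Partial | Complete   => stableToSeq(fn.references)
      case PartialMerge | Final => fn.aggBufferAttributes
    }
    child.toSet
  }

  // Shape of the proposed fix: no sorting, so exprId is never evaluated.
  def referencesDirect(fn: AggFunction, mode: Mode): Set[Attr] = mode match {
    case Partial | Complete   => fn.references
    case PartialMerge | Final => fn.aggBufferAttributes.toSet
  }

  def main(args: Array[String]): Unit = {
    // Models Sum('x + 'y) before analysis: both inputs are unresolved.
    val sum = AggFunction(
      references = Set(Unresolved("x"), Unresolved("y")),
      aggBufferAttributes = Seq(Resolved("sum", 1L)))

    val viaToSeq = scala.util.Try(referencesViaToSeq(sum, Complete))
    println(s"via toSeq failed: ${viaToSeq.isFailure}")
    println(s"direct: ${referencesDirect(sum, Complete).map(_.name).toList.sorted}")
  }
}
```

In this model the {{PartialMerge}} and {{Final}} branches behave the same in both versions, which matches the report: only the modes that read the aggregate function's own references are affected.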