Simeon Simeonov created SPARK-26084:
---------------------------------------

             Summary: AggregateExpression.references fails on unresolved expression trees
                 Key: SPARK-26084
                 URL: https://issues.apache.org/jira/browse/SPARK-26084
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: Simeon Simeonov


[SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
stable ordering in {{AttributeSet.toSeq}} by sorting on expression IDs 
([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128]).
 That change overlooked the fact that {{AggregateExpression.references}} uses 
{{AttributeSet.toSeq}} as a shortcut 
([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
 Sorting calls {{exprId}} on every attribute, and {{UnresolvedAttribute.exprId}} 
throws, so {{AggregateExpression.references}} now fails for aggregate functions 
whose expression trees contain unresolved attributes. For example,

{code:scala}
// Run in spark-shell, where spark.implicits._ turns 'x and 'y into Columns.
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
  org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
  mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
  isDistinct = false
).references
{code}

fails with

{code:scala}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to exprId on unresolved object, tree: 'y
        at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
        at org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
        at org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
        at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
        at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
        at java.util.TimSort.sort(TimSort.java:220)
        at java.util.Arrays.sort(Arrays.java:1438)
        at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
        at scala.collection.AbstractSeq.sorted(Seq.scala:41)
        at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
        at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
        at org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
        at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
{code}

The fix is to stop calling {{toSeq}}, since ordering does not matter in 
{{references}}, and to simplify (and speed up) the implementation to something like

{code:scala}
mode match {
  case Partial | Complete => aggregateFunction.references
  case PartialMerge | Final => AttributeSet(aggregateFunction.aggBufferAttributes)
}
{code}
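
To illustrate why dropping the ordered view sidesteps the failure, here is a minimal, Spark-free sketch. It is a toy model only: {{Attr}}, {{Resolved}}, {{Unresolved}} and {{ToyReferences}} are hypothetical names that imitate the behaviour seen in the stack trace, not Catalyst's actual classes.

```scala
// Toy model, NOT Spark's classes: it only imitates the relevant behaviour.
sealed trait Attr {
  def name: String
  def exprId: Long
}

final case class Resolved(name: String, exprId: Long) extends Attr

final case class Unresolved(name: String) extends Attr {
  // Like the UnresolvedAttribute in the stack trace, an unresolved
  // attribute has no ID and throws when asked for one.
  def exprId: Long =
    throw new UnsupportedOperationException(
      s"Invalid call to exprId on unresolved object, tree: '$name")
}

object ToyReferences {
  // Ordered view: sortBy evaluates exprId while comparing elements,
  // so it fails as soon as two or more unresolved attributes are compared.
  def orderedRefs(attrs: Set[Attr]): Seq[Attr] =
    attrs.toSeq.sortBy(_.exprId)

  // Unordered view: returns the set untouched and never calls exprId.
  def unorderedRefs(attrs: Set[Attr]): Set[Attr] = attrs
}
```

In this model, {{ToyReferences.orderedRefs(Set(Unresolved("x"), Unresolved("y")))}} throws, while {{ToyReferences.unorderedRefs}} on the same input simply returns both attributes. The same reasoning is why the proposed {{references}} can work on unresolved trees once it no longer goes through {{toSeq}}.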



