Louis Fruleux created SPARK-34464:
-------------------------------------

             Summary: `first` function is sorting the dataset while sometimes it is used to get "any value"
                 Key: SPARK-34464
                 URL: https://issues.apache.org/jira/browse/SPARK-34464
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Louis Fruleux
When one wants to groupBy and take any value (not necessarily the first), one usually uses the [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485] aggregation function. Unfortunately, this function is planned as a `SortAggregate` for some data types, which is not always necessary and might impact performance. Is this the desired behavior?

Current behavior:

val df = Seq((0, "value")).toDF("key", "value")
df.groupBy("key").agg(first("value")).explain()
/*
== Physical Plan ==
SortAggregate(key=[key#342], functions=[first(value#343, false)])
+- *(2) Sort [key#342 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(key#342, 200)
      +- SortAggregate(key=[key#342], functions=[partial_first(value#343, false)])
         +- *(1) Sort [key#342 ASC NULLS FIRST], false, 0
            +- LocalTableScan [key#342, value#343]
*/

My understanding of the source code does not allow me to fully explain why this is the current behavior. The solution might be to implement a new aggregate function, but its code would be highly similar to `first`, and I don't fully understand why this [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45] method falls back to SortAggregate in the first place.
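For what it's worth, the fallback seems to depend on the type of the aggregation buffer rather than on `first` itself: if I read HashAggregateExec correctly, it only supports buffers made of fixed-width, mutable fields, and `first` over a string column carries the string in its buffer. A quick way to check this assumption is to run the same query on an integer column, which should be planned as a HashAggregate:

val dfInt = Seq((0, 1)).toDF("key", "value")
dfInt.groupBy("key").agg(first("value")).explain()
// Expected: HashAggregate nodes instead of SortAggregate, since the
// partial_first buffer now only holds fixed-width fields (an int and a boolean).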
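As a possible workaround for the "any value" use case (rather than a new built-in aggregate), a typed Aggregator registered through `udaf` (available since 3.0.0) is, as far as I can tell, backed by a TypedImperativeAggregate, so it should be planned as an ObjectHashAggregate rather than a SortAggregate when spark.sql.execution.useObjectHashAggregateExec is enabled (the default). A minimal sketch, with `AnyString` being a hypothetical name:

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, udaf}
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical "any value" aggregate: keeps whichever value it happens to see
// first, with no ordering guarantee. The buffer is (value, seen) so that the
// buffer encoder never has to deal with a null value.
object AnyString extends Aggregator[String, (String, Boolean), String] {
  def zero: (String, Boolean) = ("", false)
  def reduce(b: (String, Boolean), a: String): (String, Boolean) =
    if (b._2) b else (a, true)
  def merge(b1: (String, Boolean), b2: (String, Boolean)): (String, Boolean) =
    if (b1._2) b1 else b2
  def finish(b: (String, Boolean)): String = b._1
  def bufferEncoder: Encoder[(String, Boolean)] =
    Encoders.tuple(Encoders.STRING, Encoders.scalaBoolean)
  def outputEncoder: Encoder[String] = Encoders.STRING
}

val anyString = udaf(AnyString)
df.groupBy("key").agg(anyString(col("value"))).explain()
// Expected: ObjectHashAggregate with no Sort before the aggregate, assuming
// spark.sql.execution.useObjectHashAggregateExec stays at its default (true).

This only sidesteps the question for strings, though; the underlying question of why createAggregate picks SortAggregate here still stands.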