Cheng Su created SPARK-34286:
--------------------------------

             Summary: Memory-size-based object hash aggregation fallback mechanism
                 Key: SPARK-34286
                 URL: https://issues.apache.org/jira/browse/SPARK-34286
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Cheng Su
The object hash aggregate fallback mechanism is special: it falls back to sort-based aggregation based on the number of keys seen so far [0]. This fallback logic is sometimes sub-optimal, leading to an unnecessary sort and run-time performance degradation.

Because the on-heap object hash aggregation [map value is a general {{InternalRow}}|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationMap.scala#L36], it is hard to get an accurate metric for the memory size of this map. Hive aggregation uses a combination of [current JVM heap memory usage|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java#L872] and an [estimation of map size|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java#L919-L920] to decide whether to spill. Hive's approach might be better. I am also tagging queries in our production environment that trigger the object hash aggregate fallback, to see how they can be improved.

One thing we can do here is to monitor JVM heap memory usage and estimate the size of the on-heap object hash aggregation map, then fall back based on that information instead of the number of keys (see the sketch below).

[0]: [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala#L172]
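To make the proposal concrete, here is a minimal sketch of what such a memory-size-based fallback check could look like, loosely modeled on the Hive heuristic linked above. This is not Spark's actual code; the object name and all thresholds ({{MemoryBasedFallback}}, {{maxHeapFraction}}, {{estimatedEntrySizeBytes}}, {{minMapSizeBytesForSpill}}) are hypothetical placeholders:

{code:scala}
import java.lang.management.ManagementFactory

object MemoryBasedFallback {
  // Hypothetical threshold: treat the JVM as under memory pressure once
  // used heap exceeds this fraction of max heap.
  private val maxHeapFraction = 0.9

  // Hypothetical per-entry estimate in bytes (grouping key + aggregation
  // buffer). A real implementation would estimate the actual InternalRow
  // sizes instead of using a fixed constant.
  private val estimatedEntrySizeBytes = 256L

  // Hypothetical floor: spilling a very small map buys little, even when
  // the heap is under pressure.
  private val minMapSizeBytesForSpill = 64L * 1024 * 1024

  private val memoryBean = ManagementFactory.getMemoryMXBean

  /** Decide whether to fall back to sort-based aggregation by combining
   *  current JVM heap usage with an estimate of the hash map footprint. */
  def shouldFallback(numMapEntries: Long): Boolean = {
    val heap = memoryBean.getHeapMemoryUsage
    // MemoryUsage.getMax returns -1 when undefined; for the heap it is
    // normally set via -Xmx.
    val maxHeap = heap.getMax
    val underPressure =
      maxHeap > 0 && heap.getUsed.toDouble / maxHeap > maxHeapFraction
    val estimatedMapSizeBytes = numMapEntries * estimatedEntrySizeBytes
    // Fall back only when the JVM is short on heap AND the map itself is a
    // non-trivial consumer, so a tiny map is not spilled just because
    // something else filled the heap.
    underPressure && estimatedMapSizeBytes > minMapSizeBytesForSpill
  }
}
{code}

Compared with the current key-count threshold at [0], a check along these lines would trigger the sort fallback only when memory is actually scarce, at the cost of depending on a per-entry size estimate that can be inaccurate for variable-size aggregation buffers.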