Cheng Su created SPARK-34286:
--------------------------------

             Summary: Memory-size-based object hash aggregation fallback mechanism
                 Key: SPARK-34286
                 URL: https://issues.apache.org/jira/browse/SPARK-34286
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Cheng Su


The object hash aggregate fallback mechanism is special: it falls back to sort-based aggregation based on the number of keys seen so far [0]. This fallback logic is sometimes sub-optimal, leading to unnecessary sorts and runtime performance degradation. Given that the on-heap object hash aggregation [map value is a general {{InternalRow}}|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationMap.scala#L36], it is hard to get an accurate metric for the memory size of this map. Hive aggregation uses a combination of [current JVM heap memory usage|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java#L872] and an [estimation of map size|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java#L919-L920] to decide whether to spill, and Hive's approach might be better. I am also tagging queries in our production environment that hit the object hash aggregate fallback, to see how to improve them. One thing we can do here is to monitor JVM heap memory usage and estimate the size of the on-heap object hash aggregation map, then fall back based on that information instead of the number of keys.

[0]: 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala#L172]
 


