[ https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheng Hao updated SPARK-4367:
-----------------------------
    Summary: 2 Phase-shuffle to optimize the DISTINCT aggregation  (was: Process the "distinct" value before shuffling for aggregation)

> 2 Phase-shuffle to optimize the DISTINCT aggregation
> ----------------------------------------------------
>
>                 Key: SPARK-4367
>                 URL: https://issues.apache.org/jira/browse/SPARK-4367
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Cheng Hao
>
> Most aggregate functions (e.g. average) applied to "distinct" values require
> all of the records in the same group to be shuffled to a single node.
> As an optimization, those records can be partially aggregated before
> shuffling, which would probably reduce the shuffle overhead significantly.
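
The idea in the description can be illustrated with a small simulation in plain Python. This is only a sketch under stated assumptions, not Spark's implementation: partitions are modeled as lists of (group, value) pairs, and the names local_distinct, shuffle_by_group and avg_distinct are hypothetical helpers made up for this example. Each partition drops its duplicate (group, value) pairs locally, so fewer records cross the shuffle boundary; the reducer side then merges the surviving values per group and computes the average over distinct values.

```python
from collections import defaultdict


def local_distinct(partition):
    """Map side: collapse duplicate (group, value) pairs within one partition."""
    return set(partition)


def shuffle_by_group(partials, num_reducers):
    """Route each surviving (group, value) pair to a reducer chosen by its group key.

    Duplicates that came from different partitions are merged here,
    because each reducer keeps a set of values per group.
    """
    reducers = [defaultdict(set) for _ in range(num_reducers)]
    for partial in partials:
        for group, value in partial:
            reducers[hash(group) % num_reducers][group].add(value)
    return reducers


def avg_distinct(partitions, num_reducers=2):
    """Return ({group: average of distinct values}, number of records shuffled)."""
    partials = [local_distinct(p) for p in partitions]
    shuffled = sum(len(p) for p in partials)
    result = {}
    for reducer in shuffle_by_group(partials, num_reducers):
        for group, values in reducer.items():
            result[group] = sum(values) / len(values)
    return result, shuffled


if __name__ == "__main__":
    # Two hypothetical input partitions with 8 raw records in total.
    partitions = [
        [("a", 1), ("a", 1), ("a", 2), ("b", 5)],
        [("a", 2), ("a", 3), ("b", 5), ("b", 5)],
    ]
    averages, shuffled = avg_distinct(partitions)
    print(averages, shuffled)
```

In this toy input, local de-duplication removes one record from each partition before the shuffle. How much is saved in practice depends on how many duplicates each partition holds, which is why the description only says the overhead is "probably" reduced.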