Cheng Lian created SPARK-2554:
---------------------------------

             Summary: CountDistinct and SumDistinct should do partial aggregation
                 Key: SPARK-2554
                 URL: https://issues.apache.org/jira/browse/SPARK-2554
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.0.1, 1.0.2
            Reporter: Cheng Lian

{{CountDistinct}} and {{SumDistinct}} should first do a partial aggregation within each partition and return the set of unique values in that partition as the partial result. The final aggregation then only has to merge these sets. Shuffle IO can be greatly reduced when there are only a few unique values, because each partition ships each distinct value at most once instead of every row.
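
The proposal can be illustrated with a minimal sketch. This is not Spark or Catalyst code: plain Python lists stand in for partitions, and the helper names (partial_distinct, merge_distinct, count_distinct, sum_distinct) are hypothetical, chosen only for this example. It shows the two phases described above: build a set of unique values per partition, then merge the sets and compute the final result.

```python
from typing import Hashable, Iterable, List, Set


def partial_distinct(partition: Iterable[Hashable]) -> Set[Hashable]:
    """Partial aggregation: reduce one partition to its set of unique values."""
    return set(partition)


def merge_distinct(partials: Iterable[Set[Hashable]]) -> Set[Hashable]:
    """Final aggregation step: union the per-partition sets."""
    merged: Set[Hashable] = set()
    for partial in partials:
        merged |= partial
    return merged


def count_distinct(partitions: List[List[int]]) -> int:
    """Count distinct values across all partitions via partial sets."""
    return len(merge_distinct(partial_distinct(p) for p in partitions))


def sum_distinct(partitions: List[List[int]]) -> int:
    """Sum distinct values across all partitions via partial sets."""
    return sum(merge_distinct(partial_distinct(p) for p in partitions))


# Hypothetical input: three partitions with heavily repeated values.
partitions = [[1, 1, 2, 2, 2], [2, 3, 3], [1, 3, 3, 3]]

# Number of values that would cross the shuffle boundary in each approach.
rows_without_partial = sum(len(p) for p in partitions)
rows_with_partial = sum(len(partial_distinct(p)) for p in partitions)

print(count_distinct(partitions), sum_distinct(partitions))
print(rows_without_partial, rows_with_partial)
```

In this toy input the partial sets carry 6 values in total instead of 12 raw rows. The saving depends entirely on how often values repeat within a partition; when nearly every value is unique, the partial sets are about as large as the input.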