[
https://issues.apache.org/jira/browse/HIVE-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sun Rui updated HIVE-6120:
--------------------------
Attachment: group2.diff
> Add GroupBy optimization to eliminate un-needed partial distinct aggregations
> -----------------------------------------------------------------------------
>
> Key: HIVE-6120
> URL: https://issues.apache.org/jira/browse/HIVE-6120
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Reporter: Sun Rui
> Assignee: Sun Rui
>
> In most cases, partial distinct aggregation is not needed in the map-side
> GroupBy. The exception is sorted, bucketed tables, where in some scenarios the
> mappers can perform the partial distinct aggregation, as GroupByOptimizer
> does.
> Currently, the map-side GroupBy computes partial distinct aggregations and the
> following ReduceSink operator shuffles the partial results, even in cases
> where they are not needed. This wastes CPU cycles, memory and network
> bandwidth.
> This optimization eliminates the unneeded partial distinct aggregations, which
> improves performance and reduces memory usage.
> For example,
> EXPLAIN SELECT key, count(DISTINCT value) FROM src GROUP BY key;
> Before optimization:
> {noformat}
> Group By Operator
>   aggregations:
>         expr: count(DISTINCT value)
>   bucketGroup: false
>   keys:
>         expr: key
>         type: int
>         expr: value
>         type: string
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Reduce Output Operator
>     key expressions:
>           expr: _col0
>           type: int
>           expr: _col1
>           type: string
>     sort order: ++
>     Map-reduce partition columns:
>           expr: _col0
>           type: int
>     tag: -1
>     value expressions:
>           expr: _col2
>           type: bigint
> {noformat}
> After optimization:
> {noformat}
> Group By Operator
>   bucketGroup: false
>   keys:
>         expr: key
>         type: int
>         expr: value
>         type: string
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Reduce Output Operator
>     key expressions:
>           expr: _col0
>           type: int
>           expr: _col1
>           type: string
>     sort order: ++
>     Map-reduce partition columns:
>           expr: _col0
>           type: int
>     tag: -1
> {noformat}
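To illustrate why the map-side partial count (_col2 in the "before" plan) carries no useful information, here is a small Python simulation. It is only a sketch of the idea, not Hive code: the function names and the sample splits are made up for illustration. Each simulated mapper emits its distinct (key, value) pairs, and the reducer computes count(DISTINCT value) from those pairs alone. Summing per-mapper distinct counts instead would over-count whenever the same (key, value) pair appears in more than one split.

```python
from collections import defaultdict


def map_side_group_by(rows):
    """Simulate a hash-mode group-by on (key, value): emit each distinct
    pair once, with no partial count attached."""
    return sorted(set(rows))


def reduce_side_count_distinct(pairs):
    """Compute count(DISTINCT value) per key from (key, value) pairs."""
    distinct = defaultdict(set)
    for key, value in pairs:
        distinct[key].add(value)
    return {key: len(values) for key, values in distinct.items()}


def naive_sum_of_partial_counts(splits):
    """Sum the per-split distinct counts. This over-counts when a
    (key, value) pair occurs in more than one split."""
    totals = defaultdict(int)
    for rows in splits:
        for key, count in reduce_side_count_distinct(rows).items():
            totals[key] += count
    return dict(totals)


# Two hypothetical input splits of (key, value) rows, one per mapper.
splits = [
    [(1, "a"), (1, "a"), (1, "b"), (2, "x")],
    [(1, "b"), (1, "c"), (2, "x")],
]

# "Shuffle": concatenate what every mapper emits, then reduce.
shuffled = [pair for rows in splits for pair in map_side_group_by(rows)]
result = reduce_side_count_distinct(shuffled)
naive = naive_sum_of_partial_counts(splits)

print(result)  # {1: 3, 2: 1}
print(naive)   # {1: 4, 2: 2}
```

The reducer gets the correct answer from the shuffled keys only, which matches the "after" plan, where the ReduceSink has no value expressions.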
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)