[
https://issues.apache.org/jira/browse/HIVE-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858362#comment-13858362
]
Hive QA commented on HIVE-6120:
-------------------------------
{color:red}Overall{color}: -1, at least one test failed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12620770/HIVE-6120.1.patch
{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 4818 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_8
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby2
org.apache.hadoop.hive.ql.parse.TestParse.testParse_groupby3
{noformat}
Test results:
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/764/testReport
Console output:
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/764/console
Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}
This message is automatically generated.
ATTACHMENT ID: 12620770
> Add GroupBy optimization to eliminate un-needed partial distinct aggregations
> -----------------------------------------------------------------------------
>
> Key: HIVE-6120
> URL: https://issues.apache.org/jira/browse/HIVE-6120
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Reporter: Sun Rui
> Assignee: Sun Rui
> Attachments: HIVE-6120.1.patch
>
>
> In most cases, partial distinct aggregation is not needed in the map-side
> GroupBy. The exception is sorted bucketized tables, where in some scenarios
> the mappers can perform the partial distinct aggregation themselves, as
> GroupByOptimizer already does.
> Currently, the map-side GroupBy computes the partial distinct aggregation and
> the following ReduceSink operator shuffles the partial result, even in cases
> where neither is needed. This wastes CPU cycles, memory, and network
> bandwidth.
> This optimization eliminates un-needed partial distinct aggregations, which
> improves performance and reduces memory usage.
> For example,
> EXPLAIN SELECT key, count(DISTINCT value) FROM src GROUP BY key;
> Before optimization:
> {noformat}
> Group By Operator
>   aggregations:
>         expr: count(DISTINCT value)
>   bucketGroup: false
>   keys:
>         expr: key
>         type: int
>         expr: value
>         type: string
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2
>   Reduce Output Operator
>     key expressions:
>           expr: _col0
>           type: int
>           expr: _col1
>           type: string
>     sort order: ++
>     Map-reduce partition columns:
>           expr: _col0
>           type: int
>     tag: -1
>     value expressions:
>           expr: _col2
>           type: bigint
> {noformat}
> After optimization:
> {noformat}
> Group By Operator
>   bucketGroup: false
>   keys:
>         expr: key
>         type: int
>         expr: value
>         type: string
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Reduce Output Operator
>     key expressions:
>           expr: _col0
>           type: int
>           expr: _col1
>           type: string
>     sort order: ++
>     Map-reduce partition columns:
>           expr: _col0
>           type: int
>     tag: -1
> {noformat}
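
To illustrate why the partial count in the plan above is redundant, here is a toy Python model of the two map-side plans. It is only a sketch and not Hive code: the function names are hypothetical, and it assumes the reducer derives count(DISTINCT value) from the shuffled (key, value) keys rather than from the partial value column (_col2). Under that assumption, dropping the partial column leaves the final result unchanged while shrinking each shuffled record.

```python
from collections import defaultdict


def map_side_with_partial(rows):
    """Toy model of the map-side hash GroupBy before the optimization.

    Groups on (key, value) and carries a partial count column, mirroring
    the _col0, _col1, _col2 output of the first plan.
    """
    table = {}
    for key, value in rows:
        # Each distinct (key, value) pair carries a partial count of 1.
        table[(key, value)] = 1
    return [(k, v, partial) for (k, v), partial in table.items()]


def map_side_without_partial(rows):
    """Toy model after the optimization: deduplicate on (key, value) only."""
    return list({(key, value) for key, value in rows})


def reduce_side(shuffled):
    """Toy reducer: counts distinct values per key from the shuffled keys.

    Any trailing partial-count column is ignored (an assumption of this
    sketch), because the same (key, value) pair may arrive from several
    mappers and the partial counts cannot simply be summed.
    """
    distinct = defaultdict(set)
    for record in shuffled:
        key, value = record[0], record[1]
        distinct[key].add(value)
    return {key: len(values) for key, values in distinct.items()}


# Two hypothetical mappers whose inputs overlap on (1, "a") and (2, "c").
mapper1 = [(1, "a"), (1, "a"), (1, "b"), (2, "c")]
mapper2 = [(1, "a"), (2, "c"), (2, "d")]

before = reduce_side(map_side_with_partial(mapper1) + map_side_with_partial(mapper2))
after = reduce_side(map_side_without_partial(mapper1) + map_side_without_partial(mapper2))

# Both plans yield the same distinct counts per key.
assert before == after
print(after)
```

In this model the only difference between the two plans is the extra column each mapper emits, which corresponds to the value expression _col2 that disappears from the Reduce Output Operator after the optimization.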