[
https://issues.apache.org/jira/browse/HIVE-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yin Huai updated HIVE-5357:
---------------------------
Description:
Example:
{code}
select key, count(distinct value) from (select key, value from src group by
key, value) t group by key;
//result
0 0 NULL
10 10 NULL
100 100 NULL
103 103 NULL
104 104 NULL
{code}
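The result is wrong: the query selects two columns (key and the distinct
count), but each returned row has three fields, the last of them NULL.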
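For comparison, the subquery already yields distinct (key, value) pairs, so the
nested query should return the same rows as a direct aggregation over src. A
minimal reference query, assuming the same src test table used above:
{code}
-- Reference query (illustrative): should return the same (key, count) rows
-- as the nested query above, with exactly two columns per row.
select key, count(distinct value) from src group by key;
{code}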
Consider a simple GROUP BY query with a distinct column:
{code}
explain select count(distinct value) from src group by key;
{code}
The plan is:
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src
          TableScan
            alias: src
            Select Operator
              expressions:
                    expr: key
                    type: string
                    expr: value
                    type: string
              outputColumnNames: key, value
              Group By Operator
                aggregations:
                      expr: count(DISTINCT value)
                bucketGroup: false
                keys:
                      expr: key
                      type: string
                      expr: value
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: string
                  sort order: ++
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: -1
                  value expressions:
                        expr: _col2
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}
As the plan shows, the map-side GBY adds the distinct columns (value in this
case) to its key columns, while the reduce-side GBY groups by key only.
When ReduceSinkDeDuplication (RSDedup) optimizes a query involving a GBY with
distinct keys and map-side aggregation is enabled, it currently assigns the
map-side GBY's key columns to the reduce-side GBY. So, for the example shown at
the beginning, once the plan is collapsed into a single MR job, the second
reduce-side GBY uses both key and value as its key columns. The correct key
column is key alone.
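A possible way to check whether RSDedup is responsible is to rerun the failing
query with the optimizer switched off, or with map-side aggregation switched
off, and compare the results. This is only a sketch: it assumes the
hive.optimize.reducededuplication and hive.map.aggr settings are available in
your build, and it has not been verified against this bug.
{code}
-- Assumption: disabling RSDedup keeps the two-job plan, so the outer GBY
-- should group by key only.
set hive.optimize.reducededuplication=false;
select key, count(distinct value) from (select key, value from src group by
key, value) t group by key;

-- Alternative: keep RSDedup on but disable map-side aggregation, the
-- condition named in the description, then inspect the reduce-side GBY keys.
set hive.optimize.reducededuplication=true;
set hive.map.aggr=false;
explain select key, count(distinct value) from (select key, value from src
group by key, value) t group by key;
{code}
In the explain output, the keys listed under the final reduce-side Group By
Operator are the thing to look at: the description above says they should
contain key only.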
was:
{code}
select key, count(distinct value) from (select key, value from src group by
key, value) t group by key;
//result
0 0 NULL
10 10 NULL
100 100 NULL
103 103 NULL
104 104 NULL
{code}
Obviously the result is wrong.
> ReduceSinkDeDuplication optimizer pick the wrong keys in pRS-cGBYm-cRS-cGBYr
> scenario when there are distinct keys in child GBY
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-5357
> URL: https://issues.apache.org/jira/browse/HIVE-5357
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.11.0
> Reporter: Chun Chen
> Assignee: Chun Chen
> Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: HIVE-5357.patch
>
>