[
https://issues.apache.org/jira/browse/HIVE-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780015#comment-13780015
]
Yin Huai commented on HIVE-5358:
--------------------------------
My last example was not good... Let me try another example. The query may not
make much sense, but I hope it can make the problem clear.
{code}
select c3, c2 from (select c1, c2, c3, c4 from t2 group by c1, c2, c3, c4) t
group by c3, c2;
{code}
For the first GBY, we want to group rows based on [c1, c2, c3, c4], and then we
want to group the output of the first GBY based on [c3, c2]. We can use [c2, c3]
as the partitioning columns to make sure rows are distributed correctly. Then,
if we use [c3, c2] as the sorting columns (key columns in RS), c1 and c4 will
be in the value columns of RS. It seems we would also need to adjust the first
GBY to construct its key from both the key and the value of the reduce input.
If we instead use [c1, c2, c3, c4] as the sorting columns, it seems we need to
introduce a sort operator to generate row groups based on [c3, c2].
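To make the second case concrete, here is a minimal Python sketch (not Hive code). It assumes a single reducer that receives every row, a sort-based GBY that starts a new group whenever the key of adjacent rows changes, and three made-up rows of t2; the helper name streaming_groups is hypothetical.

```python
from itertools import groupby

# Hypothetical rows of t2 as (c1, c2, c3, c4); values are for illustration only.
rows = [
    (1, 10, 100, 7),
    (2, 20, 200, 7),
    (3, 10, 100, 7),
]

def streaming_groups(sorted_rows, key):
    """Return one key per run of adjacent rows, like a sort-based GBY would."""
    return [k for k, _ in groupby(sorted_rows, key=key)]

def by_c3_c2(row):
    """Grouping key of the outer GBY: (c3, c2)."""
    return (row[2], row[1])

# Case 1: rows arrive sorted on [c1, c2, c3, c4].
groups_full_sort = streaming_groups(sorted(rows), by_c3_c2)

# Case 2: rows arrive sorted on [c3, c2].
groups_c3c2_sort = streaming_groups(sorted(rows, key=by_c3_c2), by_c3_c2)

print(groups_full_sort)   # (100, 10) shows up in two separate runs
print(groups_c3c2_sort)   # each (c3, c2) pair forms exactly one run
```

With the [c1, c2, c3, c4] sort order, the rows for (c3, c2) = (100, 10) are split by the row for (200, 20), so a streaming GBY on [c3, c2] would emit that group twice unless the rows are re-sorted first.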
I am also attaching the plan generated by your .2 patch (HIVE-5358.2.patch):
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t:t2
          TableScan
            alias: t2
            Select Operator
              expressions:
                    expr: c1
                    type: int
                    expr: c2
                    type: int
                    expr: c3
                    type: int
                    expr: c4
                    type: int
              outputColumnNames: c1, c2, c3, c4
              Group By Operator
                bucketGroup: false
                keys:
                      expr: c1
                      type: int
                      expr: c2
                      type: int
                      expr: c3
                      type: int
                      expr: c4
                      type: int
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: int
                        expr: _col1
                        type: int
                        expr: _col2
                        type: int
                        expr: _col3
                        type: int
                  sort order: ++++
                  Map-reduce partition columns:
                        expr: _col2
                        type: int
                        expr: _col1
                        type: int
                  tag: -1
      Reduce Operator Tree:
        Group By Operator
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
                expr: KEY._col1
                type: int
                expr: KEY._col2
                type: int
                expr: KEY._col3
                type: int
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Select Operator
            expressions:
                  expr: _col2
                  type: int
                  expr: _col1
                  type: int
            outputColumnNames: _col2, _col1
            Group By Operator
              bucketGroup: false
              keys:
                    expr: _col2
                    type: int
                    expr: _col1
                    type: int
              mode: complete
              outputColumnNames: _col0, _col1
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                outputColumnNames: _col0, _col1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}
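One way to read this plan is to compare the RS sort keys with the keys of each GBY in the Reduce Operator Tree. The Python sketch below is only an illustration, not the ReduceSinkDeDuplication logic: the function name grouping_is_contiguous is hypothetical, and the check is a sufficient condition under the assumption that a reduce-side GBY detects groups from changes in adjacent sorted rows.

```python
def grouping_is_contiguous(sort_keys, group_keys):
    """Sufficient check: the group keys, in any order, are a prefix of the sort keys."""
    prefix = sort_keys[:len(group_keys)]
    return set(prefix) == set(group_keys)

# Column names copied from the plan above.
rs_sort_keys = ["_col0", "_col1", "_col2", "_col3"]
first_gby_keys = ["_col0", "_col1", "_col2", "_col3"]   # mode: mergepartial
second_gby_keys = ["_col2", "_col1"]                    # mode: complete

first_ok = grouping_is_contiguous(rs_sort_keys, first_gby_keys)
second_ok = grouping_is_contiguous(rs_sort_keys, second_gby_keys)

# The example from the issue description: same columns, different order.
issue_ok = grouping_is_contiguous(["key", "value"], ["value", "key"])

print(first_ok, second_ok, issue_ok)
```

Under this check the example in the issue description passes because only the column order differs, while the second GBY here does not: [_col2, _col1] is a subset of the sort keys but not a prefix of them.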
> ReduceSinkDeDuplication should ignore column orders when check overlapping
> part of keys between parent and child
> ----------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-5358
> URL: https://issues.apache.org/jira/browse/HIVE-5358
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Reporter: Chun Chen
> Assignee: Chun Chen
> Attachments: D13113.1.patch, HIVE-5358.2.patch, HIVE-5358.patch
>
>
> {code}
> select key, value from (select key, value from src group by key, value) t
> group by key, value;
> {code}
> This can be optimized by ReduceSinkDeDuplication
> {code}
> select key, value from (select key, value from src group by key, value) t
> group by value, key;
> {code}
> However, the SQL above currently cannot be optimized by ReduceSinkDeDuplication
> because the parent and child operators list their key columns in different orders.