[jira] [Commented] (HIVE-5358) ReduceSinkDeDuplication should ignore column orders when check overlapping part of keys between parent and child

Yin Huai (JIRA) Fri, 27 Sep 2013 08:20:57 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780015#comment-13780015
 ]


Yin Huai commented on HIVE-5358:
--------------------------------

My last example was not a good one. Let me try another. The query may not 
make much sense, but I hope it makes the problem clear.

{code}
select c3, c2 from (select c1, c2, c3, c4 from t2 group by c1, c2, c3, c4) t 
group by c3, c2;
{code}

For the first GBY, we want to group rows by [c1, c2, c3, c4], and then we 
want to group the output of the first GBY by [c3, c2]. We can use [c2, c3] 
as the partitioning columns to make sure rows are distributed correctly. 
Then, if we use [c3, c2] as the sorting columns (the key columns in the RS), 
c1 and c4 will be in the value columns of the RS. It seems we would also need 
to adjust the first GBY to construct its key from both the key and the value 
of the reduce input. If we instead use [c1, c2, c3, c4] as the sorting 
columns, it seems we would need to introduce a sort operator to generate row 
groups based on [c3, c2].
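
To make the sort-order concern concrete, here is a small Python sketch (not 
Hive code). It assumes a sort-based GBY that emits one group each time the 
key changes in its input, and it assumes all rows shown land in the same 
reducer. The row values and the helper name streaming_groups are made up for 
illustration.

```python
from itertools import groupby

# Hypothetical rows of t2 as (c1, c2, c3, c4); values are invented.
rows = [
    (1, 10, 100, 7),
    (2, 20, 200, 7),
    (3, 10, 100, 7),
]


def child_key(row):
    """Key of the second GBY: [c3, c2]."""
    return (row[2], row[1])


def streaming_groups(sorted_rows, key):
    """Return one key per run of consecutive equal keys.

    This mimics (as an assumption) a GBY that relies on its input being
    sorted and starts a new group whenever the key changes.
    """
    return [k for k, _ in groupby(sorted_rows, key=key)]


# Case 1: reducer input sorted on [c1, c2, c3, c4].
# Rows with the same (c3, c2) are not adjacent, so the group is split.
groups_parent_order = streaming_groups(sorted(rows), child_key)

# Case 2: reducer input sorted on [c3, c2].
# Rows with the same (c3, c2) are adjacent, so each group appears once.
groups_child_order = streaming_groups(sorted(rows, key=child_key), child_key)

print(groups_parent_order)  # [(100, 10), (200, 20), (100, 10)]
print(groups_child_order)   # [(100, 10), (200, 20)]
```

In case 1 the key (100, 10) is emitted twice, which is why sorting on 
[c1, c2, c3, c4] would need an extra sort before grouping on [c3, c2].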

I am also attaching the plan generated by your .2 patch (HIVE-5358.2.patch):
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t:t2 
          TableScan
            alias: t2
            Select Operator
              expressions:
                    expr: c1
                    type: int
                    expr: c2
                    type: int
                    expr: c3
                    type: int
                    expr: c4
                    type: int
              outputColumnNames: c1, c2, c3, c4
              Group By Operator
                bucketGroup: false
                keys:
                      expr: c1
                      type: int
                      expr: c2
                      type: int
                      expr: c3
                      type: int
                      expr: c4
                      type: int
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: int
                        expr: _col1
                        type: int
                        expr: _col2
                        type: int
                        expr: _col3
                        type: int
                  sort order: ++++
                  Map-reduce partition columns:
                        expr: _col2
                        type: int
                        expr: _col1
                        type: int
                  tag: -1
      Reduce Operator Tree:
        Group By Operator
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
                expr: KEY._col1
                type: int
                expr: KEY._col2
                type: int
                expr: KEY._col3
                type: int
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Select Operator
            expressions:
                  expr: _col2
                  type: int
                  expr: _col1
                  type: int
            outputColumnNames: _col2, _col1
            Group By Operator
              bucketGroup: false
              keys:
                    expr: _col2
                    type: int
                    expr: _col1
                    type: int
              mode: complete
              outputColumnNames: _col0, _col1
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                outputColumnNames: _col0, _col1
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}

                
> ReduceSinkDeDuplication should ignore column orders when check overlapping 
> part of keys between parent and child
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5358
>                 URL: https://issues.apache.org/jira/browse/HIVE-5358
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Chun Chen
>            Assignee: Chun Chen
>         Attachments: D13113.1.patch, HIVE-5358.2.patch, HIVE-5358.patch
>
>
> {code}
> select key, value from (select key, value from src group by key, value) t 
> group by key, value;
> {code}
> This can be optimized by ReduceSinkDeDuplication
> {code}
> select key, value from (select key, value from src group by key, value) t 
> group by value, key;
> {code}
> However, the SQL above currently can't be optimized by 
> ReduceSinkDeDuplication because the parent and child operators use 
> different column orders.

