Yin Huai created HIVE-4867:
------------------------------

             Summary: Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
                 Key: HIVE-4867
                 URL: https://issues.apache.org/jira/browse/HIVE-4867
             Project: Hive
          Issue Type: Improvement
            Reporter: Yin Huai


A ReduceSinkOperator emits data as keys and values. Right now, a column may 
appear in both the key list and the value list, which results in unnecessary 
overhead for shuffling. 
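
As a rough illustration of the proposed deduplication, value columns that are already carried in the key could be dropped before the shuffle. This is only a sketch, not Hive code: the helper name `dedup_value_columns` and the use of plain column-name strings are hypothetical, since ReduceSinkOperator actually works on expression descriptors.

```python
def dedup_value_columns(key_cols, value_cols):
    """Return the value columns that are not already present in the key list.

    Hypothetical helper for illustration only; column names stand in for
    the expression descriptors a real ReduceSinkOperator would hold.
    """
    key_set = set(key_cols)
    # Keep the original order of the value columns that survive.
    return [col for col in value_cols if col not in key_set]


# A case where the same column is both the key and the value.
keys = ["_col0"]
values = ["_col0"]
remaining_values = dedup_value_columns(keys, values)  # nothing left to ship as value
```

With a change like this, the reduce side would presumably have to read such columns back from the key rather than from the value.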

Example:
We have a query shown below ...
{code:sql}
explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
{code}

The plan is ...
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        store_sales 
          TableScan
            alias: store_sales
            Select Operator
              expressions:
                    expr: ss_ticket_number
                    type: int
              outputColumnNames: _col0
              Reduce Output Operator
                key expressions:
                      expr: _col0
                      type: int
                sort order: +
                Map-reduce partition columns:
                      expr: _col0
                      type: int
                tag: -1
                value expressions:
                      expr: _col0
                      type: int
      Reduce Operator Tree:
        Extract
          File Output Operator
            compressed: false
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

{code}

The column 'ss_ticket_number' is in both the key list and the value list of 
the ReduceSinkOperator. The type of ss_ticket_number is int. In this case, 
BinarySortableSerDe adds 1 extra byte for every int in the key, and 
LazyBinarySerDe adds further overhead when recording the length of an int. 
As a rough estimate, about 10 bytes are emitted from the Map phase for every 
int. 
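
To put the estimate in perspective, here is a back-of-the-envelope calculation. It is a sketch under stated assumptions: the row count is hypothetical, the split of the ~10 bytes into key and value parts is assumed (only the "4-byte int plus 1 byte" key figure comes from the description above), and per-record framing overhead is ignored.

```python
KEY_INT_BYTES = 4 + 1   # 4-byte int plus the 1 extra byte added in the key
TOTAL_INT_BYTES = 10    # rough per-int estimate for key + value together
# Assumption: everything that is not key bytes is spent on the value copy.
VALUE_INT_BYTES = TOTAL_INT_BYTES - KEY_INT_BYTES


def map_output_bytes(rows, deduplicated):
    """Estimate map output size for one int column used as key and value."""
    per_row = KEY_INT_BYTES
    if not deduplicated:
        per_row += VALUE_INT_BYTES  # the duplicated copy in the value list
    return rows * per_row


rows = 1_000_000  # hypothetical table size
before = map_output_bytes(rows, deduplicated=False)
after = map_output_bytes(rows, deduplicated=True)
saved_fraction = (before - after) / before
```

Under these assumptions, dropping the duplicated value copy would roughly halve the shuffled data for this query; the real saving depends on the actual serialized sizes.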

