[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

Yin Huai (JIRA) Wed, 17 Jul 2013 14:04:22 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711619#comment-13711619
 ]


Yin Huai commented on HIVE-4867:
--------------------------------

I am assigning this to myself for now. If anyone else wants to work on it, feel free to take it.
                
> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>
> A ReduceSinkOperator emits data as keys and values. Right now, 
> a column may appear in both the key list and the value list, which results in 
> unnecessary shuffle overhead. 
> Example:
> Consider the query shown below:
> {code:sql}
> explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
> {code}
> The resulting plan is:
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         store_sales 
>           TableScan
>             alias: store_sales
>             Select Operator
>               expressions:
>                     expr: ss_ticket_number
>                     type: int
>               outputColumnNames: _col0
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
>                 value expressions:
>                       expr: _col0
>                       type: int
>       Reduce Operator Tree:
>         Extract
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The column 'ss_ticket_number' appears in both the key list and the value list of the 
> ReduceSinkOperator. The type of ss_ticket_number is int. In this case, 
> BinarySortableSerDe adds 1 extra byte for every int in the key, and 
> LazyBinarySerDe adds further overhead when recording the length of an 
> int. As a rough estimate, about 10 bytes are emitted from the Map phase 
> for every int. 
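
To make the proposed deduplication concrete, here is a minimal sketch of the idea in Python. It is not Hive code and not the planned implementation: the function name dedupe_reduce_sink_columns and the ("KEY", i) / ("VALUE", i) source tags are hypothetical, chosen only for illustration. It drops every value column that already appears in the key list and records where the reduce side would have to read each original value column from.

```python
def dedupe_reduce_sink_columns(key_cols, value_cols):
    """Hypothetical sketch: remove value columns already present in the key list.

    Returns (kept_values, sources). sources has one entry per original value
    column: ("KEY", i) if the column can be read from key position i, or
    ("VALUE", j) if it is still shipped at position j of the reduced value list.
    """
    # Remember the first key position of each column name.
    key_index = {}
    for i, col in enumerate(key_cols):
        key_index.setdefault(col, i)

    kept_values = []
    sources = []
    for col in value_cols:
        if col in key_index:
            # Already shuffled as part of the key, so do not emit it again.
            sources.append(("KEY", key_index[col]))
        else:
            sources.append(("VALUE", len(kept_values)))
            kept_values.append(col)
    return kept_values, sources


# The plan above: _col0 is both the only key and the only value.
kept, sources = dedupe_reduce_sink_columns(["_col0"], ["_col0"])
print(kept, sources)
```

For the plan in the description, the value list becomes empty and the single output column is read back from the key, so each row would carry the int only once during the shuffle.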

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
