[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Navis updated HIVE-4867:
------------------------

    Attachment:     (was: HIVE-4867.1.patch.txt)

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>
> A ReduceSinkOperator emits data as keys and values. Right now, a column may
> appear in both the key list and the value list, which results in unnecessary
> shuffle overhead.
> Example:
> We have the query shown below ...
> {code:sql}
> explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
> {code}
> The plan is ...
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         store_sales
>           TableScan
>             alias: store_sales
>             Select Operator
>               expressions:
>                     expr: ss_ticket_number
>                     type: int
>               outputColumnNames: _col0
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
>                 value expressions:
>                       expr: _col0
>                       type: int
>       Reduce Operator Tree:
>         Extract
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The column 'ss_ticket_number' is in both the key list and the value list of
> the ReduceSinkOperator. The type of ss_ticket_number is int. In this case,
> BinarySortableSerDe will introduce 1 extra byte for every int in the key.
> LazyBinarySerDe will also introduce overhead when recording the length of an int.
> For every int, 10 bytes is a rough estimate of the size of the data emitted
> from the Map phase.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
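
The deduplication proposed in the issue above can be sketched roughly as follows. This is an illustration only, not Hive's implementation: the function names `dedupe_value_columns` and `estimate_shuffle_bytes` are hypothetical, and the per-column byte costs are assumptions chosen to match the rough 10-byte figure in the description (a 4-byte int plus about 1 byte of overhead on each of the key and value sides). Actual serialized sizes depend on the SerDe and on the data.

```python
# ASSUMPTION: rough per-int costs, picked so that key + value is about
# 10 bytes as estimated in the issue description. Not measured values.
ASSUMED_KEY_INT_BYTES = 5    # 4-byte int + 1 extra byte in the key
ASSUMED_VALUE_INT_BYTES = 5  # 4-byte int + about 1 byte of overhead


def dedupe_value_columns(key_cols, value_cols):
    """Return the value columns that are not already present in the key list.

    Hypothetical helper: column order of the remaining values is preserved.
    """
    key_set = set(key_cols)
    return [col for col in value_cols if col not in key_set]


def estimate_shuffle_bytes(key_cols, value_cols):
    """Rough per-row size estimate, treating every column as an int."""
    return (len(key_cols) * ASSUMED_KEY_INT_BYTES
            + len(value_cols) * ASSUMED_VALUE_INT_BYTES)


# Column lists taken from the Reduce Output Operator in the plan above.
key_cols = ["_col0"]
value_cols = ["_col0"]

deduped = dedupe_value_columns(key_cols, value_cols)
before = estimate_shuffle_bytes(key_cols, value_cols)
after = estimate_shuffle_bytes(key_cols, deduped)
print(f"value columns after dedup: {deduped}")
print(f"estimated bytes per row: {before} -> {after}")
```

Under these assumed costs, the example plan would shuffle about half as many bytes per row once `_col0` is dropped from the value list. The reduce side would then need to read that column from the key instead of the value.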