Yin Huai created HIVE-4867:
------------------------------
Summary: Deduplicate columns appearing in both the key list and
value list of ReduceSinkOperator
Key: HIVE-4867
URL: https://issues.apache.org/jira/browse/HIVE-4867
Project: Hive
Issue Type: Improvement
Reporter: Yin Huai
A ReduceSinkOperator emits data as keys and values. Right now, a column may
appear in both the key list and the value list, which results in unnecessary
shuffle overhead.
Example:
Consider the query shown below:
{code:sql}
explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
{code}
The plan is:
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        store_sales
          TableScan
            alias: store_sales
            Select Operator
              expressions:
                    expr: ss_ticket_number
                    type: int
              outputColumnNames: _col0
              Reduce Output Operator
                key expressions:
                      expr: _col0
                      type: int
                sort order: +
                Map-reduce partition columns:
                      expr: _col0
                      type: int
                tag: -1
                value expressions:
                      expr: _col0
                      type: int
      Reduce Operator Tree:
        Extract
          File Output Operator
            compressed: false
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}
The column 'ss_ticket_number' appears in both the key list and the value list
of the ReduceSinkOperator, and its type is int. In this case,
BinarySortableSerDe adds 1 extra byte for every int in the key, and
LazyBinarySerDe adds further overhead in the value when recording the length
of an int. As a rough estimate, the Map phase emits about 10 bytes for every
int.
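
The 10-byte figure can be reproduced with a small back-of-the-envelope sketch.
The byte layouts used here are assumptions for illustration and were not
measured on a running job: the key is assumed to cost 1 marker byte plus a
fixed 4-byte int, and the value is assumed to cost 1 null-flag byte plus a
Hadoop-style variable-length int (1 to 5 bytes). The function names are
hypothetical and are not part of Hive.

```python
# Rough per-row shuffle size for one int column that appears in both the key
# list and the value list of a ReduceSinkOperator.
# All byte counts below are ASSUMED layouts, not measurements from Hive.

SORTABLE_MARKER_BYTES = 1  # assumed: 1 marker byte per key field
SORTABLE_INT_BYTES = 4     # assumed: fixed-width int in the key
LAZY_NULL_FLAG_BYTES = 1   # assumed: 1 null-flag byte for up to 8 value fields


def vint_size(value: int) -> int:
    """Size in bytes of a Hadoop-style variable-length int (assumed encoding)."""
    if -112 <= value <= 127:
        return 1  # small values fit in a single byte
    # Negative values are stored as their one's complement.
    magnitude = value if value >= 0 else ~value
    payload = (magnitude.bit_length() + 7) // 8
    return 1 + payload  # 1 length byte + payload bytes


def key_bytes() -> int:
    """Bytes for one int in the key under the assumed sortable layout."""
    return SORTABLE_MARKER_BYTES + SORTABLE_INT_BYTES


def value_bytes(value: int) -> int:
    """Bytes for one int in the value under the assumed lazy-binary layout."""
    return LAZY_NULL_FLAG_BYTES + vint_size(value)


def row_bytes(value: int) -> tuple:
    """Return (key bytes, value bytes, total bytes) for one emitted row."""
    k = key_bytes()
    v = value_bytes(value)
    return k, v, k + v
```

Under these assumptions, row_bytes(1000000) gives 5 key bytes and 5 value
bytes, 10 in total. Emitting the column only in the key would save up to the
5 value bytes for that row.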