[
https://issues.apache.org/jira/browse/HIVE-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132462#comment-14132462
]
Prasanth J commented on HIVE-7156:
----------------------------------
[~gopalv] I just tried the same query at a smaller scale (scale=1), and I am
seeing a reduction in the number of rows for the map-side GBY.
{code}
hive> explain select distinct L_SHIPDATE from lineitem;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE)
DagName: pjayachandran_20140912184747_ea2ce271-e866-4bea-a854-d38ce01f8023:4
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: lineitem
Statistics: Num rows: 31538 Data size: 4386987 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: l_shipdate (type: string)
outputColumnNames: l_shipdate
Statistics: Num rows: 31538 Data size: 4386987 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
keys: l_shipdate (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
{code}
The only case where the number of rows increases is when column stats are not
available and map-side parallelism is > 1 (map-side parallelism = table size /
max split size), which is the worst case. Can you verify again in your next run?
I will give it another shot with a bigger-scale dataset.
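
For reference, here is a rough sketch of the estimate being discussed, not the
actual code in the patch. It assumes column stats are available, rounds the
parallelism up to at least 1, and caps the map-side estimate at the input row
count. The function names and the 256 MB max split size are made up for
illustration only.

```python
import math

# Hypothetical max split size, chosen only for illustration (256 MB).
ASSUMED_MAX_SPLIT_SIZE = 256 * 1024 * 1024


def map_side_parallelism(table_size_bytes, max_split_size_bytes=ASSUMED_MAX_SPLIT_SIZE):
    """Map-side parallelism = table size / max split size.

    Rounding up and flooring at 1 are assumptions of this sketch.
    """
    return max(1, math.ceil(table_size_bytes / max_split_size_bytes))


def estimate_map_side_gby_rows(input_rows, ndv, parallelism):
    """Estimate rows emitted by a map-side (hash mode) Group By.

    Each mapper can emit at most `ndv` distinct keys, so the total is
    ndv * parallelism. Capping at `input_rows` is an assumption of this
    sketch: a group-by cannot emit more rows than it reads.
    """
    return min(input_rows, ndv * parallelism)


# scale=1 numbers from the plan above: the table fits in a single split.
small_parallelism = map_side_parallelism(4386987)
small_estimate = estimate_map_side_gby_rows(31538, 1955, small_parallelism)

# Numbers from the issue description: many splits, so the map-side estimate
# is well above the NDV but still far below the input row count.
large_parallelism = map_side_parallelism(4745677733354)
large_estimate = estimate_map_side_gby_rows(5999989709, 1955, large_parallelism)
```

With a single split the estimate collapses to the NDV (1955), which matches the
scale=1 plan. The case without column stats is deliberately not modelled here.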
> Group-By operator stat-annotation only uses distinct approx to generate rollups
> -------------------------------------------------------------------------------
>
> Key: HIVE-7156
> URL: https://issues.apache.org/jira/browse/HIVE-7156
> Project: Hive
> Issue Type: Sub-task
> Affects Versions: 0.14.0
> Reporter: Gopal V
> Assignee: Prasanth J
> Attachments: HIVE-7156.1.patch, HIVE-7156.2.patch, HIVE-7156.3.patch,
> HIVE-7156.4.patch
>
>
> The stats annotation for a group-by only annotates the reduce-side row-count
> with the distinct values.
> The map-side gets the row-count as the rows output instead of distinct *
> parallelism, while the reducer side gets the correct parallelism.
> {code}
> hive> explain select distinct L_SHIPDATE from lineitem;
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: lineitem
> Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: l_shipdate (type: string)
> outputColumnNames: l_shipdate
> Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE
> Group By Operator
> keys: l_shipdate (type: string)
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: string)
> sort order: +
> Map-reduce partition columns: _col0 (type: string)
> Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
> Execution mode: vectorized
> Reducer 2
> Reduce Operator Tree:
> Group By Operator
> keys: KEY._col0 (type: string)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: _col0 (type: string)
> outputColumnNames: _col0
> Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)