Mostafa Mokhtar created HIVE-8196:
-------------------------------------
Summary: Joining on partition columns with fetch column stats
enabled results it very small CE which negatively affects query performance
Key: HIVE-8196
URL: https://issues.apache.org/jira/browse/HIVE-8196
Project: Hive
Issue Type: Bug
Components: Tez
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Gunther Hagleitner
Priority: Critical
Fix For: vectorization-branch, 0.14.0
When calculating estimate row counts & data size during physical plan
generation in StatsRulesProcFactory doesn't know that there will be dynamic
partition pruning and it is hard to know how many partitions will qualify at
runtime, as a result with Dynamic partition pruning enabled a query 32 can run
with 570 compared to 70 tasks with dynamic partition pruning disabled and
actual partition filters on the fact table.
The long term solution for this issue is to use the cardinality estimates from
CBO as it takes into account join selectivity and such, estimate from CBO won't
address the number of the tasks used for the partitioned table but they will
address the incorrect number of tasks used for the concequent reducers where
the majority of the slowdown is coming from.
Plan dynamic partition pruning on
{code}
Map 5
Map Operator Tree:
TableScan
alias: ss
filterExpr: ss_store_sk is not null (type: boolean)
Statistics: Num rows: 550076554 Data size: 47370018896 Basic
stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ss_store_sk is not null (type: boolean)
Statistics: Num rows: 275038277 Data size: 23685009448
Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {ss_store_sk} {ss_net_profit}
1
keys:
0 ss_sold_date_sk (type: int)
1 d_date_sk (type: int)
outputColumnNames: _col6, _col21
input vertices:
1 Map 1
Statistics: Num rows: 302542112 Data size: 26053511168
Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {_col21}
1 {s_county} {s_state}
keys:
0 _col6 (type: int)
1 s_store_sk (type: int)
outputColumnNames: _col21, _col80, _col81
input vertices:
1 Map 2
Statistics: Num rows: 332796320 Data size: 28658862080
Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Semi Join 0 to 1
condition expressions:
0 {_col21} {_col80} {_col81}
1
keys:
0 _col81 (type: string)
1 _col0 (type: string)
outputColumnNames: _col21, _col80, _col81
input vertices:
1 Reducer 11
Statistics: Num rows: 366075968 Data size:
31524749312 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col81 (type: string), _col80 (type:
string), _col21 (type: float)
outputColumnNames: _col81, _col80, _col21
Statistics: Num rows: 366075968 Data size:
31524749312 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(_col21)
keys: _col81 (type: string), _col80 (type:
string), '0' (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1098227904 Data size:
94574247936 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1
(type: string), _col2 (type: string)
sort order: +++
Map-reduce partition columns: _col0 (type:
string), _col1 (type: string), _col2 (type: string)
Statistics: Num rows: 1098227904 Data size:
94574247936 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: double)
{code}
Plan snippet with partition pruning off and explicit partition filters
{code}
Map 5
Map Operator Tree:
TableScan
alias: ss
filterExpr: ((ss_sold_date_sk is not null and ss_store_sk is
not null) and ss_sold_date BETWEEN '1999-06-01' AND '2000-05-31') (type:
boolean)
Statistics: Num rows: 110339135 Data size: 4817453454 Basic
stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (ss_sold_date_sk is not null and ss_store_sk is
not null) (type: boolean)
Statistics: Num rows: 27584784 Data size: 1204363374 Basic
stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {ss_store_sk} {ss_net_profit}
1
keys:
0 ss_sold_date_sk (type: int)
1 d_date_sk (type: int)
outputColumnNames: _col7, _col22
input vertices:
1 Map 1
Statistics: Num rows: 30343264 Data size: 1324799744
Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {_col22}
1 {s_county} {s_state}
keys:
0 _col7 (type: int)
1 s_store_sk (type: int)
outputColumnNames: _col22, _col81, _col82
input vertices:
1 Map 2
Statistics: Num rows: 33377592 Data size: 1457279744
Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Left Semi Join 0 to 1
condition expressions:
0 {_col22} {_col81} {_col82}
1
keys:
0 _col82 (type: string)
1 _col0 (type: string)
outputColumnNames: _col22, _col81, _col82
input vertices:
1 Reducer 11
Statistics: Num rows: 36715352 Data size: 1603007744
Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col82 (type: string), _col81 (type:
string), _col22 (type: float)
outputColumnNames: _col82, _col81, _col22
Statistics: Num rows: 36715352 Data size:
1603007744 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(_col22)
keys: _col82 (type: string), _col81 (type:
string), '0' (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 36715352 Data size:
1603007744 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1
(type: string), _col2 (type: string)
sort order: +++
Map-reduce partition columns: _col0 (type:
string), _col1 (type: string), _col2 (type: string)
Statistics: Num rows: 36715352 Data size:
1603007744 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: double)
{code}
Plan snippet with partition pruning on and fetch column stats also on , in this
case row estimate a
{code}
Map 5
Map Operator Tree:
TableScan
alias: ss
filterExpr: ss_store_sk is not null (type: boolean)
Statistics: Num rows: 550076554 Data size: 47370018896 Basic
stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: ss_store_sk is not null (type: boolean)
Statistics: Num rows: 537120379 Data size: 4195767284 Basic
stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {ss_store_sk} {ss_net_profit}
1
keys:
0 ss_sold_date_sk (type: int)
1 d_date_sk (type: int)
outputColumnNames: _col6, _col21
input vertices:
1 Map 1
Statistics: Num rows: 36524 Data size: 292192 Basic
stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {_col21}
1 {s_county} {s_state}
keys:
0 _col6 (type: int)
1 s_store_sk (type: int)
outputColumnNames: _col21, _col80, _col81
input vertices:
1 Map 2
Statistics: Num rows: 45017 Data size: 8508213 Basic
stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Left Semi Join 0 to 1
condition expressions:
0 {_col21} {_col80} {_col81}
1
keys:
0 _col81 (type: string)
1 _col0 (type: string)
outputColumnNames: _col21, _col80, _col81
input vertices:
1 Reducer 11
Statistics: Num rows: 49518 Data size: 9359035 Basic
stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col81 (type: string), _col80 (type:
string), _col21 (type: float)
outputColumnNames: _col81, _col80, _col21
Statistics: Num rows: 49518 Data size: 9359035
Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(_col21)
keys: _col81 (type: string), _col80 (type:
string), '0' (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 148554 Data size: 28077105
Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1
(type: string), _col2 (type: string)
sort order: +++
Map-reduce partition columns: _col0 (type:
string), _col1 (type: string), _col2 (type: string)
Statistics: Num rows: 148554 Data size:
28077105 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: double)
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)