[
https://issues.apache.org/jira/browse/HIVE-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Prasanth J updated HIVE-8168:
-----------------------------
Attachment: HIVE-8168.3.patch
Previous patch added non-join column filter on FK side instead of PK side.
Added more tests in this patch.
> With dynamic partition enabled fact table selectivity is not taken into
> account when generating the physical plan (Use CBO cardinality using physical
> plan generation)
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-8168
> URL: https://issues.apache.org/jira/browse/HIVE-8168
> Project: Hive
> Issue Type: Bug
> Components: Tez
> Affects Versions: 0.14.0
> Reporter: Mostafa Mokhtar
> Assignee: Prasanth J
> Priority: Critical
> Labels: performance
> Fix For: vectorization-branch, 0.14.0
>
> Attachments: HIVE-8168.1.patch, HIVE-8168.2.patch, HIVE-8168.3.patch
>
>
> When calculating estimate row counts & data size during physical plan
> generation in StatsRulesProcFactory doesn't know that there will be dynamic
> partition pruning and it is hard to know how many partitions will qualify at
> runtime, as a result with Dynamic partition pruning enabled a query 32 can
> run with 570 compared to 70 tasks with dynamic partition pruning disabled and
> actual partition filters on the fact table.
> The long term solution for this issue is to use the cardinality estimates
> from CBO as it takes into account join selectivity and such, estimate from
> CBO won't address the number of the tasks used for the partitioned table but
> they will address the incorrect number of tasks used for the concequent
> reducers where the majority of the slowdown is coming from.
> Plan dynamic partition pruning on
> {code}
> Map 5
> Map Operator Tree:
> TableScan
> alias: ss
> filterExpr: ss_store_sk is not null (type: boolean)
> Statistics: Num rows: 550076554 Data size: 47370018896
> Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: ss_store_sk is not null (type: boolean)
> Statistics: Num rows: 275038277 Data size: 23685009448
> Basic stats: COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {ss_store_sk} {ss_net_profit}
> 1
> keys:
> 0 ss_sold_date_sk (type: int)
> 1 d_date_sk (type: int)
> outputColumnNames: _col6, _col21
> input vertices:
> 1 Map 1
> Statistics: Num rows: 302542112 Data size: 26053511168
> Basic stats: COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {_col21}
> 1 {s_county} {s_state}
> keys:
> 0 _col6 (type: int)
> 1 s_store_sk (type: int)
> outputColumnNames: _col21, _col80, _col81
> input vertices:
> 1 Map 2
> Statistics: Num rows: 332796320 Data size:
> 28658862080 Basic stats: COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Left Semi Join 0 to 1
> condition expressions:
> 0 {_col21} {_col80} {_col81}
> 1
> keys:
> 0 _col81 (type: string)
> 1 _col0 (type: string)
> outputColumnNames: _col21, _col80, _col81
> input vertices:
> 1 Reducer 11
> Statistics: Num rows: 366075968 Data size:
> 31524749312 Basic stats: COMPLETE Column stats: NONE
> Select Operator
> expressions: _col81 (type: string), _col80 (type:
> string), _col21 (type: float)
> outputColumnNames: _col81, _col80, _col21
> Statistics: Num rows: 366075968 Data size:
> 31524749312 Basic stats: COMPLETE Column stats: NONE
> Group By Operator
> aggregations: sum(_col21)
> keys: _col81 (type: string), _col80 (type:
> string), '0' (type: string)
> mode: hash
> outputColumnNames: _col0, _col1, _col2, _col3
> Statistics: Num rows: 1098227904 Data size:
> 94574247936 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type: string), _col1
> (type: string), _col2 (type: string)
> sort order: +++
> Map-reduce partition columns: _col0 (type:
> string), _col1 (type: string), _col2 (type: string)
> Statistics: Num rows: 1098227904 Data size:
> 94574247936 Basic stats: COMPLETE Column stats: NONE
> value expressions: _col3 (type: double)
> {code}
> Plan snippet with partition pruning off and explicit partition filters
> {code}
> Map 5
> Map Operator Tree:
> TableScan
> alias: ss
> filterExpr: ((ss_sold_date_sk is not null and ss_store_sk
> is not null) and ss_sold_date BETWEEN '1999-06-01' AND '2000-05-31') (type:
> boolean)
> Statistics: Num rows: 110339135 Data size: 4817453454 Basic
> stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: (ss_sold_date_sk is not null and ss_store_sk
> is not null) (type: boolean)
> Statistics: Num rows: 27584784 Data size: 1204363374
> Basic stats: COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {ss_store_sk} {ss_net_profit}
> 1
> keys:
> 0 ss_sold_date_sk (type: int)
> 1 d_date_sk (type: int)
> outputColumnNames: _col7, _col22
> input vertices:
> 1 Map 1
> Statistics: Num rows: 30343264 Data size: 1324799744
> Basic stats: COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {_col22}
> 1 {s_county} {s_state}
> keys:
> 0 _col7 (type: int)
> 1 s_store_sk (type: int)
> outputColumnNames: _col22, _col81, _col82
> input vertices:
> 1 Map 2
> Statistics: Num rows: 33377592 Data size: 1457279744
> Basic stats: COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Left Semi Join 0 to 1
> condition expressions:
> 0 {_col22} {_col81} {_col82}
> 1
> keys:
> 0 _col82 (type: string)
> 1 _col0 (type: string)
> outputColumnNames: _col22, _col81, _col82
> input vertices:
> 1 Reducer 11
> Statistics: Num rows: 36715352 Data size:
> 1603007744 Basic stats: COMPLETE Column stats: NONE
> Select Operator
> expressions: _col82 (type: string), _col81 (type:
> string), _col22 (type: float)
> outputColumnNames: _col82, _col81, _col22
> Statistics: Num rows: 36715352 Data size:
> 1603007744 Basic stats: COMPLETE Column stats: NONE
> Group By Operator
> aggregations: sum(_col22)
> keys: _col82 (type: string), _col81 (type:
> string), '0' (type: string)
> mode: hash
> outputColumnNames: _col0, _col1, _col2, _col3
> Statistics: Num rows: 36715352 Data size:
> 1603007744 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type: string), _col1
> (type: string), _col2 (type: string)
> sort order: +++
> Map-reduce partition columns: _col0 (type:
> string), _col1 (type: string), _col2 (type: string)
> Statistics: Num rows: 36715352 Data size:
> 1603007744 Basic stats: COMPLETE Column stats: NONE
> value expressions: _col3 (type: double)
> {code}
> Plan snippet with partition pruning on and fetch column stats also on , in
> this case row estimate a
> {code}
> Map 5
> Map Operator Tree:
> TableScan
> alias: ss
> filterExpr: ss_store_sk is not null (type: boolean)
> Statistics: Num rows: 550076554 Data size: 47370018896
> Basic stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: ss_store_sk is not null (type: boolean)
> Statistics: Num rows: 537120379 Data size: 4195767284
> Basic stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {ss_store_sk} {ss_net_profit}
> 1
> keys:
> 0 ss_sold_date_sk (type: int)
> 1 d_date_sk (type: int)
> outputColumnNames: _col6, _col21
> input vertices:
> 1 Map 1
> Statistics: Num rows: 36524 Data size: 292192 Basic
> stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {_col21}
> 1 {s_county} {s_state}
> keys:
> 0 _col6 (type: int)
> 1 s_store_sk (type: int)
> outputColumnNames: _col21, _col80, _col81
> input vertices:
> 1 Map 2
> Statistics: Num rows: 45017 Data size: 8508213 Basic
> stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Left Semi Join 0 to 1
> condition expressions:
> 0 {_col21} {_col80} {_col81}
> 1
> keys:
> 0 _col81 (type: string)
> 1 _col0 (type: string)
> outputColumnNames: _col21, _col80, _col81
> input vertices:
> 1 Reducer 11
> Statistics: Num rows: 49518 Data size: 9359035
> Basic stats: COMPLETE Column stats: NONE
> Select Operator
> expressions: _col81 (type: string), _col80 (type:
> string), _col21 (type: float)
> outputColumnNames: _col81, _col80, _col21
> Statistics: Num rows: 49518 Data size: 9359035
> Basic stats: COMPLETE Column stats: NONE
> Group By Operator
> aggregations: sum(_col21)
> keys: _col81 (type: string), _col80 (type:
> string), '0' (type: string)
> mode: hash
> outputColumnNames: _col0, _col1, _col2, _col3
> Statistics: Num rows: 148554 Data size:
> 28077105 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type: string), _col1
> (type: string), _col2 (type: string)
> sort order: +++
> Map-reduce partition columns: _col0 (type:
> string), _col1 (type: string), _col2 (type: string)
> Statistics: Num rows: 148554 Data size:
> 28077105 Basic stats: COMPLETE Column stats: NONE
> value expressions: _col3 (type: double)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)