[ 
https://issues.apache.org/jira/browse/HIVE-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth J updated HIVE-8168:
-----------------------------
    Status: Patch Available  (was: Open)

> With dynamic partition enabled fact table selectivity is not taken into 
> account when generating the physical plan (Use CBO cardinality using physical 
> plan generation)
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-8168
>                 URL: https://issues.apache.org/jira/browse/HIVE-8168
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>    Affects Versions: 0.14.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Prasanth J
>            Priority: Critical
>              Labels: performance
>             Fix For: vectorization-branch, 0.14.0
>
>         Attachments: HIVE-8168.1.patch
>
>
> When calculating estimate row counts & data size during physical plan 
> generation in StatsRulesProcFactory doesn't know that there will be dynamic 
> partition pruning and it is hard to know how many partitions will qualify at 
> runtime, as a result with Dynamic partition pruning enabled a query 32 can 
> run with 570 compared to 70 tasks with dynamic partition pruning disabled and 
> actual partition filters on the fact table.
> The long term solution for this issue is to use the cardinality estimates 
> from CBO as it takes into account join selectivity and such, estimate from 
> CBO won't address the number of the tasks used for the partitioned table but 
> they will address the incorrect number of tasks used for the concequent 
> reducers where the majority of the slowdown is coming from.
> Plan dynamic partition pruning on 
> {code}
>    Map 5 
>             Map Operator Tree:
>                 TableScan
>                   alias: ss
>                   filterExpr: ss_store_sk is not null (type: boolean)
>                   Statistics: Num rows: 550076554 Data size: 47370018896 
> Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: ss_store_sk is not null (type: boolean)
>                     Statistics: Num rows: 275038277 Data size: 23685009448 
> Basic stats: COMPLETE Column stats: NONE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       condition expressions:
>                         0 {ss_store_sk} {ss_net_profit}
>                         1 
>                       keys:
>                         0 ss_sold_date_sk (type: int)
>                         1 d_date_sk (type: int)
>                       outputColumnNames: _col6, _col21
>                       input vertices:
>                         1 Map 1
>                       Statistics: Num rows: 302542112 Data size: 26053511168 
> Basic stats: COMPLETE Column stats: NONE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         condition expressions:
>                           0 {_col21}
>                           1 {s_county} {s_state}
>                         keys:
>                           0 _col6 (type: int)
>                           1 s_store_sk (type: int)
>                         outputColumnNames: _col21, _col80, _col81
>                         input vertices:
>                           1 Map 2
>                         Statistics: Num rows: 332796320 Data size: 
> 28658862080 Basic stats: COMPLETE Column stats: NONE
>                         Map Join Operator
>                           condition map:
>                                Left Semi Join 0 to 1
>                           condition expressions:
>                             0 {_col21} {_col80} {_col81}
>                             1 
>                           keys:
>                             0 _col81 (type: string)
>                             1 _col0 (type: string)
>                           outputColumnNames: _col21, _col80, _col81
>                           input vertices:
>                             1 Reducer 11
>                           Statistics: Num rows: 366075968 Data size: 
> 31524749312 Basic stats: COMPLETE Column stats: NONE
>                           Select Operator
>                             expressions: _col81 (type: string), _col80 (type: 
> string), _col21 (type: float)
>                             outputColumnNames: _col81, _col80, _col21
>                             Statistics: Num rows: 366075968 Data size: 
> 31524749312 Basic stats: COMPLETE Column stats: NONE
>                             Group By Operator
>                               aggregations: sum(_col21)
>                               keys: _col81 (type: string), _col80 (type: 
> string), '0' (type: string)
>                               mode: hash
>                               outputColumnNames: _col0, _col1, _col2, _col3
>                               Statistics: Num rows: 1098227904 Data size: 
> 94574247936 Basic stats: COMPLETE Column stats: NONE
>                               Reduce Output Operator
>                                 key expressions: _col0 (type: string), _col1 
> (type: string), _col2 (type: string)
>                                 sort order: +++
>                                 Map-reduce partition columns: _col0 (type: 
> string), _col1 (type: string), _col2 (type: string)
>                                 Statistics: Num rows: 1098227904 Data size: 
> 94574247936 Basic stats: COMPLETE Column stats: NONE
>                                 value expressions: _col3 (type: double)
> {code}
> Plan snippet with partition pruning off and explicit partition filters
> {code}
>     Map 5 
>             Map Operator Tree:
>                 TableScan
>                   alias: ss
>                   filterExpr: ((ss_sold_date_sk is not null and ss_store_sk 
> is not null) and ss_sold_date BETWEEN '1999-06-01' AND '2000-05-31') (type: 
> boolean)
>                   Statistics: Num rows: 110339135 Data size: 4817453454 Basic 
> stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: (ss_sold_date_sk is not null and ss_store_sk 
> is not null) (type: boolean)
>                     Statistics: Num rows: 27584784 Data size: 1204363374 
> Basic stats: COMPLETE Column stats: NONE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       condition expressions:
>                         0 {ss_store_sk} {ss_net_profit}
>                         1 
>                       keys:
>                         0 ss_sold_date_sk (type: int)
>                         1 d_date_sk (type: int)
>                       outputColumnNames: _col7, _col22
>                       input vertices:
>                         1 Map 1
>                       Statistics: Num rows: 30343264 Data size: 1324799744 
> Basic stats: COMPLETE Column stats: NONE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         condition expressions:
>                           0 {_col22}
>                           1 {s_county} {s_state}
>                         keys:
>                           0 _col7 (type: int)
>                           1 s_store_sk (type: int)
>                         outputColumnNames: _col22, _col81, _col82
>                         input vertices:
>                           1 Map 2
>                         Statistics: Num rows: 33377592 Data size: 1457279744 
> Basic stats: COMPLETE Column stats: NONE
>                         Map Join Operator
>                           condition map:
>                                Left Semi Join 0 to 1
>                           condition expressions:
>                             0 {_col22} {_col81} {_col82}
>                             1 
>                           keys:
>                             0 _col82 (type: string)
>                             1 _col0 (type: string)
>                           outputColumnNames: _col22, _col81, _col82
>                           input vertices:
>                             1 Reducer 11
>                           Statistics: Num rows: 36715352 Data size: 
> 1603007744 Basic stats: COMPLETE Column stats: NONE
>                           Select Operator
>                             expressions: _col82 (type: string), _col81 (type: 
> string), _col22 (type: float)
>                             outputColumnNames: _col82, _col81, _col22
>                             Statistics: Num rows: 36715352 Data size: 
> 1603007744 Basic stats: COMPLETE Column stats: NONE
>                             Group By Operator
>                               aggregations: sum(_col22)
>                               keys: _col82 (type: string), _col81 (type: 
> string), '0' (type: string)
>                               mode: hash
>                               outputColumnNames: _col0, _col1, _col2, _col3
>                               Statistics: Num rows: 36715352 Data size: 
> 1603007744 Basic stats: COMPLETE Column stats: NONE
>                               Reduce Output Operator
>                                 key expressions: _col0 (type: string), _col1 
> (type: string), _col2 (type: string)
>                                 sort order: +++
>                                 Map-reduce partition columns: _col0 (type: 
> string), _col1 (type: string), _col2 (type: string)
>                                 Statistics: Num rows: 36715352 Data size: 
> 1603007744 Basic stats: COMPLETE Column stats: NONE
>                                 value expressions: _col3 (type: double)
> {code}
> Plan snippet with partition pruning on and fetch column stats also on , in 
> this case row estimate a
> {code}
>       Map 5 
>             Map Operator Tree:
>                 TableScan
>                   alias: ss
>                   filterExpr: ss_store_sk is not null (type: boolean)
>                   Statistics: Num rows: 550076554 Data size: 47370018896 
> Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: ss_store_sk is not null (type: boolean)
>                     Statistics: Num rows: 537120379 Data size: 4195767284 
> Basic stats: COMPLETE Column stats: COMPLETE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       condition expressions:
>                         0 {ss_store_sk} {ss_net_profit}
>                         1 
>                       keys:
>                         0 ss_sold_date_sk (type: int)
>                         1 d_date_sk (type: int)
>                       outputColumnNames: _col6, _col21
>                       input vertices:
>                         1 Map 1
>                       Statistics: Num rows: 36524 Data size: 292192 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         condition expressions:
>                           0 {_col21}
>                           1 {s_county} {s_state}
>                         keys:
>                           0 _col6 (type: int)
>                           1 s_store_sk (type: int)
>                         outputColumnNames: _col21, _col80, _col81
>                         input vertices:
>                           1 Map 2
>                         Statistics: Num rows: 45017 Data size: 8508213 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                         Map Join Operator
>                           condition map:
>                                Left Semi Join 0 to 1
>                           condition expressions:
>                             0 {_col21} {_col80} {_col81}
>                             1 
>                           keys:
>                             0 _col81 (type: string)
>                             1 _col0 (type: string)
>                           outputColumnNames: _col21, _col80, _col81
>                           input vertices:
>                             1 Reducer 11
>                           Statistics: Num rows: 49518 Data size: 9359035 
> Basic stats: COMPLETE Column stats: NONE
>                           Select Operator
>                             expressions: _col81 (type: string), _col80 (type: 
> string), _col21 (type: float)
>                             outputColumnNames: _col81, _col80, _col21
>                             Statistics: Num rows: 49518 Data size: 9359035 
> Basic stats: COMPLETE Column stats: NONE
>                             Group By Operator
>                               aggregations: sum(_col21)
>                               keys: _col81 (type: string), _col80 (type: 
> string), '0' (type: string)
>                               mode: hash
>                               outputColumnNames: _col0, _col1, _col2, _col3
>                               Statistics: Num rows: 148554 Data size: 
> 28077105 Basic stats: COMPLETE Column stats: NONE
>                               Reduce Output Operator
>                                 key expressions: _col0 (type: string), _col1 
> (type: string), _col2 (type: string)
>                                 sort order: +++
>                                 Map-reduce partition columns: _col0 (type: 
> string), _col1 (type: string), _col2 (type: string)
>                                 Statistics: Num rows: 148554 Data size: 
> 28077105 Basic stats: COMPLETE Column stats: NONE
>                                 value expressions: _col3 (type: double)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to