[ https://issues.apache.org/jira/browse/HIVE-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14178982#comment-14178982 ]
Vikram Dixit K commented on HIVE-8168: -------------------------------------- +1 for 0.14 > With dynamic partition enabled fact table selectivity is not taken into > account when generating the physical plan (Use CBO cardinality using physical > plan generation) > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: HIVE-8168 > URL: https://issues.apache.org/jira/browse/HIVE-8168 > Project: Hive > Issue Type: Bug > Components: Tez > Affects Versions: 0.14.0 > Reporter: Mostafa Mokhtar > Assignee: Prasanth J > Priority: Critical > Labels: performance > Fix For: vectorization-branch, 0.14.0 > > Attachments: HIVE-8168.1.patch, HIVE-8168.2.patch, HIVE-8168.3.patch, > HIVE-8168.4.patch, HIVE-8168.5.patch > > > When calculating estimate row counts & data size during physical plan > generation in StatsRulesProcFactory doesn't know that there will be dynamic > partition pruning and it is hard to know how many partitions will qualify at > runtime, as a result with Dynamic partition pruning enabled a query 32 can > run with 570 compared to 70 tasks with dynamic partition pruning disabled and > actual partition filters on the fact table. > The long term solution for this issue is to use the cardinality estimates > from CBO as it takes into account join selectivity and such, estimate from > CBO won't address the number of the tasks used for the partitioned table but > they will address the incorrect number of tasks used for the concequent > reducers where the majority of the slowdown is coming from. > Plan dynamic partition pruning on > {code} > Map 5 > Map Operator Tree: > TableScan > alias: ss > filterExpr: ss_store_sk is not null (type: boolean) > Statistics: Num rows: 550076554 Data size: 47370018896 > Basic stats: COMPLETE Column stats: NONE > Filter Operator > predicate: ss_store_sk is not null (type: boolean) > Statistics: Num rows: 275038277 Data size: 23685009448 > Basic stats: COMPLETE Column stats: NONE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {ss_store_sk} {ss_net_profit} > 1 > keys: > 0 ss_sold_date_sk (type: int) > 1 d_date_sk (type: int) > outputColumnNames: _col6, _col21 > input vertices: > 1 Map 1 > Statistics: Num rows: 302542112 Data size: 26053511168 > Basic stats: COMPLETE Column stats: NONE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {_col21} > 1 {s_county} {s_state} > keys: > 0 _col6 (type: int) > 1 s_store_sk (type: int) > outputColumnNames: _col21, _col80, _col81 > input vertices: > 1 Map 2 > Statistics: Num rows: 332796320 Data size: > 28658862080 Basic stats: COMPLETE Column stats: NONE > Map Join Operator > condition map: > Left Semi Join 0 to 1 > condition expressions: > 0 {_col21} {_col80} {_col81} > 1 > keys: > 0 _col81 (type: string) > 1 _col0 (type: string) > outputColumnNames: _col21, _col80, _col81 > input vertices: > 1 Reducer 11 > Statistics: Num rows: 366075968 Data size: > 31524749312 Basic stats: COMPLETE Column stats: NONE > Select Operator > expressions: _col81 (type: string), _col80 (type: > string), _col21 (type: float) > outputColumnNames: _col81, _col80, _col21 > Statistics: Num rows: 366075968 Data size: > 31524749312 Basic stats: COMPLETE Column stats: NONE > Group By Operator > aggregations: sum(_col21) > keys: _col81 (type: string), _col80 (type: > string), '0' (type: string) > mode: hash > outputColumnNames: _col0, _col1, _col2, _col3 > Statistics: Num rows: 1098227904 Data size: > 94574247936 Basic stats: COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: _col0 (type: string), _col1 > (type: string), _col2 (type: string) > sort order: +++ > Map-reduce partition columns: _col0 (type: > string), _col1 (type: string), _col2 (type: string) > Statistics: Num rows: 1098227904 Data size: > 94574247936 Basic stats: COMPLETE Column stats: NONE > value expressions: _col3 (type: double) > {code} > Plan snippet with partition pruning off and explicit partition filters > {code} > Map 5 > Map Operator Tree: > TableScan > alias: ss > filterExpr: ((ss_sold_date_sk is not null and ss_store_sk > is not null) and ss_sold_date BETWEEN '1999-06-01' AND '2000-05-31') (type: > boolean) > Statistics: Num rows: 110339135 Data size: 4817453454 Basic > stats: COMPLETE Column stats: NONE > Filter Operator > predicate: (ss_sold_date_sk is not null and ss_store_sk > is not null) (type: boolean) > Statistics: Num rows: 27584784 Data size: 1204363374 > Basic stats: COMPLETE Column stats: NONE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {ss_store_sk} {ss_net_profit} > 1 > keys: > 0 ss_sold_date_sk (type: int) > 1 d_date_sk (type: int) > outputColumnNames: _col7, _col22 > input vertices: > 1 Map 1 > Statistics: Num rows: 30343264 Data size: 1324799744 > Basic stats: COMPLETE Column stats: NONE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {_col22} > 1 {s_county} {s_state} > keys: > 0 _col7 (type: int) > 1 s_store_sk (type: int) > outputColumnNames: _col22, _col81, _col82 > input vertices: > 1 Map 2 > Statistics: Num rows: 33377592 Data size: 1457279744 > Basic stats: COMPLETE Column stats: NONE > Map Join Operator > condition map: > Left Semi Join 0 to 1 > condition expressions: > 0 {_col22} {_col81} {_col82} > 1 > keys: > 0 _col82 (type: string) > 1 _col0 (type: string) > outputColumnNames: _col22, _col81, _col82 > input vertices: > 1 Reducer 11 > Statistics: Num rows: 36715352 Data size: > 1603007744 Basic stats: COMPLETE Column stats: NONE > Select Operator > expressions: _col82 (type: string), _col81 (type: > string), _col22 (type: float) > outputColumnNames: _col82, _col81, _col22 > Statistics: Num rows: 36715352 Data size: > 1603007744 Basic stats: COMPLETE Column stats: NONE > Group By Operator > aggregations: sum(_col22) > keys: _col82 (type: string), _col81 (type: > string), '0' (type: string) > mode: hash > outputColumnNames: _col0, _col1, _col2, _col3 > Statistics: Num rows: 36715352 Data size: > 1603007744 Basic stats: COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: _col0 (type: string), _col1 > (type: string), _col2 (type: string) > sort order: +++ > Map-reduce partition columns: _col0 (type: > string), _col1 (type: string), _col2 (type: string) > Statistics: Num rows: 36715352 Data size: > 1603007744 Basic stats: COMPLETE Column stats: NONE > value expressions: _col3 (type: double) > {code} > Plan snippet with partition pruning on and fetch column stats also on , in > this case row estimate a > {code} > Map 5 > Map Operator Tree: > TableScan > alias: ss > filterExpr: ss_store_sk is not null (type: boolean) > Statistics: Num rows: 550076554 Data size: 47370018896 > Basic stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: ss_store_sk is not null (type: boolean) > Statistics: Num rows: 537120379 Data size: 4195767284 > Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {ss_store_sk} {ss_net_profit} > 1 > keys: > 0 ss_sold_date_sk (type: int) > 1 d_date_sk (type: int) > outputColumnNames: _col6, _col21 > input vertices: > 1 Map 1 > Statistics: Num rows: 36524 Data size: 292192 Basic > stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {_col21} > 1 {s_county} {s_state} > keys: > 0 _col6 (type: int) > 1 s_store_sk (type: int) > outputColumnNames: _col21, _col80, _col81 > input vertices: > 1 Map 2 > Statistics: Num rows: 45017 Data size: 8508213 Basic > stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Left Semi Join 0 to 1 > condition expressions: > 0 {_col21} {_col80} {_col81} > 1 > keys: > 0 _col81 (type: string) > 1 _col0 (type: string) > outputColumnNames: _col21, _col80, _col81 > input vertices: > 1 Reducer 11 > Statistics: Num rows: 49518 Data size: 9359035 > Basic stats: COMPLETE Column stats: NONE > Select Operator > expressions: _col81 (type: string), _col80 (type: > string), _col21 (type: float) > outputColumnNames: _col81, _col80, _col21 > Statistics: Num rows: 49518 Data size: 9359035 > Basic stats: COMPLETE Column stats: NONE > Group By Operator > aggregations: sum(_col21) > keys: _col81 (type: string), _col80 (type: > string), '0' (type: string) > mode: hash > outputColumnNames: _col0, _col1, _col2, _col3 > Statistics: Num rows: 148554 Data size: > 28077105 Basic stats: COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: _col0 (type: string), _col1 > (type: string), _col2 (type: string) > sort order: +++ > Map-reduce partition columns: _col0 (type: > string), _col1 (type: string), _col2 (type: string) > Statistics: Num rows: 148554 Data size: > 28077105 Basic stats: COMPLETE Column stats: NONE > value expressions: _col3 (type: double) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)