[ https://issues.apache.org/jira/browse/HIVE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prasanth J updated HIVE-8196: ----------------------------- Attachment: HIVE-8196.2.patch > Joining on partition columns with fetch column stats enabled results it very > small CE which negatively affects query performance > --------------------------------------------------------------------------------------------------------------------------------- > > Key: HIVE-8196 > URL: https://issues.apache.org/jira/browse/HIVE-8196 > Project: Hive > Issue Type: Bug > Components: Physical Optimizer > Affects Versions: 0.14.0 > Reporter: Mostafa Mokhtar > Assignee: Prasanth J > Priority: Blocker > Labels: performance > Fix For: 0.14.0 > > Attachments: HIVE-8196.1.patch, HIVE-8196.2.patch > > > To make the best out of dynamic partition pruning joins should be on the > partitioning columns which results in dynamically pruning the partitions from > the fact table based on the qualifying column keys from the dimension table, > this type of joins negatively effects on cardinality estimates with fetch > column stats enabled. > Currently we don't have statistics for partition columns and as a result NDV > is set to row count, doing that negatively affects the estimated join > selectivity from the join. > Workaround is to capture statistics for partition columns or use number of > partitions incase dynamic partitioning is used. > In StatsUtils.getColStatisticsFromExpression is where count distincts gets > set to row count > {code} > if (encd.getIsPartitionColOrVirtualCol()) { > // vitual columns > colType = encd.getTypeInfo().getTypeName(); > countDistincts = numRows; > oi = encd.getWritableObjectInspector(); > {code} > Query used to repro the issue : > {code} > set hive.stats.fetch.column.stats=true; > set hive.tez.dynamic.partition.pruning=true; > explain select d_date > from store_sales, date_dim > where > store_sales.ss_sold_date_sk = date_dim.d_date_sk and > date_dim.d_year = 1998; > {code} > Plan > {code} > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Tez > Edges: > Map 1 <- Map 2 (BROADCAST_EDGE) > DagName: mmokhtar_20140919180404_945d29f5-d041-4420-9666-1c5d64fa6540:8 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: store_sales > filterExpr: ss_sold_date_sk is not null (type: boolean) > Statistics: Num rows: 550076554 Data size: 47370018816 > Basic stats: COMPLETE Column stats: COMPLETE > Map Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {ss_sold_date_sk} > 1 {d_date_sk} {d_date} > keys: > 0 ss_sold_date_sk (type: int) > 1 d_date_sk (type: int) > outputColumnNames: _col22, _col26, _col28 > input vertices: > 1 Map 2 > Statistics: Num rows: 652 Data size: 66504 Basic stats: > COMPLETE Column stats: COMPLETE > Filter Operator > predicate: (_col22 = _col26) (type: boolean) > Statistics: Num rows: 326 Data size: 33252 Basic stats: > COMPLETE Column stats: COMPLETE > Select Operator > expressions: _col28 (type: string) > outputColumnNames: _col0 > Statistics: Num rows: 326 Data size: 30644 Basic > stats: COMPLETE Column stats: COMPLETE > File Output Operator > compressed: false > Statistics: Num rows: 326 Data size: 30644 Basic > stats: COMPLETE Column stats: COMPLETE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Execution mode: vectorized > Map 2 > Map Operator Tree: > TableScan > alias: date_dim > filterExpr: (d_date_sk is not null and (d_year = 1998)) > (type: boolean) > Statistics: Num rows: 73049 Data size: 81741831 Basic > stats: COMPLETE Column stats: COMPLETE > Filter Operator > predicate: (d_date_sk is not null and (d_year = 1998)) > (type: boolean) > Statistics: Num rows: 652 Data size: 66504 Basic stats: > COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: d_date_sk (type: int) > sort order: + > Map-reduce partition columns: d_date_sk (type: int) > Statistics: Num rows: 652 Data size: 66504 Basic stats: > COMPLETE Column stats: COMPLETE > value expressions: d_date (type: string) > Select Operator > expressions: d_date_sk (type: int) > outputColumnNames: _col0 > Statistics: Num rows: 652 Data size: 0 Basic stats: > PARTIAL Column stats: COMPLETE > Group By Operator > keys: _col0 (type: int) > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 652 Data size: 0 Basic stats: > PARTIAL Column stats: COMPLETE > Dynamic Partitioning Event Operator > Target Input: store_sales > Partition key expr: ss_sold_date_sk > Statistics: Num rows: 652 Data size: 0 Basic stats: > PARTIAL Column stats: COMPLETE > Target column: ss_sold_date_sk > Target Vertex: Map 1 > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)