Mostafa Mokhtar created HIVE-8517: ------------------------------------- Summary: When joining on partition column NDV gets overridden by StatsUtils.getColStatisticsFromExpression Key: HIVE-8517 URL: https://issues.apache.org/jira/browse/HIVE-8517 Project: Hive Issue Type: Bug Components: Physical Optimizer Affects Versions: 0.14.0 Reporter: Mostafa Mokhtar Fix For: 0.14.0
When joining on partition column number of partitions is used as NDV which gets overridden by StatsUtils.getColStatisticsFromExpression and the number of partitions used as NDV is replaced by number of rows which results in the same behavior as explained in https://issues.apache.org/jira/browse/HIVE-8196. "Joining on partition columns with fetch column stats enabled results it very small CE which negatively affects query performance " This is the call stack. {code} StatsUtils.getColStatisticsFromExpression(HiveConf, Statistics, ExprNodeDesc) line: 1001 StatsRulesProcFactory$ReduceSinkStatsRule.process(Node, Stack<Node>, NodeProcessorCtx, Object...) line: 1479 DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90 PreOrderWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line: 94 PreOrderWalker(DefaultGraphWalker).dispatch(Node, Stack<Node>) line: 78 PreOrderWalker.walk(Node) line: 54 PreOrderWalker.walk(Node) line: 59 PreOrderWalker.walk(Node) line: 59 PreOrderWalker(DefaultGraphWalker).startWalking(Collection<Node>, HashMap<Node,Object>) line: 109 AnnotateWithStatistics.transform(ParseContext) line: 78 TezCompiler.runStatsAnnotation(OptimizeTezProcContext) line: 248 TezCompiler.optimizeOperatorPlan(ParseContext, Set<ReadEntity>, Set<WriteEntity>) line: 120 TezCompiler(TaskCompiler).compile(ParseContext, List<Task<Serializable>>, HashSet<ReadEntity>, HashSet<WriteEntity>) line: 99 SemanticAnalyzer.analyzeInternal(ASTNode) line: 10037 SemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221 ExplainSemanticAnalyzer.analyzeInternal(ASTNode) line: 74 ExplainSemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221 Driver.compile(String, boolean) line: 415 {code} Query {code} select ss_item_sk item_sk, d_date, sum(ss_sales_price), sum(sum(ss_sales_price)) over (partition by ss_item_sk order by d_date rows between unbounded preceding and current row) cume_sales from store_sales ,date_dim where ss_sold_date_sk=d_date_sk and d_month_seq between 1193 and 1193+11 and ss_item_sk is not NULL group by ss_item_sk, d_date {code} Plan Notice in the Map join operator the number of rows drop from 82,510,879,939 to 36524 after the join. {code} OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez Edges: Map 1 <- Map 4 (BROADCAST_EDGE) Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE) DagName: mmokhtar_20141019131818_086d663a-5621-456c-bf25-8ccb7112ee3b:6 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: ss_item_sk is not null (type: boolean) Statistics: Num rows: 82510879939 Data size: 6873789738208 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: ss_item_sk is not null (type: boolean) Statistics: Num rows: 82510879939 Data size: 652315818272 Basic stats: COMPLETE Column stats: COMPLETE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {ss_item_sk} {ss_sales_price} {ss_sold_date_sk} 1 {d_date_sk} {d_date} {d_month_seq} keys: 0 ss_sold_date_sk (type: int) 1 d_date_sk (type: int) outputColumnNames: _col1, _col12, _col22, _col26, _col28, _col29 input vertices: 1 Map 4 Statistics: Num rows: 36524 Data size: 4163736 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: (((_col22 = _col26) and _col29 BETWEEN 1193 AND 1204) and _col1 is not null) (type: boolean) Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col1 (type: int), _col28 (type: string), _col12 (type: float) outputColumnNames: _col1, _col28, _col12 Statistics: Num rows: 9131 Data size: 1040934 Basic stats: COMPLETE Column stats: COMPLETE Group By Operator aggregations: sum(_col12) keys: _col1 (type: int), _col28 (type: string) mode: hash outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int), _col1 (type: string) sort order: ++ Map-reduce partition columns: _col0 (type: int), _col1 (type: string) Statistics: Num rows: 4565 Data size: 483890 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col2 (type: double) Execution mode: vectorized Map 4 Map Operator Tree: TableScan alias: date_dim filterExpr: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204) (type: boolean) Statistics: Num rows: 73049 Data size: 81741831 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: (d_date_sk is not null and d_month_seq BETWEEN 1193 AND 1204) (type: boolean) Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: d_date_sk (type: int) sort order: + Map-reduce partition columns: d_date_sk (type: int) Statistics: Num rows: 36524 Data size: 3725448 Basic stats: COMPLETE Column stats: COMPLETE value expressions: d_date (type: string), d_month_seq (type: int) Select Operator expressions: d_date_sk (type: int) outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Group By Operator keys: _col0 (type: int) mode: hash outputColumnNames: _col0 Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Dynamic Partitioning Event Operator Target Input: store_sales Partition key expr: ss_sold_date_sk Statistics: Num rows: 36524 Data size: 0 Basic stats: PARTIAL Column stats: COMPLETE Target column: ss_sold_date_sk Target Vertex: Map 1 Execution mode: vectorized Reducer 2 Reduce Operator Tree: Group By Operator aggregations: sum(VALUE._col0) keys: KEY._col0 (type: int), KEY._col1 (type: string) mode: mergepartial outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int), _col1 (type: string) sort order: ++ Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE value expressions: _col0 (type: int), _col1 (type: string), _col2 (type: double) Execution mode: vectorized Reducer 3 Reduce Operator Tree: Extract Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE PTF Operator Statistics: Num rows: 4565 Data size: 502150 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: _col0 (type: int), _col1 (type: string), _col2 (type: double), _wcol0 (type: double) outputColumnNames: _col0, _col1, _col2, _col3 Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 4565 Data size: 36520 Basic stats: COMPLETE Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)