Mostafa Mokhtar created HIVE-8517:
-------------------------------------
Summary: When joining on partition column NDV gets overridden by
StatsUtils.getColStatisticsFromExpression
Key: HIVE-8517
URL: https://issues.apache.org/jira/browse/HIVE-8517
Project: Hive
Issue Type: Bug
Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Fix For: 0.14.0
When joining on partition column number of partitions is used as NDV which gets
overridden by StatsUtils.getColStatisticsFromExpression and the number of
partitions used as NDV is replaced by number of rows which results in the same
behavior as explained in https://issues.apache.org/jira/browse/HIVE-8196.
"Joining on partition columns with fetch column stats enabled results it very
small CE which negatively affects query performance "
This is the call stack.
{code}
StatsUtils.getColStatisticsFromExpression(HiveConf, Statistics, ExprNodeDesc)
line: 1001
StatsRulesProcFactory$ReduceSinkStatsRule.process(Node, Stack<Node>,
NodeProcessorCtx, Object...) line: 1479
DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90
PreOrderWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line:
94
PreOrderWalker(DefaultGraphWalker).dispatch(Node, Stack<Node>) line: 78
PreOrderWalker.walk(Node) line: 54
PreOrderWalker.walk(Node) line: 59
PreOrderWalker.walk(Node) line: 59
PreOrderWalker(DefaultGraphWalker).startWalking(Collection<Node>,
HashMap<Node,Object>) line: 109
AnnotateWithStatistics.transform(ParseContext) line: 78
TezCompiler.runStatsAnnotation(OptimizeTezProcContext) line: 248
TezCompiler.optimizeOperatorPlan(ParseContext, Set<ReadEntity>,
Set<WriteEntity>) line: 120
TezCompiler(TaskCompiler).compile(ParseContext, List<Task<Serializable>>,
HashSet<ReadEntity>, HashSet<WriteEntity>) line: 99
SemanticAnalyzer.analyzeInternal(ASTNode) line: 10037
SemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221
ExplainSemanticAnalyzer.analyzeInternal(ASTNode) line: 74
ExplainSemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line:
221
Driver.compile(String, boolean) line: 415
{code}
Query
{code}
select
ss_item_sk item_sk, d_date, sum(ss_sales_price),
sum(sum(ss_sales_price))
over (partition by ss_item_sk order by d_date rows between unbounded
preceding and current row) cume_sales
from store_sales
,date_dim
where ss_sold_date_sk=d_date_sk
and d_month_seq between 1193 and 1193+11
and ss_item_sk is not NULL
group by ss_item_sk, d_date
{code}
Plan
Notice in the Map join operator the number of rows drop from 82,510,879,939 to
36524 after the join.
{code}
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Map 1 <- Map 4 (BROADCAST_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
DagName: mmokhtar_20141019131818_086d663a-5621-456c-bf25-8ccb7112ee3b:6
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: store_sales
filterExpr: ss_item_sk is not null (type: boolean)
Statistics: Num rows: 82510879939 Data size: 6873789738208
Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: ss_item_sk is not null (type: boolean)
Statistics: Num rows: 82510879939 Data size: 652315818272
Basic stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {ss_item_sk} {ss_sales_price} {ss_sold_date_sk}
1 {d_date_sk} {d_date} {d_month_seq}
keys:
0 ss_sold_date_sk (type: int)
1 d_date_sk (type: int)
outputColumnNames: _col1, _col12, _col22, _col26, _col28,
_col29
input vertices:
1 Map 4
Statistics: Num rows: 36524 Data size: 4163736 Basic
stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: (((_col22 = _col26) and _col29 BETWEEN 1193
AND 1204) and _col1 is not null) (type: boolean)
Statistics: Num rows: 9131 Data size: 1040934 Basic
stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col1 (type: int), _col28 (type:
string), _col12 (type: float)
outputColumnNames: _col1, _col28, _col12
Statistics: Num rows: 9131 Data size: 1040934 Basic
stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: sum(_col12)
keys: _col1 (type: int), _col28 (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 4565 Data size: 483890 Basic
stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: int), _col1 (type:
string)
sort order: ++
Map-reduce partition columns: _col0 (type: int),
_col1 (type: string)
Statistics: Num rows: 4565 Data size: 483890
Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col2 (type: double)
Execution mode: vectorized
Map 4
Map Operator Tree:
TableScan
alias: date_dim
filterExpr: (d_date_sk is not null and d_month_seq BETWEEN
1193 AND 1204) (type: boolean)
Statistics: Num rows: 73049 Data size: 81741831 Basic stats:
COMPLETE Column stats: COMPLETE
Filter Operator
predicate: (d_date_sk is not null and d_month_seq BETWEEN
1193 AND 1204) (type: boolean)
Statistics: Num rows: 36524 Data size: 3725448 Basic stats:
COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: d_date_sk (type: int)
sort order: +
Map-reduce partition columns: d_date_sk (type: int)
Statistics: Num rows: 36524 Data size: 3725448 Basic
stats: COMPLETE Column stats: COMPLETE
value expressions: d_date (type: string), d_month_seq
(type: int)
Select Operator
expressions: d_date_sk (type: int)
outputColumnNames: _col0
Statistics: Num rows: 36524 Data size: 0 Basic stats:
PARTIAL Column stats: COMPLETE
Group By Operator
keys: _col0 (type: int)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 36524 Data size: 0 Basic stats:
PARTIAL Column stats: COMPLETE
Dynamic Partitioning Event Operator
Target Input: store_sales
Partition key expr: ss_sold_date_sk
Statistics: Num rows: 36524 Data size: 0 Basic stats:
PARTIAL Column stats: COMPLETE
Target column: ss_sold_date_sk
Target Vertex: Map 1
Execution mode: vectorized
Reducer 2
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: int), KEY._col1 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 4565 Data size: 502150 Basic stats:
COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: int), _col1 (type: string)
sort order: ++
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 4565 Data size: 502150 Basic stats:
COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: int), _col1 (type: string),
_col2 (type: double)
Execution mode: vectorized
Reducer 3
Reduce Operator Tree:
Extract
Statistics: Num rows: 4565 Data size: 502150 Basic stats:
COMPLETE Column stats: COMPLETE
PTF Operator
Statistics: Num rows: 4565 Data size: 502150 Basic stats:
COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col2
(type: double), _wcol0 (type: double)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 4565 Data size: 36520 Basic stats:
COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 4565 Data size: 36520 Basic stats:
COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)