Mostafa Mokhtar created HIVE-8517:
-------------------------------------

             Summary: When joining on partition column NDV gets overridden by 
StatsUtils.getColStatisticsFromExpression
                 Key: HIVE-8517
                 URL: https://issues.apache.org/jira/browse/HIVE-8517
             Project: Hive
          Issue Type: Bug
          Components: Physical Optimizer
    Affects Versions: 0.14.0
            Reporter: Mostafa Mokhtar
             Fix For: 0.14.0


When joining on partition column number of partitions is used as NDV which gets 
overridden by StatsUtils.getColStatisticsFromExpression and the number of 
partitions used as NDV is replaced by number of rows which results in the same 
behavior as explained in https://issues.apache.org/jira/browse/HIVE-8196. 
"Joining on partition columns with fetch column stats enabled results it very 
small CE which negatively affects query performance "

This is the call stack.
{code}
StatsUtils.getColStatisticsFromExpression(HiveConf, Statistics, ExprNodeDesc) 
line: 1001        
StatsRulesProcFactory$ReduceSinkStatsRule.process(Node, Stack<Node>, 
NodeProcessorCtx, Object...) line: 1479    
DefaultRuleDispatcher.dispatch(Node, Stack<Node>, Object...) line: 90   
PreOrderWalker(DefaultGraphWalker).dispatchAndReturn(Node, Stack<Node>) line: 
94        
PreOrderWalker(DefaultGraphWalker).dispatch(Node, Stack<Node>) line: 78 
PreOrderWalker.walk(Node) line: 54      
PreOrderWalker.walk(Node) line: 59      
PreOrderWalker.walk(Node) line: 59      
PreOrderWalker(DefaultGraphWalker).startWalking(Collection<Node>, 
HashMap<Node,Object>) line: 109       
AnnotateWithStatistics.transform(ParseContext) line: 78 
TezCompiler.runStatsAnnotation(OptimizeTezProcContext) line: 248        
TezCompiler.optimizeOperatorPlan(ParseContext, Set<ReadEntity>, 
Set<WriteEntity>) line: 120     
TezCompiler(TaskCompiler).compile(ParseContext, List<Task<Serializable>>, 
HashSet<ReadEntity>, HashSet<WriteEntity>) line: 99   
SemanticAnalyzer.analyzeInternal(ASTNode) line: 10037   
SemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 221      
ExplainSemanticAnalyzer.analyzeInternal(ASTNode) line: 74       
ExplainSemanticAnalyzer(BaseSemanticAnalyzer).analyze(ASTNode, Context) line: 
221       
Driver.compile(String, boolean) line: 415       
{code}

Query
{code}

select
  ss_item_sk item_sk, d_date, sum(ss_sales_price),
  sum(sum(ss_sales_price))
      over (partition by ss_item_sk order by d_date rows between unbounded 
preceding and current row) cume_sales
from store_sales
    ,date_dim
where ss_sold_date_sk=d_date_sk
  and d_month_seq between 1193 and 1193+11
  and ss_item_sk is not NULL
group by ss_item_sk, d_date
{code}

Plan 
Notice in the Map join operator the number of rows drop from 82,510,879,939 to 
36524 after the join.
{code}
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Map 1 <- Map 4 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
      DagName: mmokhtar_20141019131818_086d663a-5621-456c-bf25-8ccb7112ee3b:6
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ss_item_sk is not null (type: boolean)
                  Statistics: Num rows: 82510879939 Data size: 6873789738208 
Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ss_item_sk is not null (type: boolean)
                    Statistics: Num rows: 82510879939 Data size: 652315818272 
Basic stats: COMPLETE Column stats: COMPLETE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {ss_item_sk} {ss_sales_price} {ss_sold_date_sk}
                        1 {d_date_sk} {d_date} {d_month_seq}
                      keys:
                        0 ss_sold_date_sk (type: int)
                        1 d_date_sk (type: int)
                      outputColumnNames: _col1, _col12, _col22, _col26, _col28, 
_col29
                      input vertices:
                        1 Map 4
                      Statistics: Num rows: 36524 Data size: 4163736 Basic 
stats: COMPLETE Column stats: COMPLETE
                      Filter Operator
                        predicate: (((_col22 = _col26) and _col29 BETWEEN 1193 
AND 1204) and _col1 is not null) (type: boolean)
                        Statistics: Num rows: 9131 Data size: 1040934 Basic 
stats: COMPLETE Column stats: COMPLETE
                        Select Operator
                          expressions: _col1 (type: int), _col28 (type: 
string), _col12 (type: float)
                          outputColumnNames: _col1, _col28, _col12
                          Statistics: Num rows: 9131 Data size: 1040934 Basic 
stats: COMPLETE Column stats: COMPLETE
                          Group By Operator
                            aggregations: sum(_col12)
                            keys: _col1 (type: int), _col28 (type: string)
                            mode: hash
                            outputColumnNames: _col0, _col1, _col2
                            Statistics: Num rows: 4565 Data size: 483890 Basic 
stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col0 (type: int), _col1 (type: 
string)
                              sort order: ++
                              Map-reduce partition columns: _col0 (type: int), 
_col1 (type: string)
                              Statistics: Num rows: 4565 Data size: 483890 
Basic stats: COMPLETE Column stats: COMPLETE
                              value expressions: _col2 (type: double)
            Execution mode: vectorized
        Map 4
            Map Operator Tree:
                TableScan
                  alias: date_dim
                  filterExpr: (d_date_sk is not null and d_month_seq BETWEEN 
1193 AND 1204) (type: boolean)
                  Statistics: Num rows: 73049 Data size: 81741831 Basic stats: 
COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (d_date_sk is not null and d_month_seq BETWEEN 
1193 AND 1204) (type: boolean)
                    Statistics: Num rows: 36524 Data size: 3725448 Basic stats: 
COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      key expressions: d_date_sk (type: int)
                      sort order: +
                      Map-reduce partition columns: d_date_sk (type: int)
                      Statistics: Num rows: 36524 Data size: 3725448 Basic 
stats: COMPLETE Column stats: COMPLETE
                      value expressions: d_date (type: string), d_month_seq 
(type: int)
                    Select Operator
                      expressions: d_date_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 36524 Data size: 0 Basic stats: 
PARTIAL Column stats: COMPLETE
                      Group By Operator
                        keys: _col0 (type: int)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 36524 Data size: 0 Basic stats: 
PARTIAL Column stats: COMPLETE
                        Dynamic Partitioning Event Operator
                          Target Input: store_sales
                          Partition key expr: ss_sold_date_sk
                          Statistics: Num rows: 36524 Data size: 0 Basic stats: 
PARTIAL Column stats: COMPLETE
                          Target column: ss_sold_date_sk
                          Target Vertex: Map 1
            Execution mode: vectorized
        Reducer 2
            Reduce Operator Tree:
              Group By Operator
                aggregations: sum(VALUE._col0)
                keys: KEY._col0 (type: int), KEY._col1 (type: string)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 4565 Data size: 502150 Basic stats: 
COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  key expressions: _col0 (type: int), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 4565 Data size: 502150 Basic stats: 
COMPLETE Column stats: COMPLETE
                  value expressions: _col0 (type: int), _col1 (type: string), 
_col2 (type: double)
            Execution mode: vectorized
        Reducer 3
            Reduce Operator Tree:
              Extract
                Statistics: Num rows: 4565 Data size: 502150 Basic stats: 
COMPLETE Column stats: COMPLETE
                PTF Operator
                  Statistics: Num rows: 4565 Data size: 502150 Basic stats: 
COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: _col0 (type: int), _col1 (type: string), _col2 
(type: double), _wcol0 (type: double)
                    outputColumnNames: _col0, _col1, _col2, _col3
                    Statistics: Num rows: 4565 Data size: 36520 Basic stats: 
COMPLETE Column stats: COMPLETE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 4565 Data size: 36520 Basic stats: 
COMPLETE Column stats: COMPLETE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to