Rajesh Balamohan created HIVE-27084:
---------------------------------------

             Summary: Iceberg: Stats are not populated correctly during query 
compilation
                 Key: HIVE-27084
                 URL: https://issues.apache.org/jira/browse/HIVE-27084
             Project: Hive
          Issue Type: Improvement
          Components: Iceberg integration
            Reporter: Rajesh Balamohan


- Table stats are not properly used/computed during query compilation phase.
 - Here is an example. Check out the query with the filter which give more data 
than the regular query

This is just an example, real world queries can have bad query plans due to this

{{10470974584 with filter, vs 303658262936 without filter}}

{noformat}
explain select count(*) from store_sales where ss_sold_date_sk=2450822 and 
ss_wholesale_cost > 0.0

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48
      Edges:
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
      DagName: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ((ss_sold_date_sk = 2450822) and 
(ss_wholesale_cost > 0)) (type: boolean)
                  Statistics: Num rows: 2755519629 Data size: 303658262936 
Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((ss_sold_date_sk = 2450822) and 
(ss_wholesale_cost > 0)) (type: boolean)
                    Statistics: Num rows: 5 Data size: 550 Basic stats: 
COMPLETE Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 5 Data size: 550 Basic stats: 
COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count()
                        minReductionHashAggr: 0.99
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 124 Basic stats: 
COMPLETE Column stats: NONE
                        Reduce Output Operator
                          null sort order:
                          sort order:
                          Statistics: Num rows: 1 Data size: 124 Basic stats: 
COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
            Execution mode: vectorized, llap
            LLAP IO: all inputs (cache only)
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE 
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE 
Column stats: NONE
                  table:
                      input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

58 rows selected (0.73 seconds)



explain select count(*) from store_sales where ss_sold_date_sk=2450822
INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
INFO  : Completed executing 
command(queryId=hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7); Time 
taken: 0.061 seconds
INFO  : OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49
      Edges:
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
      DagName: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: (ss_sold_date_sk = 2450822) (type: boolean)
                  Statistics: Num rows: 2755519629 Data size: 10470974584 Basic 
stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: (ss_sold_date_sk = 2450822) (type: boolean)
                    Statistics: Num rows: 5 Data size: 18 Basic stats: COMPLETE 
Column stats: NONE
                    Select Operator
                      Statistics: Num rows: 5 Data size: 18 Basic stats: 
COMPLETE Column stats: NONE
                      Group By Operator
                        aggregations: count()
                        minReductionHashAggr: 0.99
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 12 Basic stats: 
COMPLETE Column stats: NONE
                        Reduce Output Operator
                          null sort order:
                          sort order:
                          Statistics: Num rows: 1 Data size: 12 Basic stats: 
COMPLETE Column stats: NONE
                          value expressions: _col0 (type: bigint)
            Execution mode: vectorized, llap
            LLAP IO: all inputs (cache only)
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE 
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE 
Column stats: NONE
                  table:
                      input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

58 rows selected (0.529 seconds)

{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to