Rajesh Balamohan created HIVE-27084: ---------------------------------------
Summary: Iceberg: Stats are not populated correctly during query compilation Key: HIVE-27084 URL: https://issues.apache.org/jira/browse/HIVE-27084 Project: Hive Issue Type: Improvement Components: Iceberg integration Reporter: Rajesh Balamohan - Table stats are not properly used/computed during query compilation phase. - Here is an example. Check out the query with the filter which give more data than the regular query This is just an example, real world queries can have bad query plans due to this {{10470974584 with filter, vs 303658262936 without filter}} {noformat} explain select count(*) from store_sales where ss_sold_date_sk=2450822 and ss_wholesale_cost > 0.0 Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48 Edges: Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) DagName: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: ((ss_sold_date_sk = 2450822) and (ss_wholesale_cost > 0)) (type: boolean) Statistics: Num rows: 2755519629 Data size: 303658262936 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((ss_sold_date_sk = 2450822) and (ss_wholesale_cost > 0)) (type: boolean) Statistics: Num rows: 5 Data size: 550 Basic stats: COMPLETE Column stats: NONE Select Operator Statistics: Num rows: 5 Data size: 550 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() minReductionHashAggr: 0.99 mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Execution mode: vectorized, llap LLAP IO: all inputs (cache only) Reducer 2 Execution mode: vectorized, llap Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink 58 rows selected (0.73 seconds) explain select count(*) from store_sales where ss_sold_date_sk=2450822 INFO : Starting task [Stage-3:EXPLAIN] in serial mode INFO : Completed executing command(queryId=hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7); Time taken: 0.061 seconds INFO : OK Explain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49 Edges: Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) DagName: hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7:49 Vertices: Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: (ss_sold_date_sk = 2450822) (type: boolean) Statistics: Num rows: 2755519629 Data size: 10470974584 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (ss_sold_date_sk = 2450822) (type: boolean) Statistics: Num rows: 5 Data size: 18 Basic stats: COMPLETE Column stats: NONE Select Operator Statistics: Num rows: 5 Data size: 18 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() minReductionHashAggr: 0.99 mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator null sort order: sort order: Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Execution mode: vectorized, llap LLAP IO: all inputs (cache only) Reducer 2 Execution mode: vectorized, llap Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 12 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink 58 rows selected (0.529 seconds) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)