Rajesh Balamohan created HIVE-24764: ---------------------------------------
Summary: insert overwrite on a partition resets row count stats in other partitions Key: HIVE-24764 URL: https://issues.apache.org/jira/browse/HIVE-24764 Project: Hive Issue Type: Bug Reporter: Rajesh Balamohan After insert overwrite on a partition, stats on other partitions are messed up. Subsequent queries end up with plans with PARTIAL stats. In certain cases, this leads to suboptimal query plans. {noformat} drop table if exists test_stats; drop table if exists test_stats_2; create table test_stats(i int, j bigint); create table test_stats_2(i int) partitioned by (j bigint); insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL); -- select * from test_stats; 1 1 2 2 3 3 4 4 5 <null> insert overwrite table test_stats_2 partition(j) select i, j from test_stats where j is not null; -- After executing this statement, stat gets messed up. insert overwrite table test_stats_2 partition(j) select i, j from test_stats where j is null; -- select * from test_stats_2; 1 1 2 2 3 3 4 4 5 <null> -- This would return "PARTIAL" stats instead of "COMPLETE" explain select i, count(*) as c from test_stats_2 group by i order by c desc limit 10; STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE) DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 Vertices: Map 1 Map Operator Tree: TableScan alias: test_stats_2 Statistics: Num rows: 125 Data size: 500 Basic stats: PARTIAL Column stats: COMPLETE Select Operator expressions: i (type: int) outputColumnNames: i Statistics: Num rows: 125 Data size: 500 Basic stats: PARTIAL Column stats: COMPLETE Group By Operator aggregations: count() keys: i (type: int) minReductionHashAggr: 0.99 mode: hash outputColumnNames: _col0, _col1 Statistics: Num rows: 125 Data size: 1500 Basic stats: PARTIAL Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: int) null sort order: a sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 125 Data size: 1500 Basic stats: PARTIAL Column stats: COMPLETE value expressions: _col1 (type: bigint) Execution mode: vectorized, llap LLAP IO: may be used (ACID table) Reducer 2 Execution mode: vectorized, llap Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) keys: KEY._col0 (type: int) mode: mergepartial outputColumnNames: _col0, _col1 Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL Column stats: COMPLETE Top N Key Operator sort order: - keys: _col1 (type: bigint) null sort order: a Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL Column stats: COMPLETE top n: 10 Reduce Output Operator key expressions: _col1 (type: bigint) null sort order: a sort order: - Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL Column stats: COMPLETE TopN Hash Memory Usage: 0.04 value expressions: _col0 (type: int) Reducer 3 Execution mode: vectorized, llap Reduce Operator Tree: Select Operator expressions: VALUE._col0 (type: int), KEY.reducesinkkey0 (type: bigint) outputColumnNames: _col0, _col1 Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL Column stats: COMPLETE Limit Number of rows: 10 Statistics: Num rows: 10 Data size: 120 Basic stats: PARTIAL Column stats: COMPLETE File Output Operator compressed: false Statistics: Num rows: 10 Data size: 120 Basic stats: PARTIAL Column stats: COMPLETE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: 10 Processor Tree: ListSink {noformat} https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/fs/FSStatsAggregator.java#L138 FSStats should return "null" or "" if partitions are not present in its list. -- This message was sent by Atlassian Jira (v8.3.4#803005)