[ https://issues.apache.org/jira/browse/HIVE-24764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rajesh Balamohan reassigned HIVE-24764: --------------------------------------- Assignee: Rajesh Balamohan > insert overwrite on a partition resets row count stats in other partitions > -------------------------------------------------------------------------- > > Key: HIVE-24764 > URL: https://issues.apache.org/jira/browse/HIVE-24764 > Project: Hive > Issue Type: Bug > Reporter: Rajesh Balamohan > Assignee: Rajesh Balamohan > Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > After insert overwrite on a partition, stats on other partitions are messed > up. Subsequent queries end up with plans with PARTIAL stats. In certain > cases, this leads to suboptimal query plans. > {noformat} > drop table if exists test_stats; > drop table if exists test_stats_2; > create table test_stats(i int, j bigint); > create table test_stats_2(i int) partitioned by (j bigint); > insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL); > -- select * from test_stats; > 1 1 > 2 2 > 3 3 > 4 4 > 5 <null> > insert overwrite table test_stats_2 partition(j) select i, j from test_stats > where j is not null; > -- After executing this statement, stat gets messed up. > insert overwrite table test_stats_2 partition(j) select i, j from test_stats > where j is null; > -- select * from test_stats_2; > 1 1 > 2 2 > 3 3 > 4 4 > 5 <null> > -- This would return "PARTIAL" stats instead of "COMPLETE" > explain select i, count(*) as c from test_stats_2 group by i order by c desc > limit 10; > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Tez > DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 > Edges: > Reducer 2 <- Map 1 (SIMPLE_EDGE) > Reducer 3 <- Reducer 2 (SIMPLE_EDGE) > DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: test_stats_2 > Statistics: Num rows: 125 Data size: 500 Basic stats: > PARTIAL Column stats: COMPLETE > Select Operator > expressions: i (type: int) > outputColumnNames: i > Statistics: Num rows: 125 Data size: 500 Basic stats: > PARTIAL Column stats: COMPLETE > Group By Operator > aggregations: count() > keys: i (type: int) > minReductionHashAggr: 0.99 > mode: hash > outputColumnNames: _col0, _col1 > Statistics: Num rows: 125 Data size: 1500 Basic stats: > PARTIAL Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > null sort order: a > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 125 Data size: 1500 Basic > stats: PARTIAL Column stats: COMPLETE > value expressions: _col1 (type: bigint) > Execution mode: vectorized, llap > LLAP IO: may be used (ACID table) > Reducer 2 > Execution mode: vectorized, llap > Reduce Operator Tree: > Group By Operator > aggregations: count(VALUE._col0) > keys: KEY._col0 (type: int) > mode: mergepartial > outputColumnNames: _col0, _col1 > Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL > Column stats: COMPLETE > Top N Key Operator > sort order: - > keys: _col1 (type: bigint) > null sort order: a > Statistics: Num rows: 62 Data size: 744 Basic stats: > PARTIAL Column stats: COMPLETE > top n: 10 > Reduce Output Operator > key expressions: _col1 (type: bigint) > null sort order: a > sort order: - > Statistics: Num rows: 62 Data size: 744 Basic stats: > PARTIAL Column stats: COMPLETE > TopN Hash Memory Usage: 0.04 > value expressions: _col0 (type: int) > Reducer 3 > Execution mode: vectorized, llap > Reduce Operator Tree: > Select Operator > expressions: VALUE._col0 (type: int), KEY.reducesinkkey0 > (type: bigint) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL > Column stats: COMPLETE > Limit > Number of rows: 10 > Statistics: Num rows: 10 Data size: 120 Basic stats: > PARTIAL Column stats: COMPLETE > File Output Operator > compressed: false > Statistics: Num rows: 10 Data size: 120 Basic stats: > PARTIAL Column stats: COMPLETE > table: > input format: > org.apache.hadoop.mapred.SequenceFileInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: 10 > Processor Tree: > ListSink > {noformat} > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/fs/FSStatsAggregator.java#L138 > FSStats should return "null" or "" if partitions are not present in its list. -- This message was sent by Atlassian Jira (v8.3.4#803005)