Rajesh Balamohan created HIVE-24764:
---------------------------------------

             Summary: insert overwrite on a partition resets row count stats in 
other partitions
                 Key: HIVE-24764
                 URL: https://issues.apache.org/jira/browse/HIVE-24764
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan


After insert overwrite on a partition, stats on other partitions are messed up. 
Subsequent queries end up with plans with PARTIAL stats. In certain cases, this 
leads to suboptimal query plans.

{noformat}

drop table if exists test_stats;
drop table if exists test_stats_2;

create table test_stats(i int, j bigint);
create table test_stats_2(i int) partitioned by (j bigint);

insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);

-- select * from test_stats;
1   1
2   2
3   3
4   4
5   <null>


insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
where j is not null;


-- After executing this statement, stat gets messed up.
insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
where j is null;

-- select * from test_stats_2;
1   1
2   2
3   3
4   4
5   <null>


-- This would return "PARTIAL" stats instead of "COMPLETE"
explain select i, count(*) as c from test_stats_2 group by i order by c desc 
limit 10;


STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
      DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: test_stats_2
                  Statistics: Num rows: 125 Data size: 500 Basic stats: PARTIAL 
Column stats: COMPLETE
                  Select Operator
                    expressions: i (type: int)
                    outputColumnNames: i
                    Statistics: Num rows: 125 Data size: 500 Basic stats: 
PARTIAL Column stats: COMPLETE
                    Group By Operator
                      aggregations: count()
                      keys: i (type: int)
                      minReductionHashAggr: 0.99
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 125 Data size: 1500 Basic stats: 
PARTIAL Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        null sort order: a
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 125 Data size: 1500 Basic stats: 
PARTIAL Column stats: COMPLETE
                        value expressions: _col1 (type: bigint)
            Execution mode: vectorized, llap
            LLAP IO: may be used (ACID table)
        Reducer 2 
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: int)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL 
Column stats: COMPLETE
                Top N Key Operator
                  sort order: -
                  keys: _col1 (type: bigint)
                  null sort order: a
                  Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL 
Column stats: COMPLETE
                  top n: 10
                  Reduce Output Operator
                    key expressions: _col1 (type: bigint)
                    null sort order: a
                    sort order: -
                    Statistics: Num rows: 62 Data size: 744 Basic stats: 
PARTIAL Column stats: COMPLETE
                    TopN Hash Memory Usage: 0.04
                    value expressions: _col0 (type: int)
        Reducer 3 
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Select Operator
                expressions: VALUE._col0 (type: int), KEY.reducesinkkey0 (type: 
bigint)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL 
Column stats: COMPLETE
                Limit
                  Number of rows: 10
                  Statistics: Num rows: 10 Data size: 120 Basic stats: PARTIAL 
Column stats: COMPLETE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 10 Data size: 120 Basic stats: 
PARTIAL Column stats: COMPLETE
                    table:
                        input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink
{noformat}

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/fs/FSStatsAggregator.java#L138

FSStats should return "null" or "" if partitions are not present in its list.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to