[jira] [Work logged] (HIVE-24764) insert overwrite on a partition resets row count stats in other partitions
[ https://issues.apache.org/jira/browse/HIVE-24764?focusedWorklogId=553957&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-553957 ] ASF GitHub Bot logged work on HIVE-24764: - Author: ASF GitHub Bot Created on: 17/Feb/21 23:07 Start Date: 17/Feb/21 23:07 Worklog Time Spent: 10m Work Description: ashutoshc commented on pull request #1967: URL: https://github.com/apache/hive/pull/1967#issuecomment-780913343 +1 LGTM This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 553957) Time Spent: 40m (was: 0.5h) > insert overwrite on a partition resets row count stats in other partitions > -- > > Key: HIVE-24764 > URL: https://issues.apache.org/jira/browse/HIVE-24764 > Project: Hive > Issue Type: Bug >Reporter: Rajesh Balamohan >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > After insert overwrite on a partition, stats on other partitions are messed > up. Subsequent queries end up with plans with PARTIAL stats. In certain > cases, this leads to suboptimal query plans. > {noformat} > drop table if exists test_stats; > drop table if exists test_stats_2; > create table test_stats(i int, j bigint); > create table test_stats_2(i int) partitioned by (j bigint); > insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL); > -- select * from test_stats; > 1 1 > 2 2 > 3 3 > 4 4 > 5 > insert overwrite table test_stats_2 partition(j) select i, j from test_stats > where j is not null; > -- After executing this statement, stat gets messed up. > insert overwrite table test_stats_2 partition(j) select i, j from test_stats > where j is null; > -- select * from test_stats_2; > 1 1 > 2 2 > 3 3 > 4 4 > 5 > -- This would return "PARTIAL" stats instead of "COMPLETE" > explain select i, count(*) as c from test_stats_2 group by i order by c desc > limit 10; > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Tez > DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 > Edges: > Reducer 2 <- Map 1 (SIMPLE_EDGE) > Reducer 3 <- Reducer 2 (SIMPLE_EDGE) > DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: test_stats_2 > Statistics: Num rows: 125 Data size: 500 Basic stats: > PARTIAL Column stats: COMPLETE > Select Operator > expressions: i (type: int) > outputColumnNames: i > Statistics: Num rows: 125 Data size: 500 Basic stats: > PARTIAL Column stats: COMPLETE > Group By Operator > aggregations: count() > keys: i (type: int) > minReductionHashAggr: 0.99 > mode: hash > outputColumnNames: _col0, _col1 > Statistics: Num rows: 125 Data size: 1500 Basic stats: > PARTIAL Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > null sort order: a > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 125 Data size: 1500 Basic > stats: PARTIAL Column stats: COMPLETE > value expressions: _col1 (type: bigint) > Execution mode: vectorized, llap > LLAP IO: may be used (ACID table) > Reducer 2 > Execution mode: vectorized, llap > Reduce Operator Tree: > Group By Operator > aggregations: count(VALUE._col0) > keys: KEY._col0 (type: int) > mode: mergepartial > outputColumnNames: _col0, _col1 > Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL > Column stats: COMPLETE > Top N Key Operator > sort order: - > keys: _col1 (type: bigint) > null sort order: a > Statistics: Num rows: 62 Data size: 744 Basic stats: > PARTIAL Column stats: COMPLETE >
[jira] [Work logged] (HIVE-24764) insert overwrite on a partition resets row count stats in other partitions
[ https://issues.apache.org/jira/browse/HIVE-24764?focusedWorklogId=552863&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-552863 ] ASF GitHub Bot logged work on HIVE-24764: - Author: ASF GitHub Bot Created on: 16/Feb/21 08:43 Start Date: 16/Feb/21 08:43 Worklog Time Spent: 10m Work Description: rbalamohan commented on pull request #1967: URL: https://github.com/apache/hive/pull/1967#issuecomment-779679400 Closing this due to some issue in local repo. Created https://github.com/apache/hive/pull/1980 to track this issue. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 552863) Time Spent: 0.5h (was: 20m) > insert overwrite on a partition resets row count stats in other partitions > -- > > Key: HIVE-24764 > URL: https://issues.apache.org/jira/browse/HIVE-24764 > Project: Hive > Issue Type: Bug >Reporter: Rajesh Balamohan >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > After insert overwrite on a partition, stats on other partitions are messed > up. Subsequent queries end up with plans with PARTIAL stats. In certain > cases, this leads to suboptimal query plans. > {noformat} > drop table if exists test_stats; > drop table if exists test_stats_2; > create table test_stats(i int, j bigint); > create table test_stats_2(i int) partitioned by (j bigint); > insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL); > -- select * from test_stats; > 1 1 > 2 2 > 3 3 > 4 4 > 5 > insert overwrite table test_stats_2 partition(j) select i, j from test_stats > where j is not null; > -- After executing this statement, stat gets messed up. > insert overwrite table test_stats_2 partition(j) select i, j from test_stats > where j is null; > -- select * from test_stats_2; > 1 1 > 2 2 > 3 3 > 4 4 > 5 > -- This would return "PARTIAL" stats instead of "COMPLETE" > explain select i, count(*) as c from test_stats_2 group by i order by c desc > limit 10; > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Tez > DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 > Edges: > Reducer 2 <- Map 1 (SIMPLE_EDGE) > Reducer 3 <- Reducer 2 (SIMPLE_EDGE) > DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: test_stats_2 > Statistics: Num rows: 125 Data size: 500 Basic stats: > PARTIAL Column stats: COMPLETE > Select Operator > expressions: i (type: int) > outputColumnNames: i > Statistics: Num rows: 125 Data size: 500 Basic stats: > PARTIAL Column stats: COMPLETE > Group By Operator > aggregations: count() > keys: i (type: int) > minReductionHashAggr: 0.99 > mode: hash > outputColumnNames: _col0, _col1 > Statistics: Num rows: 125 Data size: 1500 Basic stats: > PARTIAL Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > null sort order: a > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 125 Data size: 1500 Basic > stats: PARTIAL Column stats: COMPLETE > value expressions: _col1 (type: bigint) > Execution mode: vectorized, llap > LLAP IO: may be used (ACID table) > Reducer 2 > Execution mode: vectorized, llap > Reduce Operator Tree: > Group By Operator > aggregations: count(VALUE._col0) > keys: KEY._col0 (type: int) > mode: mergepartial > outputColumnNames: _col0, _col1 > Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL > Column stats: COMPLETE > Top N Key Operator > sort order: - > keys: _col1 (type: bigint) > null sort order: a >
[jira] [Work logged] (HIVE-24764) insert overwrite on a partition resets row count stats in other partitions
[ https://issues.apache.org/jira/browse/HIVE-24764?focusedWorklogId=552861&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-552861 ] ASF GitHub Bot logged work on HIVE-24764: - Author: ASF GitHub Bot Created on: 16/Feb/21 08:42 Start Date: 16/Feb/21 08:42 Worklog Time Spent: 10m Work Description: rbalamohan closed pull request #1967: URL: https://github.com/apache/hive/pull/1967 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 552861) Time Spent: 20m (was: 10m) > insert overwrite on a partition resets row count stats in other partitions > -- > > Key: HIVE-24764 > URL: https://issues.apache.org/jira/browse/HIVE-24764 > Project: Hive > Issue Type: Bug >Reporter: Rajesh Balamohan >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > After insert overwrite on a partition, stats on other partitions are messed > up. Subsequent queries end up with plans with PARTIAL stats. In certain > cases, this leads to suboptimal query plans. > {noformat} > drop table if exists test_stats; > drop table if exists test_stats_2; > create table test_stats(i int, j bigint); > create table test_stats_2(i int) partitioned by (j bigint); > insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL); > -- select * from test_stats; > 1 1 > 2 2 > 3 3 > 4 4 > 5 > insert overwrite table test_stats_2 partition(j) select i, j from test_stats > where j is not null; > -- After executing this statement, stat gets messed up. > insert overwrite table test_stats_2 partition(j) select i, j from test_stats > where j is null; > -- select * from test_stats_2; > 1 1 > 2 2 > 3 3 > 4 4 > 5 > -- This would return "PARTIAL" stats instead of "COMPLETE" > explain select i, count(*) as c from test_stats_2 group by i order by c desc > limit 10; > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Tez > DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 > Edges: > Reducer 2 <- Map 1 (SIMPLE_EDGE) > Reducer 3 <- Reducer 2 (SIMPLE_EDGE) > DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: test_stats_2 > Statistics: Num rows: 125 Data size: 500 Basic stats: > PARTIAL Column stats: COMPLETE > Select Operator > expressions: i (type: int) > outputColumnNames: i > Statistics: Num rows: 125 Data size: 500 Basic stats: > PARTIAL Column stats: COMPLETE > Group By Operator > aggregations: count() > keys: i (type: int) > minReductionHashAggr: 0.99 > mode: hash > outputColumnNames: _col0, _col1 > Statistics: Num rows: 125 Data size: 1500 Basic stats: > PARTIAL Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > null sort order: a > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 125 Data size: 1500 Basic > stats: PARTIAL Column stats: COMPLETE > value expressions: _col1 (type: bigint) > Execution mode: vectorized, llap > LLAP IO: may be used (ACID table) > Reducer 2 > Execution mode: vectorized, llap > Reduce Operator Tree: > Group By Operator > aggregations: count(VALUE._col0) > keys: KEY._col0 (type: int) > mode: mergepartial > outputColumnNames: _col0, _col1 > Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL > Column stats: COMPLETE > Top N Key Operator > sort order: - > keys: _col1 (type: bigint) > null sort order: a > Statistics: Num rows: 62 Data size: 744 Basic stats: > PARTIAL Column stats: COMPLETE > top n: 10 >
[jira] [Work logged] (HIVE-24764) insert overwrite on a partition resets row count stats in other partitions
[ https://issues.apache.org/jira/browse/HIVE-24764?focusedWorklogId=550653&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-550653 ] ASF GitHub Bot logged work on HIVE-24764: - Author: ASF GitHub Bot Created on: 10/Feb/21 10:00 Start Date: 10/Feb/21 10:00 Worklog Time Spent: 10m Work Description: rbalamohan opened a new pull request #1967: URL: https://github.com/apache/hive/pull/1967 https://issues.apache.org/jira/browse/HIVE-24764 insert overwrite on a partition resets row count stats in other partitions ### What changes were proposed in this pull request? FSStatsAggregator::aggregateStats should return the value when the partitions are present in its statslist. Otherwise, it should return empty or null value. This would prevent stats from being overwritten in BasicStatsTask::updateStats for other partitions during 'insert overwrite' operation. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 550653) Remaining Estimate: 0h Time Spent: 10m > insert overwrite on a partition resets row count stats in other partitions > -- > > Key: HIVE-24764 > URL: https://issues.apache.org/jira/browse/HIVE-24764 > Project: Hive > Issue Type: Bug >Reporter: Rajesh Balamohan >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > After insert overwrite on a partition, stats on other partitions are messed > up. Subsequent queries end up with plans with PARTIAL stats. In certain > cases, this leads to suboptimal query plans. > {noformat} > drop table if exists test_stats; > drop table if exists test_stats_2; > create table test_stats(i int, j bigint); > create table test_stats_2(i int) partitioned by (j bigint); > insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL); > -- select * from test_stats; > 1 1 > 2 2 > 3 3 > 4 4 > 5 > insert overwrite table test_stats_2 partition(j) select i, j from test_stats > where j is not null; > -- After executing this statement, stat gets messed up. > insert overwrite table test_stats_2 partition(j) select i, j from test_stats > where j is null; > -- select * from test_stats_2; > 1 1 > 2 2 > 3 3 > 4 4 > 5 > -- This would return "PARTIAL" stats instead of "COMPLETE" > explain select i, count(*) as c from test_stats_2 group by i order by c desc > limit 10; > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Tez > DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 > Edges: > Reducer 2 <- Map 1 (SIMPLE_EDGE) > Reducer 3 <- Reducer 2 (SIMPLE_EDGE) > DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: test_stats_2 > Statistics: Num rows: 125 Data size: 500 Basic stats: > PARTIAL Column stats: COMPLETE > Select Operator > expressions: i (type: int) > outputColumnNames: i > Statistics: Num rows: 125 Data size: 500 Basic stats: > PARTIAL Column stats: COMPLETE > Group By Operator > aggregations: count() > keys: i (type: int) > minReductionHashAggr: 0.99 > mode: hash > outputColumnNames: _col0, _col1 > Statistics: Num rows: 125 Data size: 1500 Basic stats: > PARTIAL Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: int) > null sort order: a > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 125 Data size: 1500 Basic > stats: PARTIAL Column stats: COMPLETE > value expressions: _col1 (type: bigint) > Execution mode: vectorized, llap > LLAP IO: may be used (ACID table) > Reducer 2 > Execution mode: vectorized, llap > Reduce Operator Tree: > Group By Operator > aggregations: count(VALUE._col0) > keys: KEY._col0 (type: int) > m