[jira] [Work logged] (HIVE-24764) insert overwrite on a partition resets row count stats in other partitions

2021-02-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24764?focusedWorklogId=553957&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-553957
 ]

ASF GitHub Bot logged work on HIVE-24764:
-

Author: ASF GitHub Bot
Created on: 17/Feb/21 23:07
Start Date: 17/Feb/21 23:07
Worklog Time Spent: 10m 
  Work Description: ashutoshc commented on pull request #1967:
URL: https://github.com/apache/hive/pull/1967#issuecomment-780913343


   +1 LGTM



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 553957)
Time Spent: 40m  (was: 0.5h)

> insert overwrite on a partition resets row count stats in other partitions
> --
>
> Key: HIVE-24764
> URL: https://issues.apache.org/jira/browse/HIVE-24764
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> After insert overwrite on a partition, stats on other partitions are messed 
> up. Subsequent queries end up with plans with PARTIAL stats. In certain 
> cases, this leads to suboptimal query plans.
> {noformat}
> drop table if exists test_stats;
> drop table if exists test_stats_2;
> create table test_stats(i int, j bigint);
> create table test_stats_2(i int) partitioned by (j bigint);
> insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);
> -- select * from test_stats;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is not null;
> -- After executing this statement, stat gets messed up.
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is null;
> -- select * from test_stats_2;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   
> -- This would return "PARTIAL" stats instead of "COMPLETE"
> explain select i, count(*) as c from test_stats_2 group by i order by c desc 
> limit 10;
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
>   DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test_stats_2
>   Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>   Select Operator
> expressions: i (type: int)
> outputColumnNames: i
> Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: i (type: int)
>   minReductionHashAggr: 0.99
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 125 Data size: 1500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 125 Data size: 1500 Basic 
> stats: PARTIAL Column stats: COMPLETE
> value expressions: _col1 (type: bigint)
> Execution mode: vectorized, llap
> LLAP IO: may be used (ACID table)
> Reducer 2 
> Execution mode: vectorized, llap
> Reduce Operator Tree:
>   Group By Operator
> aggregations: count(VALUE._col0)
> keys: KEY._col0 (type: int)
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL 
> Column stats: COMPLETE
> Top N Key Operator
>   sort order: -
>   keys: _col1 (type: bigint)
>   null sort order: a
>   Statistics: Num rows: 62 Data size: 744 Basic stats: 
> PARTIAL Column stats: COMPLETE
>

[jira] [Work logged] (HIVE-24764) insert overwrite on a partition resets row count stats in other partitions

2021-02-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24764?focusedWorklogId=552863&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-552863
 ]

ASF GitHub Bot logged work on HIVE-24764:
-

Author: ASF GitHub Bot
Created on: 16/Feb/21 08:43
Start Date: 16/Feb/21 08:43
Worklog Time Spent: 10m 
  Work Description: rbalamohan commented on pull request #1967:
URL: https://github.com/apache/hive/pull/1967#issuecomment-779679400


   Closing this due to some issue in local repo.  
   
   Created https://github.com/apache/hive/pull/1980 to track this issue.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 552863)
Time Spent: 0.5h  (was: 20m)

> insert overwrite on a partition resets row count stats in other partitions
> --
>
> Key: HIVE-24764
> URL: https://issues.apache.org/jira/browse/HIVE-24764
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> After insert overwrite on a partition, stats on other partitions are messed 
> up. Subsequent queries end up with plans with PARTIAL stats. In certain 
> cases, this leads to suboptimal query plans.
> {noformat}
> drop table if exists test_stats;
> drop table if exists test_stats_2;
> create table test_stats(i int, j bigint);
> create table test_stats_2(i int) partitioned by (j bigint);
> insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);
> -- select * from test_stats;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is not null;
> -- After executing this statement, stat gets messed up.
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is null;
> -- select * from test_stats_2;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   
> -- This would return "PARTIAL" stats instead of "COMPLETE"
> explain select i, count(*) as c from test_stats_2 group by i order by c desc 
> limit 10;
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
>   DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test_stats_2
>   Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>   Select Operator
> expressions: i (type: int)
> outputColumnNames: i
> Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: i (type: int)
>   minReductionHashAggr: 0.99
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 125 Data size: 1500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 125 Data size: 1500 Basic 
> stats: PARTIAL Column stats: COMPLETE
> value expressions: _col1 (type: bigint)
> Execution mode: vectorized, llap
> LLAP IO: may be used (ACID table)
> Reducer 2 
> Execution mode: vectorized, llap
> Reduce Operator Tree:
>   Group By Operator
> aggregations: count(VALUE._col0)
> keys: KEY._col0 (type: int)
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL 
> Column stats: COMPLETE
> Top N Key Operator
>   sort order: -
>   keys: _col1 (type: bigint)
>   null sort order: a
>

[jira] [Work logged] (HIVE-24764) insert overwrite on a partition resets row count stats in other partitions

2021-02-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24764?focusedWorklogId=552861&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-552861
 ]

ASF GitHub Bot logged work on HIVE-24764:
-

Author: ASF GitHub Bot
Created on: 16/Feb/21 08:42
Start Date: 16/Feb/21 08:42
Worklog Time Spent: 10m 
  Work Description: rbalamohan closed pull request #1967:
URL: https://github.com/apache/hive/pull/1967


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 552861)
Time Spent: 20m  (was: 10m)

> insert overwrite on a partition resets row count stats in other partitions
> --
>
> Key: HIVE-24764
> URL: https://issues.apache.org/jira/browse/HIVE-24764
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After insert overwrite on a partition, stats on other partitions are messed 
> up. Subsequent queries end up with plans with PARTIAL stats. In certain 
> cases, this leads to suboptimal query plans.
> {noformat}
> drop table if exists test_stats;
> drop table if exists test_stats_2;
> create table test_stats(i int, j bigint);
> create table test_stats_2(i int) partitioned by (j bigint);
> insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);
> -- select * from test_stats;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is not null;
> -- After executing this statement, stat gets messed up.
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is null;
> -- select * from test_stats_2;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   
> -- This would return "PARTIAL" stats instead of "COMPLETE"
> explain select i, count(*) as c from test_stats_2 group by i order by c desc 
> limit 10;
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
>   DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test_stats_2
>   Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>   Select Operator
> expressions: i (type: int)
> outputColumnNames: i
> Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: i (type: int)
>   minReductionHashAggr: 0.99
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 125 Data size: 1500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 125 Data size: 1500 Basic 
> stats: PARTIAL Column stats: COMPLETE
> value expressions: _col1 (type: bigint)
> Execution mode: vectorized, llap
> LLAP IO: may be used (ACID table)
> Reducer 2 
> Execution mode: vectorized, llap
> Reduce Operator Tree:
>   Group By Operator
> aggregations: count(VALUE._col0)
> keys: KEY._col0 (type: int)
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL 
> Column stats: COMPLETE
> Top N Key Operator
>   sort order: -
>   keys: _col1 (type: bigint)
>   null sort order: a
>   Statistics: Num rows: 62 Data size: 744 Basic stats: 
> PARTIAL Column stats: COMPLETE
>   top n: 10
>  

[jira] [Work logged] (HIVE-24764) insert overwrite on a partition resets row count stats in other partitions

2021-02-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24764?focusedWorklogId=550653&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-550653
 ]

ASF GitHub Bot logged work on HIVE-24764:
-

Author: ASF GitHub Bot
Created on: 10/Feb/21 10:00
Start Date: 10/Feb/21 10:00
Worklog Time Spent: 10m 
  Work Description: rbalamohan opened a new pull request #1967:
URL: https://github.com/apache/hive/pull/1967


   https://issues.apache.org/jira/browse/HIVE-24764
   insert overwrite on a partition resets row count stats in other partitions
   
   ### What changes were proposed in this pull request?
   FSStatsAggregator::aggregateStats should return the value when the 
partitions are present in its statslist. Otherwise, it should return empty or 
null value. This would prevent stats from being overwritten in 
BasicStatsTask::updateStats for other partitions during 'insert overwrite' 
operation.
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 550653)
Remaining Estimate: 0h
Time Spent: 10m

> insert overwrite on a partition resets row count stats in other partitions
> --
>
> Key: HIVE-24764
> URL: https://issues.apache.org/jira/browse/HIVE-24764
> Project: Hive
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After insert overwrite on a partition, stats on other partitions are messed 
> up. Subsequent queries end up with plans with PARTIAL stats. In certain 
> cases, this leads to suboptimal query plans.
> {noformat}
> drop table if exists test_stats;
> drop table if exists test_stats_2;
> create table test_stats(i int, j bigint);
> create table test_stats_2(i int) partitioned by (j bigint);
> insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);
> -- select * from test_stats;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is not null;
> -- After executing this statement, stat gets messed up.
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is null;
> -- select * from test_stats_2;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   
> -- This would return "PARTIAL" stats instead of "COMPLETE"
> explain select i, count(*) as c from test_stats_2 group by i order by c desc 
> limit 10;
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
>   DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test_stats_2
>   Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>   Select Operator
> expressions: i (type: int)
> outputColumnNames: i
> Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
> Group By Operator
>   aggregations: count()
>   keys: i (type: int)
>   minReductionHashAggr: 0.99
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 125 Data size: 1500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 125 Data size: 1500 Basic 
> stats: PARTIAL Column stats: COMPLETE
> value expressions: _col1 (type: bigint)
> Execution mode: vectorized, llap
> LLAP IO: may be used (ACID table)
> Reducer 2 
> Execution mode: vectorized, llap
> Reduce Operator Tree:
>   Group By Operator
> aggregations: count(VALUE._col0)
> keys: KEY._col0 (type: int)
> m