[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=328797&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-328797 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 15/Oct/19 21:37
Start Date: 15/Oct/19 21:37
Worklog Time Spent: 10m

Work Description: asfgit commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
Worklog Id: (was: 328797)
Time Spent: 5h 20m (was: 5h 10m)

> Scale data size using column value ranges
>
> Key: HIVE-22239
> URL: https://issues.apache.org/jira/browse/HIVE-22239
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: Jesus Camacho Rodriguez
> Assignee: Jesus Camacho Rodriguez
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-22239.01.patch, HIVE-22239.02.patch, HIVE-22239.03.patch, HIVE-22239.04.patch, HIVE-22239.04.patch, HIVE-22239.05.patch, HIVE-22239.05.patch, HIVE-22239.06.patch, HIVE-22239.patch
>
> Time Spent: 5h 20m
> Remaining Estimate: 0h
>
> Currently, min/max values for columns are only used to determine whether a certain range filter falls out of range and thus filters all rows or none at all. If it does not, we just use a heuristic that the condition will filter 1/3 of the input rows. Instead of using that heuristic, we can use another one that assumes that data will be uniformly distributed across that range, and calculate the selectivity for the condition accordingly.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=326934&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326934 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 11/Oct/19 14:36
Start Date: 11/Oct/19 14:36
Worklog Time Spent: 10m

Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r334016828

File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

@@ -944,7 +948,7 @@ else if(colTypeLowerCase.equals(serdeConstants.SMALLINT_TYPE_NAME)){
 } else if (colTypeLowerCase.equals(serdeConstants.DATE_TYPE_NAME)) {
   cs.setAvgColLen(JavaDataModel.get().lengthOfDate());
   // epoch, days since epoch
-  cs.setRange(0, 25201);
+  cs.setRange(DATE_RANGE_LOWER_LIMIT, DATE_RANGE_UPPER_LIMIT);

Review comment: Yeah, this is a heuristic... No matter what you do, you will always get it wrong in some cases. I guess the idea is to target the most common case. The solution to overestimation/underestimation is to compute column stats as you mentioned; we also do not want to let the user tune this.

Issue Time Tracking
Worklog Id: (was: 326934)
Time Spent: 5h 10m (was: 5h)
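The diff above replaces the hard-coded `cs.setRange(0, 25201)` with named constants. A minimal sketch of what those defaults encode, assuming the constant values simply mirror the original literals (the constant names come from the diff; everything else here is illustrative, not Hive code):

```java
import java.time.LocalDate;

// Sketch of the default date-range heuristic: values are days since the
// Unix epoch, matching the comment "epoch, days since epoch" in the diff.
public class DateRangeDefaults {
    // Names taken from the diff; values assumed to mirror the original literals.
    static final long DATE_RANGE_LOWER_LIMIT = 0L;      // 1970-01-01
    static final long DATE_RANGE_UPPER_LIMIT = 25201L;  // 2038-12-31

    // Convert a days-since-epoch stat value back to a calendar date.
    static LocalDate fromDaysSinceEpoch(long days) {
        return LocalDate.ofEpochDay(days);
    }

    public static void main(String[] args) {
        System.out.println(fromDaysSinceEpoch(DATE_RANGE_LOWER_LIMIT)); // 1970-01-01
        System.out.println(fromDaysSinceEpoch(DATE_RANGE_UPPER_LIMIT)); // 2038-12-31
    }
}
```

In other words, the default range assumes date columns fall between the epoch and roughly the end of 2038, which is the "most common case" targeting the comment refers to.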
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=326933&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326933 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 11/Oct/19 14:23
Start Date: 11/Oct/19 14:23
Worklog Time Spent: 10m

Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r334016828

File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

@@ -944,7 +948,7 @@ else if(colTypeLowerCase.equals(serdeConstants.SMALLINT_TYPE_NAME)){
 } else if (colTypeLowerCase.equals(serdeConstants.DATE_TYPE_NAME)) {
   cs.setAvgColLen(JavaDataModel.get().lengthOfDate());
   // epoch, days since epoch
-  cs.setRange(0, 25201);
+  cs.setRange(DATE_RANGE_LOWER_LIMIT, DATE_RANGE_UPPER_LIMIT);

Review comment: Yeah, this is a heuristic... No matter what you do, you will always get it wrong in some cases. I guess the idea is to target the most common case. The solution to overestimation/underestimation is to compute column stats as you mentioned; we also do not want to let the user tune this.

Issue Time Tracking
Worklog Id: (was: 326933)
Time Spent: 5h (was: 4h 50m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=326800&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326800 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 11/Oct/19 09:00
Start Date: 11/Oct/19 09:00
Worklog Time Spent: 10m

Work Description: kgyrtkirk commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r333889646

File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java

@@ -2039,6 +2117,8 @@ public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx,
   pred = jop.getConf().getResidualFilterExprs().get(0);
 }
 // evaluate filter expression and update statistics
+final boolean uniformWithinRange = HiveConf.getBoolVar(

Review comment: unused variable (remove before committing)

Issue Time Tracking
Worklog Id: (was: 326800)
Time Spent: 4h 40m (was: 4.5h)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=326801&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326801 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 11/Oct/19 09:00
Start Date: 11/Oct/19 09:00
Worklog Time Spent: 10m

Work Description: kgyrtkirk commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r333892055

File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

@@ -944,7 +948,7 @@ else if(colTypeLowerCase.equals(serdeConstants.SMALLINT_TYPE_NAME)){
 } else if (colTypeLowerCase.equals(serdeConstants.DATE_TYPE_NAME)) {
   cs.setAvgColLen(JavaDataModel.get().lengthOfDate());
   // epoch, days since epoch
-  cs.setRange(0, 25201);
+  cs.setRange(DATE_RANGE_LOWER_LIMIT, DATE_RANGE_UPPER_LIMIT);

Review comment: I feel like we should be setting this range to the maximum possible:
* let's say the user has data from 1920-1940
* and submits a query with <1930, which would match 1/2 of the rows
* if the stats are estimated, Hive will go straight to 0 rows with the new uniform estimation logic (earlier it was 1/3 or something)

So I think either this should be set to the whole range, or there should be a toggle to change how this is supposed to work, or our archaeologist end users will get angry and grab some rusty :dagger: to cut our necks :D

...or we should tell them to calculate statistics properly?

Issue Time Tracking
Worklog Id: (was: 326801)
Time Spent: 4h 50m (was: 4h 40m)
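The scenario in the review comment can be made concrete with a small sketch of the uniform-within-range estimate. `estimateLessThan` below is a simplified stand-in for the logic in `StatsRulesProcFactory`, not the actual Hive method:

```java
import java.time.LocalDate;

// Sketch of uniform-within-range selectivity estimation for "col < value",
// illustrating the reviewer's 1920-1940 concern with days-since-epoch values.
public class RangeEstimateExample {
    // Rows estimated to satisfy "col < value", assuming values are spread
    // uniformly over [min, max].
    static long estimateLessThan(long value, long min, long max, long numRows) {
        if (value <= min) {
            return 0;              // predicate falls entirely below the range
        }
        if (value > max) {
            return numRows;        // predicate covers the whole range
        }
        // same shape as the Math.round(...) expression quoted later in the thread
        return Math.round(((double) (value - min) / (max - min)) * numRows);
    }

    public static void main(String[] args) {
        long numRows = 1000;
        long d1920 = LocalDate.of(1920, 1, 1).toEpochDay();
        long d1930 = LocalDate.of(1930, 1, 1).toEpochDay();
        long d1940 = LocalDate.of(1940, 1, 1).toEpochDay();

        // With the true range (1920-1940), "col < 1930" keeps about half the rows.
        System.out.println(estimateLessThan(d1930, d1920, d1940, numRows)); // 500

        // With a default range starting at the epoch (1970), 1930 lies below the
        // assumed minimum, so the estimate collapses to 0 rows.
        long d2038 = LocalDate.of(2038, 12, 31).toEpochDay();
        System.out.println(estimateLessThan(d1930, 0, d2038, numRows)); // 0
    }
}
```

With the true 1920-1940 range the `< 1930` predicate keeps about half the rows, but with a default range starting at the epoch the same predicate is estimated at zero rows, which is exactly the risk raised above.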
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=326799&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326799 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 11/Oct/19 09:00
Start Date: 11/Oct/19 09:00
Worklog Time Spent: 10m

Work Description: kgyrtkirk commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r333889584

File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java

@@ -1946,6 +2022,8 @@ public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx,
   pred = jop.getConf().getResidualFilterExprs().get(0);
 }
 // evaluate filter expression and update statistics
+final boolean uniformWithinRange = HiveConf.getBoolVar(

Review comment: unused variable (remove before committing)

Issue Time Tracking
Worklog Id: (was: 326799)
Time Spent: 4h 40m (was: 4.5h)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=326131&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-326131 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 10/Oct/19 06:00
Start Date: 10/Oct/19 06:00
Worklog Time Spent: 10m

Work Description: jcamachor commented on issue #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#issuecomment-540389710

@kgyrtkirk, I pushed a new commit containing only the range change and addressing your comments; I will upload a follow-up for the timestamp column stats propagation in a new PR. Can you take another look?

Issue Time Tracking
Worklog Id: (was: 326131)
Time Spent: 4.5h (was: 4h 20m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325750&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325750 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 09/Oct/19 15:08
Start Date: 09/Oct/19 15:08
Worklog Time Spent: 10m

Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r333070056

File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java

@@ -967,13 +979,23 @@ private long evaluateComparator(Statistics stats, AnnotateStatsProcCtx aspCtx, E
 if (minValue > value) {
   return 0;
 }
+if (uniformWithinRange) {
+  // Assuming uniform distribution, we can use the range to calculate
+  // new estimate for the number of rows
+  return Math.round(((double) (value - minValue) / (maxValue - minValue)) * numRows);

Review comment: Good catch. I fixed that in the latest patch.

Issue Time Tracking
Worklog Id: (was: 325750)
Time Spent: 4h 20m (was: 4h 10m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325714&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325714 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 09/Oct/19 14:04
Start Date: 09/Oct/19 14:04
Worklog Time Spent: 10m

Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r333032465

File path: ql/src/test/results/clientpositive/llap/subquery_select.q.out

@@ -3918,14 +3918,14 @@ STAGE PLANS:
   Statistics: Num rows: 26 Data size: 208 Basic stats: COMPLETE Column stats: COMPLETE
   Filter Operator
     predicate: p_partkey BETWEEN 1 AND 2 (type: boolean)
-    Statistics: Num rows: 8 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE
+    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
   Select Operator
     expressions: p_size (type: int)
     outputColumnNames: p_size
-    Statistics: Num rows: 8 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE
+    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
   Group By Operator
     aggregations: max(p_size)
-    minReductionHashAggr: 0.875
+    minReductionHashAggr: 0.0

Review comment: Range is `15103`-`195606` for `p_partkey` column, out of 26 rows. Hence, the estimate of `1` seems right.

Issue Time Tracking
Worklog Id: (was: 325714)
Time Spent: 4h 10m (was: 4h)
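The arithmetic behind the quoted plan change can be sketched as follows. `estimateBetween` is an illustrative stand-in for the uniform estimate on a two-sided predicate, and the floor of one row is inferred from the quoted plan rather than taken from the Hive source:

```java
// Sketch of uniform-within-range estimation for "col BETWEEN lo AND hi":
// the predicate interval is clamped to the column's [min, max] range, and a
// predicate disjoint from the range yields the one-row floor seen in the plan.
public class BetweenEstimateExample {
    static long estimateBetween(long lo, long hi, long min, long max, long numRows) {
        long clampedLo = Math.max(lo, min);
        long clampedHi = Math.min(hi, max);
        if (clampedLo > clampedHi) {
            return 1; // interval disjoint from the range; report a floor of one row
        }
        long rows = Math.round(((double) (clampedHi - clampedLo) / (max - min)) * numRows);
        return Math.max(rows, 1);
    }

    public static void main(String[] args) {
        // p_partkey BETWEEN 1 AND 2 with range 15103-195606 over 26 rows -> 1
        System.out.println(estimateBetween(1, 2, 15103, 195606, 26)); // 1
        // an interval covering roughly half the range keeps about half the rows
        System.out.println(estimateBetween(15103, 105354, 15103, 195606, 26)); // 13
    }
}
```

Since `[1, 2]` lies entirely below the column minimum of 15103, the uniform estimate is effectively zero, and the plan reports the one-row minimum instead of the old 8-row heuristic.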
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325679&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325679 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 09/Oct/19 13:10
Start Date: 09/Oct/19 13:10
Worklog Time Spent: 10m

Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r333003637

File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

@@ -856,8 +856,15 @@ public static ColStatistics getColStatistics(ColumnStatisticsObj cso, String tab
 } else if (colTypeLowerCase.equals(serdeConstants.BINARY_TYPE_NAME)) {
   cs.setAvgColLen(csd.getBinaryStats().getAvgColLen());
   cs.setNumNulls(csd.getBinaryStats().getNumNulls());
-} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME) ||
-    colTypeLowerCase.equals(serdeConstants.TIMESTAMPLOCALTZ_TYPE_NAME)) {
+} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {

Review comment: I think it is a good idea and we are not in a hurry... Let's do the right thing. I have created https://issues.apache.org/jira/browse/HIVE-22311.

Issue Time Tracking
Worklog Id: (was: 325679)
Time Spent: 4h (was: 3h 50m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325677&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325677 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 09/Oct/19 12:52
Start Date: 09/Oct/19 12:52
Worklog Time Spent: 10m

Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332980654

File path: ql/src/test/results/clientpositive/llap/retry_failure_stat_changes.q.out

@@ -139,25 +139,25 @@ Stage-0
 PARTITION_ONLY_SHUFFLE [RS_12]
   Group By Operator [GBY_11] (rows=1/1 width=8)
     Output:["_col0"],aggregations:["sum(_col0)"]
-    Select Operator [SEL_9] (rows=1/3 width=8)
+    Select Operator [SEL_9] (rows=4/3 width=8)
       Output:["_col0"]
-      Merge Join Operator [MERGEJOIN_30] (rows=1/3 width=8)
+      Merge Join Operator [MERGEJOIN_30] (rows=4/3 width=8)
         Conds:RS_6._col0=RS_7._col0(Inner),Output:["_col0","_col1"]
       <-Map 1 [SIMPLE_EDGE] llap
         SHUFFLE [RS_6]
           PartitionCols:_col0
-          Select Operator [SEL_2] (rows=1/5 width=4)
+          Select Operator [SEL_2] (rows=7/5 width=4)
            Output:["_col0"]
-           Filter Operator [FIL_18] (rows=1/5 width=4)
+           Filter Operator [FIL_18] (rows=7/5 width=4)

Review comment: Since it was by design, I have disabled the uniform stats for this specific test.

Issue Time Tracking
Worklog Id: (was: 325677)
Time Spent: 3h 50m (was: 3h 40m)

> This patch also includes the propagation of min/max column values from statistics to the optimizer for timestamp type.
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325676&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325676 ]

ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 09/Oct/19 12:51
Start Date: 09/Oct/19 12:51
Worklog Time Spent: 10m

Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332994124

File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/TimestampColumnStatsAggregator.java

@@ -0,0 +1,358 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hadoop.hive.metastore.columnstats.aggr;
+
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimator;
+import org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimatorFactory;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsData;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj;
+import org.apache.hadoop.hive.metastore.api.Timestamp;
+import org.apache.hadoop.hive.metastore.api.TimestampColumnStatsData;
+import org.apache.hadoop.hive.metastore.api.MetaException;
+import org.apache.hadoop.hive.metastore.columnstats.cache.TimestampColumnStatsDataInspector;
+import org.apache.hadoop.hive.metastore.utils.MetaStoreServerUtils.ColStatsObjWithSourceInfo;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import static org.apache.hadoop.hive.metastore.columnstats.ColumnsStatsUtils.timestampInspectorFromStats;
+
+public class TimestampColumnStatsAggregator extends ColumnStatsAggregator implements
+    IExtrapolatePartStatus {
+
+  private static final Logger LOG = LoggerFactory.getLogger(TimestampColumnStatsAggregator.class);
+
+  @Override
+  public ColumnStatisticsObj aggregate(List<ColStatsObjWithSourceInfo> colStatsWithSourceInfo,
+      List<String> partNames, boolean areAllPartsFound) throws MetaException {
+    ColumnStatisticsObj statsObj = null;
+    String colType = null;
+    String colName = null;
+    // check if all the ColumnStatisticsObjs contain stats and all the ndv are
+    // bitvectors
+    boolean doAllPartitionContainStats = partNames.size() == colStatsWithSourceInfo.size();
+    NumDistinctValueEstimator ndvEstimator = null;
+    for (ColStatsObjWithSourceInfo csp : colStatsWithSourceInfo) {
+      ColumnStatisticsObj cso = csp.getColStatsObj();
+      if (statsObj == null) {
+        colName = cso.getColName();
+        colType = cso.getColType();
+        statsObj = ColumnStatsAggregatorFactory.newColumnStaticsObj(colName, colType,
+            cso.getStatsData().getSetField());
+        LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName, doAllPartitionContainStats);
+      }
+      TimestampColumnStatsDataInspector timestampColumnStats = timestampInspectorFromStats(cso);
+
+      if (timestampColumnStats.getNdvEstimator() == null) {
+        ndvEstimator = null;
+        break;
+      } else {
+        // check if all of the bit vectors can merge
+        NumDistinctValueEstimator estimator = timestampColumnStats.getNdvEstimator();
+        if (ndvEstimator == null) {
+          ndvEstimator = estimator;
+        } else {
+          if (ndvEstimator.canMerge(estimator)) {
+            continue;
+          } else {
+            ndvEstimator = null;
+            break;
+          }
+        }
+      }
+    }
+    if (ndvEstimator != null) {
+      ndvEstimator = NumDistinctValueEstimatorFactory
+          .getEmptyNumDistinctValueEstimator(ndvEstimator);
+    }
+    LOG.debug("all of the bit vectors can merge for " + colName + " is " + (ndvEstimator != null));
+    ColumnStatisticsData columnStatisticsData = new ColumnStatisticsData();
+    if (doAllPartitionContainStats || colStatsWithSourceInfo.size() < 2) {
+      TimestampColumnStatsDataInspector aggregateData = null;
+      long lowerBound = 0;
+      lo
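The control flow of the pasted aggregator — merge per-partition NDV estimators only when every partition carries a compatible bit vector, otherwise fall back — can be sketched with toy types. `MockEstimator` is a stand-in for `NumDistinctValueEstimator`, and the max-based fallback is a simplification of what the real class does:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of the NDV aggregation pattern: per-partition distinct-value
// estimators are merged only if all of them are pairwise compatible.
public class NdvMergeSketch {
    static final class MockEstimator {
        final Set<Long> values = new HashSet<>(); // toy "bit vector" of seen values
        final int width;                          // compatibility criterion
        MockEstimator(int width) { this.width = width; }
        boolean canMerge(MockEstimator other) { return width == other.width; }
        void merge(MockEstimator other) { values.addAll(other.values); }
        long estimateNumDistinct() { return values.size(); }
    }

    // Merged NDV across partitions if all estimators are compatible,
    // otherwise the max per-partition NDV as a crude fallback.
    static long aggregateNdv(List<MockEstimator> parts) {
        MockEstimator merged = new MockEstimator(parts.get(0).width);
        for (MockEstimator p : parts) {
            if (!merged.canMerge(p)) {
                return parts.stream().mapToLong(MockEstimator::estimateNumDistinct).max().orElse(0);
            }
            merged.merge(p);
        }
        return merged.estimateNumDistinct();
    }

    public static void main(String[] args) {
        MockEstimator a = new MockEstimator(64);
        a.values.addAll(Arrays.asList(1L, 2L, 3L));
        MockEstimator b = new MockEstimator(64);
        b.values.addAll(Arrays.asList(3L, 4L));
        System.out.println(aggregateNdv(Arrays.asList(a, b))); // 4 (distinct union)
    }
}
```

Merging matters because summing per-partition NDVs would double-count values that appear in several partitions; a mergeable sketch gives the NDV of the union instead.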
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325667&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325667 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 09/Oct/19 12:45
Worklog Time Spent: 10m
Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332991505
File path: standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift

```
@@ -562,14 +562,27 @@ struct DateColumnStatsData {
   5: optional binary bitVectors
 }

+struct Timestamp {
+1: required i64 secondsSinceEpoch
```

Review comment: I do not think it is too complicated, but it would also imply changes to the metastore tables that store these values. I did not want to change the internal representation of column stats for the timestamp type in this patch; that is why I introduced the type but did not change the internal representation based on seconds. I created https://issues.apache.org/jira/browse/HIVE-22309.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Issue Time Tracking
Worklog Id: (was: 325667)
Time Spent: 3.5h (was: 3h 20m)

> Scale data size using column value ranges
> Key: HIVE-22239
> URL: https://issues.apache.org/jira/browse/HIVE-22239
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: Jesus Camacho Rodriguez
> Assignee: Jesus Camacho Rodriguez
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-22239.01.patch, HIVE-22239.02.patch, HIVE-22239.patch
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> Currently, min/max values for columns are only used to determine whether a certain range filter falls out of range and thus filters all rows or none at all. If it does not, we just use a heuristic that the condition will filter 1/3 of the input rows. Instead of using that heuristic, we can use another one that assumes that data will be uniformly distributed across that range, and calculate the selectivity for the condition accordingly.
> This patch also includes the propagation of min/max column values from statistics to the optimizer for timestamp type.

This message was sent by Atlassian Jira (v8.3.4#803005)
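The issue description above replaces the fixed filter-one-third heuristic with a uniform-distribution assumption: for a range predicate over a column with known min/max, the selectivity is the fraction of the [min, max] interval the predicate covers. A hedged sketch of that calculation for a `col <= value` predicate (an illustration of the idea, not Hive's StatsRulesProcFactory code):

```java
public class UniformSelectivity {
    /**
     * Estimated rows surviving "col <= value", assuming values are uniformly
     * distributed over [min, max]. Falls back to the legacy 1/3 heuristic
     * when the range is unknown.
     */
    public static long estimateLessOrEqual(long numRows, Long min, Long max, long value) {
        if (min == null || max == null || max < min) {
            return numRows / 3;                 // legacy heuristic: no usable range
        }
        if (value < min) {
            return 0;                           // predicate filters every row
        }
        if (value >= max) {
            return numRows;                     // predicate filters no rows
        }
        // fraction of the value range covered by the predicate
        double selectivity = (double) (value - min + 1) / (max - min + 1);
        return Math.round(numRows * selectivity);
    }
}
```

For example, with 100 rows uniformly spread over [0, 99], the predicate `col <= 49` covers half the range and is estimated to keep 50 rows, rather than the flat 33 the old heuristic would produce.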
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325663&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325663 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 09/Oct/19 12:32
Worklog Time Spent: 10m
Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332979645
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

```
@@ -856,8 +856,15 @@ public static ColStatistics getColStatistics(ColumnStatisticsObj cso, String tab
 } else if (colTypeLowerCase.equals(serdeConstants.BINARY_TYPE_NAME)) {
   cs.setAvgColLen(csd.getBinaryStats().getAvgColLen());
   cs.setNumNulls(csd.getBinaryStats().getNumNulls());
-} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME) ||
-    colTypeLowerCase.equals(serdeConstants.TIMESTAMPLOCALTZ_TYPE_NAME)) {
+} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
+  cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp());
+  cs.setNumNulls(csd.getTimestampStats().getNumNulls());
+  Long lowVal = (csd.getTimestampStats().getLowValue() != null) ? csd.getTimestampStats().getLowValue()
+      .getSecondsSinceEpoch() : null;
+  Long highVal = (csd.getTimestampStats().getHighValue() != null) ? csd.getTimestampStats().getHighValue()
+      .getSecondsSinceEpoch() : null;
+  cs.setRange(lowVal, highVal);
```

Review comment: Yeah, I did not want to change the information that we store for the timestamp type (a long representing seconds since epoch); note that this patch only changes the way we read the data. I agree we could use finer granularity for different types.

Issue Time Tracking
Worklog Id: (was: 325663)
Time Spent: 3h 20m (was: 3h 10m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325660&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325660 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 09/Oct/19 12:20
Worklog Time Spent: 10m
Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332980654
File path: ql/src/test/results/clientpositive/llap/retry_failure_stat_changes.q.out

```
@@ -139,25 +139,25 @@ Stage-0
 PARTITION_ONLY_SHUFFLE [RS_12]
   Group By Operator [GBY_11] (rows=1/1 width=8)
     Output:["_col0"],aggregations:["sum(_col0)"]
-    Select Operator [SEL_9] (rows=1/3 width=8)
+    Select Operator [SEL_9] (rows=4/3 width=8)
       Output:["_col0"]
-      Merge Join Operator [MERGEJOIN_30] (rows=1/3 width=8)
+      Merge Join Operator [MERGEJOIN_30] (rows=4/3 width=8)
         Conds:RS_6._col0=RS_7._col0(Inner),Output:["_col0","_col1"]
         <-Map 1 [SIMPLE_EDGE] llap
           SHUFFLE [RS_6]
             PartitionCols:_col0
-            Select Operator [SEL_2] (rows=1/5 width=4)
+            Select Operator [SEL_2] (rows=7/5 width=4)
               Output:["_col0"]
-              Filter Operator [FIL_18] (rows=1/5 width=4)
+              Filter Operator [FIL_18] (rows=7/5 width=4)
```

Review comment: If it was by design, do you think I should disable the uniform stats for this specific test?

Issue Time Tracking
Worklog Id: (was: 325660)
Time Spent: 3h 10m (was: 3h)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325658&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325658 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 09/Oct/19 12:18
Worklog Time Spent: 10m
Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332979645
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

```
@@ -856,8 +856,15 @@ public static ColStatistics getColStatistics(ColumnStatisticsObj cso, String tab
 } else if (colTypeLowerCase.equals(serdeConstants.BINARY_TYPE_NAME)) {
   cs.setAvgColLen(csd.getBinaryStats().getAvgColLen());
   cs.setNumNulls(csd.getBinaryStats().getNumNulls());
-} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME) ||
-    colTypeLowerCase.equals(serdeConstants.TIMESTAMPLOCALTZ_TYPE_NAME)) {
+} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
+  cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp());
+  cs.setNumNulls(csd.getTimestampStats().getNumNulls());
+  Long lowVal = (csd.getTimestampStats().getLowValue() != null) ? csd.getTimestampStats().getLowValue()
+      .getSecondsSinceEpoch() : null;
+  Long highVal = (csd.getTimestampStats().getHighValue() != null) ? csd.getTimestampStats().getHighValue()
+      .getSecondsSinceEpoch() : null;
+  cs.setRange(lowVal, highVal);
```

Review comment: Yeah, I did not want to change the information that we store for the timestamp type; note that this patch only changes the way we read the data. I agree we could use finer granularity for different types.

Issue Time Tracking
Worklog Id: (was: 325658)
Time Spent: 3h (was: 2h 50m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325379&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325379 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 08/Oct/19 23:05
Worklog Time Spent: 10m
Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332770578
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

```
@@ -935,8 +942,11 @@ else if(colTypeLowerCase.equals(serdeConstants.SMALLINT_TYPE_NAME)){
   cs.setNumTrues(Math.max(1, numRows/2));
   cs.setNumFalses(Math.max(1, numRows/2));
   cs.setAvgColLen(JavaDataModel.get().primitive1());
-} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME) ||
-    colTypeLowerCase.equals(serdeConstants.TIMESTAMPLOCALTZ_TYPE_NAME)) {
+} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
+  cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp());
+  // epoch, seconds since epoch
+  cs.setRange(0, 2177452799L);
```

Review comment: I answered the same comment from @miklosgergely above; please see my response. I used a new heuristic; I do not think the existing one was very good...

Issue Time Tracking
Worklog Id: (was: 325379)
Time Spent: 2h 50m (was: 2h 40m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325378&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325378 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 08/Oct/19 23:02
Worklog Time Spent: 10m
Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332736102
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

```
@@ -935,8 +942,11 @@ else if(colTypeLowerCase.equals(serdeConstants.SMALLINT_TYPE_NAME)){
   cs.setNumTrues(Math.max(1, numRows/2));
   cs.setNumFalses(Math.max(1, numRows/2));
   cs.setAvgColLen(JavaDataModel.get().primitive1());
-} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME) ||
-    colTypeLowerCase.equals(serdeConstants.TIMESTAMPLOCALTZ_TYPE_NAME)) {
+} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
+  cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp());
+  // epoch, seconds since epoch
+  cs.setRange(0, 2177452799L);
```

Review comment: I do not think this is critical, but I explored a bit and this seems to be a poor choice for a heuristic, since in most cases it will lead to underestimation of the data size (most users will not have dates starting from 1970). I will take `01-01-1999` as the lower limit and `12-31-2024` as the upper limit (mentioned by Gopal). Let me know if you see any cons with this approach.

Issue Time Tracking
Worklog Id: (was: 325378)
Time Spent: 2h 40m (was: 2.5h)
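The limits discussed above are plain epoch offsets, so candidate range bounds can be checked with java.time. A small sketch, assuming the limits are interpreted as UTC day boundaries; as a sanity check, the original `cs.setRange(0, 2177452799L)` upper bound corresponds to the last second of 2038-12-31 UTC:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class EpochLimits {
    // Last second of a given calendar date, as seconds since the Unix epoch (UTC).
    // Usable as a timestamp-range upper limit.
    public static long lastSecondOf(int year, int month, int day) {
        return LocalDateTime.of(year, month, day, 23, 59, 59).toEpochSecond(ZoneOffset.UTC);
    }

    // Days since the Unix epoch for a given date (the unit date-typed stats use).
    public static long epochDay(int year, int month, int day) {
        return LocalDate.of(year, month, day).toEpochDay();
    }
}
```

For instance, `lastSecondOf(2038, 12, 31)` reproduces the 2177452799L constant in the diff, and the same helpers yield whatever seconds/days values the proposed `01-01-1999` / `12-31-2024` bounds would translate to.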
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325350&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325350 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 08/Oct/19 21:44
Worklog Time Spent: 10m
Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332747726
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java

```
@@ -316,7 +321,7 @@ public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx,
 protected long evaluateExpression(Statistics stats, ExprNodeDesc pred,
     AnnotateStatsProcCtx aspCtx, List neededCols,
-    Operator op, long currNumRows) throws SemanticException {
+    Operator op, long currNumRows, boolean uniformWithinRange) throws SemanticException {
```

Review comment: I have used the `AnnotateStatsProcCtx` to hold that value, thanks.

Issue Time Tracking
Worklog Id: (was: 325350)
Time Spent: 2.5h (was: 2h 20m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325332&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325332 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 08/Oct/19 21:16
Worklog Time Spent: 10m
Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332736102
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

```
@@ -935,8 +942,11 @@ else if(colTypeLowerCase.equals(serdeConstants.SMALLINT_TYPE_NAME)){
   cs.setNumTrues(Math.max(1, numRows/2));
   cs.setNumFalses(Math.max(1, numRows/2));
   cs.setAvgColLen(JavaDataModel.get().primitive1());
-} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME) ||
-    colTypeLowerCase.equals(serdeConstants.TIMESTAMPLOCALTZ_TYPE_NAME)) {
+} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
+  cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp());
+  // epoch, seconds since epoch
+  cs.setRange(0, 2177452799L);
```

Review comment: I do not think this is critical, but I explored a bit and this seems to be a poor choice for a heuristic, since in most cases it will lead to underestimation of the data size (most users will not have dates starting from 1970). I will take `01-01-2015` (epoch for ORC) as the lower limit and `12-31-2024` as the upper limit. Let me know if you see any cons with this approach.

Issue Time Tracking
Worklog Id: (was: 325332)
Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=325331&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-325331 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 08/Oct/19 21:12
Worklog Time Spent: 10m
Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332736102
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

```
@@ -935,8 +942,11 @@ else if(colTypeLowerCase.equals(serdeConstants.SMALLINT_TYPE_NAME)){
   cs.setNumTrues(Math.max(1, numRows/2));
   cs.setNumFalses(Math.max(1, numRows/2));
   cs.setAvgColLen(JavaDataModel.get().primitive1());
-} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME) ||
-    colTypeLowerCase.equals(serdeConstants.TIMESTAMPLOCALTZ_TYPE_NAME)) {
+} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
+  cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp());
+  // epoch, seconds since epoch
+  cs.setRange(0, 2177452799L);
```

Review comment: I do not think this is critical, but I explored a bit and this seems to be a poor choice for a heuristic, since in most cases it will lead to underestimation of the data size (most users will not have dates starting from 1970). I will take `01-01-2015` (epoch for ORC) as the lower limit and `12-31-2025` as the upper limit. Let me know if you see any cons with this approach.

Issue Time Tracking
Worklog Id: (was: 325331)
Time Spent: 2h 10m (was: 2h)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324949&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324949 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 08/Oct/19 08:47
Worklog Time Spent: 10m
Work Description: kgyrtkirk commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332368069
File path: ql/src/test/results/clientpositive/llap/semijoin_reddedup.q.out

```
@@ -258,9 +258,12 @@ STAGE PLANS:
 Tez
 A masked pattern was here
 Edges:
+Map 1 <- Reducer 10 (BROADCAST_EDGE)
+Map 11 <- Reducer 10 (BROADCAST_EDGE)
+Reducer 10 <- Reducer 9 (CUSTOM_SIMPLE_EDGE)
 Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE)
 Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Reducer 9 (SIMPLE_EDGE)
-Reducer 4 <- Map 10 (SIMPLE_EDGE), Reducer 3 (SIMPLE_EDGE)
+Reducer 4 <- Map 11 (SIMPLE_EDGE), Reducer 3 (SIMPLE_EDGE)
```

Review comment: I'm not 100% sure what the goal of this test is, but I think we should probably disable the uniform distribution for it so that it retains its original goal.

Issue Time Tracking
Worklog Id: (was: 324949)
Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324951&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324951 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 08/Oct/19 08:47
Worklog Time Spent: 10m
Work Description: kgyrtkirk commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332377711
File path: standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift

```
@@ -562,14 +562,27 @@ struct DateColumnStatsData {
   5: optional binary bitVectors
 }

+struct Timestamp {
+1: required i64 secondsSinceEpoch
```

Review comment: I'm afraid there is a downside: we are throwing away precision, and because of that we may run into trouble later. If we truncate to seconds, we may not be able to extend the timestamp logic to the stats optimizer, since we would not be working with the real values. Consider the following:

```sql
select '2019-11-11 11:11:11.400' < '2019-11-11 11:11:11.300'
```

If we round to seconds, and the left side comes from a table as the column's max value, the stats optimizer could deduce `true` for the above. Would it complicate things much to use a non-rounded timestamp and retain milliseconds/microseconds as well?

Issue Time Tracking
Worklog Id: (was: 324951)
Time Spent: 1h 50m (was: 1h 40m)
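The precision concern raised above is easy to reproduce outside Hive: once a column's max value is truncated to whole seconds, a literal that falls inside the discarded fractional part can make a stats-based deduction unsound. A minimal illustration with plain java.time (not Hive code; the method name is illustrative):

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class TruncationDemo {
    /**
     * Models the unsound deduction: the optimizer sees only the max value
     * truncated to whole seconds, and concludes "every value < literal"
     * whenever truncatedMax < literal. Returns true when that conclusion
     * contradicts the ground truth computed from the full-precision max.
     */
    public static boolean unsoundAllRowsPass(LocalDateTime trueMax, LocalDateTime literal) {
        LocalDateTime truncatedMax = trueMax.truncatedTo(ChronoUnit.SECONDS);
        boolean deducedFromStats = truncatedMax.isBefore(literal);  // optimizer's view
        boolean actuallyTrue = trueMax.isBefore(literal);           // ground truth
        return deducedFromStats && !actuallyTrue;
    }
}
```

With a true max of `2019-11-11 11:11:11.400` and the literal `2019-11-11 11:11:11.300` from the comment's SQL example, the truncated max `11:11:11` is before the literal, so the stats-only deduction claims all rows pass the filter even though the real max does not.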
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324952&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324952 ] ASF GitHub Bot logged work on HIVE-22239:
Author: ASF GitHub Bot
Created on: 08/Oct/19 08:47
Worklog Time Spent: 10m
Work Description: kgyrtkirk commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332352613
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java

```
@@ -276,9 +279,11 @@ public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx,
 ExprNodeDesc pred = fop.getConf().getPredicate();

 // evaluate filter expression and update statistics
+final boolean uniformWithinRange = HiveConf.getBoolVar(
+    aspCtx.getConf(), HiveConf.ConfVars.HIVE_STATS_RANGE_SELECTIVITY_UNIFORM_DISTRIBUTION);
 aspCtx.clearAffectedColumns();
 long newNumRows = evaluateExpression(parentStats, pred, aspCtx,
-    neededCols, fop, parentStats.getNumRows());
+    neededCols, fop, parentStats.getNumRows(), uniformWithinRange);
```

Review comment: I don't think we should pass a boolean here, since `aspCtx` is already available there.

Issue Time Tracking
Worklog Id: (was: 324952)
Time Spent: 2h (was: 1h 50m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324943&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324943 ] ASF GitHub Bot logged work on HIVE-22239: - Author: ASF GitHub Bot Created on: 08/Oct/19 08:47 Start Date: 08/Oct/19 08:47 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #787: HIVE-22239 URL: https://github.com/apache/hive/pull/787#discussion_r332357556
## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java
## @@ -856,8 +856,15 @@ public static ColStatistics getColStatistics(ColumnStatisticsObj cso, String tab
 } else if (colTypeLowerCase.equals(serdeConstants.BINARY_TYPE_NAME)) {
   cs.setAvgColLen(csd.getBinaryStats().getAvgColLen());
   cs.setNumNulls(csd.getBinaryStats().getNumNulls());
-} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME) ||
-    colTypeLowerCase.equals(serdeConstants.TIMESTAMPLOCALTZ_TYPE_NAME)) {
+} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
+  cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp());
+  cs.setNumNulls(csd.getTimestampStats().getNumNulls());
+  Long lowVal = (csd.getTimestampStats().getLowValue() != null) ? csd.getTimestampStats().getLowValue().getSecondsSinceEpoch() : null;
+  Long highVal = (csd.getTimestampStats().getHighValue() != null) ? csd.getTimestampStats().getHighValue().getSecondsSinceEpoch() : null;
+  cs.setRange(lowVal, highVal);
Review comment: I don't feel this is fortunate: we do know the low/high value, but we flatten it to some number... Instead of this we would need properly typed ranges - for decimal we already throw away all our knowledge if it runs beyond long limits. The current changes follow the existing traditions - if we decide to change that, it should be done in a separate ticket.
Issue Time Tracking --- Worklog Id: (was: 324943) Time Spent: 1h (was: 50m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324941&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324941 ] ASF GitHub Bot logged work on HIVE-22239: - Author: ASF GitHub Bot Created on: 08/Oct/19 08:47 Start Date: 08/Oct/19 08:47 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #787: HIVE-22239 URL: https://github.com/apache/hive/pull/787#discussion_r332365302
## File path: ql/src/test/results/clientpositive/llap/retry_failure_stat_changes.q.out
## @@ -139,25 +139,25 @@
 Stage-0
   PARTITION_ONLY_SHUFFLE [RS_12]
     Group By Operator [GBY_11] (rows=1/1 width=8)
       Output:["_col0"],aggregations:["sum(_col0)"]
-      Select Operator [SEL_9] (rows=1/3 width=8)
+      Select Operator [SEL_9] (rows=4/3 width=8)
        Output:["_col0"]
-       Merge Join Operator [MERGEJOIN_30] (rows=1/3 width=8)
+       Merge Join Operator [MERGEJOIN_30] (rows=4/3 width=8)
         Conds:RS_6._col0=RS_7._col0(Inner),Output:["_col0","_col1"]
        <-Map 1 [SIMPLE_EDGE] llap
          SHUFFLE [RS_6]
            PartitionCols:_col0
-           Select Operator [SEL_2] (rows=1/5 width=4)
+           Select Operator [SEL_2] (rows=7/5 width=4)
             Output:["_col0"]
-            Filter Operator [FIL_18] (rows=1/5 width=4)
+            Filter Operator [FIL_18] (rows=7/5 width=4)
Review comment: there was a "by-design" underestimation in this test case... now we have a good estimate :)
Issue Time Tracking --- Worklog Id: (was: 324941) Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324953&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324953 ] ASF GitHub Bot logged work on HIVE-22239: - Author: ASF GitHub Bot Created on: 08/Oct/19 08:47 Start Date: 08/Oct/19 08:47 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #787: HIVE-22239 URL: https://github.com/apache/hive/pull/787#discussion_r332355126
## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
## @@ -967,13 +979,23 @@ private long evaluateComparator(Statistics stats, AnnotateStatsProcCtx aspCtx, E
 if (minValue > value) {
   return 0;
 }
+if (uniformWithinRange) {
+  // Assuming uniform distribution, we can use the range to calculate
+  // new estimate for the number of rows
+  return Math.round(((double) (value - minValue) / (maxValue - minValue)) * numRows);
Review comment: I think we will probably hit a divide by zero here when max=min; I don't see any preceding conditionals covering for that (however there can be...)
Issue Time Tracking --- Worklog Id: (was: 324953) Time Spent: 2h (was: 1h 50m)
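The divide-by-zero concern above can be made concrete with a small sketch. This is not the actual Hive code: the class name, method name, and boundary handling are illustrative, showing only how a max == min guard could sit in front of the uniform-range formula quoted in the diff.

```java
public class UniformRangeSelectivity {

    // Estimated rows matching "col < value" under a uniform-distribution
    // assumption over [minValue, maxValue]. Illustrative sketch only.
    static long estimateLessThan(long minValue, long maxValue, long value, long numRows) {
        if (minValue > value) {
            return 0; // whole range is at or above the constant: nothing matches
        }
        if (maxValue == minValue) {
            // Degenerate range: the formula below would divide by zero.
            // All rows share one value, so the predicate matches all or nothing.
            return value > minValue ? numRows : 0;
        }
        // Uniform-distribution estimate, as in the quoted diff.
        return Math.round(((double) (value - minValue) / (maxValue - minValue)) * numRows);
    }

    public static void main(String[] args) {
        System.out.println(estimateLessThan(0, 100, 25, 1000)); // 250
        System.out.println(estimateLessThan(5, 5, 5, 1000));    // 0 (guard path)
        System.out.println(estimateLessThan(5, 5, 9, 1000));    // 1000 (guard path)
    }
}
```

The guard matters precisely in the case the reviewer names: a single-valued column passes the `minValue > value` check but makes the denominator zero.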
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324944&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324944 ] ASF GitHub Bot logged work on HIVE-22239: - Author: ASF GitHub Bot Created on: 08/Oct/19 08:47 Start Date: 08/Oct/19 08:47 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #787: HIVE-22239 URL: https://github.com/apache/hive/pull/787#discussion_r332382671 ## File path: ql/src/test/results/clientpositive/llap/bucket_map_join_tez2.q.out ## @@ -759,50 +759,56 @@ STAGE PLANS: Statistics: Num rows: 500 Data size: 2000 Basic stats: COMPLETE Column stats: COMPLETE Filter Operator predicate: (key > 2) (type: boolean) -Statistics: Num rows: 166 Data size: 664 Basic stats: COMPLETE Column stats: COMPLETE +Statistics: Num rows: 498 Data size: 1992 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: key (type: int) outputColumnNames: _col0 - Statistics: Num rows: 166 Data size: 664 Basic stats: COMPLETE Column stats: COMPLETE - Map Join Operator -condition map: - Inner Join 0 to 1 -keys: - 0 _col0 (type: int) - 1 _col0 (type: int) -outputColumnNames: _col0, _col1 -input vertices: - 1 Map 2 -Statistics: Num rows: 272 Data size: 2176 Basic stats: COMPLETE Column stats: COMPLETE -File Output Operator - compressed: false - Statistics: Num rows: 272 Data size: 2176 Basic stats: COMPLETE Column stats: COMPLETE - table: - input format: org.apache.hadoop.mapred.SequenceFileInputFormat - output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat - serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe + Statistics: Num rows: 498 Data size: 1992 Basic stats: COMPLETE Column stats: COMPLETE + Reduce Output Operator +key expressions: _col0 (type: int) +sort order: + +Map-reduce partition columns: _col0 (type: int) +Statistics: Num rows: 498 Data size: 1992 Basic stats: COMPLETE Column stats: COMPLETE Review comment: disable enhancement; seems to be interfering 
with the test
Issue Time Tracking --- Worklog Id: (was: 324944) Time Spent: 1h (was: 50m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324950&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324950 ] ASF GitHub Bot logged work on HIVE-22239: - Author: ASF GitHub Bot Created on: 08/Oct/19 08:47 Start Date: 08/Oct/19 08:47 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #787: HIVE-22239 URL: https://github.com/apache/hive/pull/787#discussion_r332383859
## File path: ql/src/test/results/clientpositive/llap/subquery_select.q.out
## @@ -3918,14 +3918,14 @@
 Statistics: Num rows: 26 Data size: 208 Basic stats: COMPLETE Column stats: COMPLETE
 Filter Operator
   predicate: p_partkey BETWEEN 1 AND 2 (type: boolean)
-  Statistics: Num rows: 8 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
 Select Operator
   expressions: p_size (type: int)
   outputColumnNames: p_size
-  Statistics: Num rows: 8 Data size: 64 Basic stats: COMPLETE Column stats: COMPLETE
+  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
 Group By Operator
   aggregations: max(p_size)
-  minReductionHashAggr: 0.875
+  minReductionHashAggr: 0.0
Review comment: the computed range is most likely empty... these changes suggest to me that something is not entirely right - is this expected?
Issue Time Tracking --- Worklog Id: (was: 324950) Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324945&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324945 ] ASF GitHub Bot logged work on HIVE-22239: - Author: ASF GitHub Bot Created on: 08/Oct/19 08:47 Start Date: 08/Oct/19 08:47 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #787: HIVE-22239 URL: https://github.com/apache/hive/pull/787#discussion_r332355776
## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java
## @@ -856,8 +856,15 @@ public static ColStatistics getColStatistics(ColumnStatisticsObj cso, String tab
 } else if (colTypeLowerCase.equals(serdeConstants.BINARY_TYPE_NAME)) {
   cs.setAvgColLen(csd.getBinaryStats().getAvgColLen());
   cs.setNumNulls(csd.getBinaryStats().getNumNulls());
-} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME) ||
-    colTypeLowerCase.equals(serdeConstants.TIMESTAMPLOCALTZ_TYPE_NAME)) {
+} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
Review comment: I start to feel like having these 2 changes separately from each other might have been good.
Issue Time Tracking --- Worklog Id: (was: 324945) Time Spent: 1h (was: 50m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324946&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324946 ] ASF GitHub Bot logged work on HIVE-22239: - Author: ASF GitHub Bot Created on: 08/Oct/19 08:47 Start Date: 08/Oct/19 08:47 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #787: HIVE-22239 URL: https://github.com/apache/hive/pull/787#discussion_r332353993
## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
## @@ -316,7 +321,7 @@
 protected long evaluateExpression(Statistics stats, ExprNodeDesc pred,
     AnnotateStatsProcCtx aspCtx, List neededCols,
-    Operator op, long currNumRows) throws SemanticException {
+    Operator op, long currNumRows, boolean uniformWithinRange) throws SemanticException {
Review comment: this boolean is not used in this function; but aspCtx can be used to obtain it - I would suggest to either:
* leave the function signature as is and get the boolean when it's needed from the conf
* this class is constructed and used to optimize a single query - so we may pass the conf on construction and create a final field from this info
Issue Time Tracking --- Worklog Id: (was: 324946) Time Spent: 1h (was: 50m)
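The second option the reviewer lists - resolve the flag once when the processor is constructed, since the object serves a single query - might look roughly like this. The class name, the property lookup, and the estimate logic are hypothetical placeholders, not the actual StatsRulesProcFactory code:

```java
import java.util.Properties;

public class FilterStatsRule {

    // Resolved once at construction; the rule object is used for one query only.
    private final boolean uniformWithinRange;

    public FilterStatsRule(Properties conf) {
        this.uniformWithinRange =
            Boolean.parseBoolean(conf.getProperty("hive.stats.filter.range.uniform", "true"));
    }

    // No boolean parameter threaded through: callers stay unchanged and the
    // flag is read from the final field where it is actually needed.
    long evaluateExpression(long currNumRows) {
        // Placeholder estimates standing in for the real selectivity logic.
        return uniformWithinRange ? currNumRows / 2 : currNumRows / 3;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("hive.stats.filter.range.uniform", "false");
        System.out.println(new FilterStatsRule(conf).evaluateExpression(300)); // 100
        System.out.println(new FilterStatsRule(new Properties()).evaluateExpression(300)); // 150
    }
}
```

The design point is that the flag's scope matches the object's lifetime, so a constructor-initialized final field avoids widening every method signature on the call path.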
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324942&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324942 ] ASF GitHub Bot logged work on HIVE-22239: - Author: ASF GitHub Bot Created on: 08/Oct/19 08:47 Start Date: 08/Oct/19 08:47 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #787: HIVE-22239 URL: https://github.com/apache/hive/pull/787#discussion_r332352109
## File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
## @@ -2537,6 +2537,11 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal
 "When estimating output rows for a join involving multiple columns, the default behavior assumes" +
 "the columns are independent. Setting this flag to true will cause the estimator to assume" +
 "the columns are correlated."),
+HIVE_STATS_RANGE_SELECTIVITY_UNIFORM_DISTRIBUTION("hive.stats.filter.range.uniform", true,
+    "When estimating output rows from a condition, if a range predicate is applied over a column and the" +
+    "minimum and maximum values for that column are available, assume uniform distribution of values" +
+    "accross that range and scales number of rows proportionally. If this is set to false, default" +
+    "selectivity value is used."),
Review comment: small thing: could you add spaces at the ends of the lines; right now the concatenated text contains words like "theminimum" and "valuesaccross".
Issue Time Tracking --- Worklog Id: (was: 324942) Time Spent: 50m (was: 40m)
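The run-together words come from Java's adjacent string concatenation: without a trailing space on each fragment, the joined description has no separator at the seams. A minimal illustration (the strings are shortened stand-ins for the config description above):

```java
public class ConfDescriptionSpacing {

    // Missing trailing space: the words fuse at the join point.
    static final String BROKEN = "assume uniform distribution of values" +
        "across that range";

    // Trailing space on the first fragment keeps the words apart.
    static final String FIXED = "assume uniform distribution of values " +
        "across that range";

    public static void main(String[] args) {
        System.out.println(BROKEN); // ...valuesacross...
        System.out.println(FIXED);  // ...values across...
    }
}
```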
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324940&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324940 ] ASF GitHub Bot logged work on HIVE-22239: - Author: ASF GitHub Bot Created on: 08/Oct/19 08:47 Start Date: 08/Oct/19 08:47 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #787: HIVE-22239 URL: https://github.com/apache/hive/pull/787#discussion_r332362215
## File path: ql/src/test/results/clientpositive/alter_table_update_status.q.out
## @@ -453,8 +453,8 @@
 POSTHOOK: type: DESCTABLE
 POSTHOOK: Input: default@datatype_stats_n0
 col_name ts
 data_type timestamp
-min 1325379723
-max 1325379723
+min 2012-01-01 01:02:03
Review comment: human readable timestamp value
Issue Time Tracking --- Worklog Id: (was: 324940) Time Spent: 0.5h (was: 20m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324948&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324948 ] ASF GitHub Bot logged work on HIVE-22239: - Author: ASF GitHub Bot Created on: 08/Oct/19 08:47 Start Date: 08/Oct/19 08:47 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #787: HIVE-22239 URL: https://github.com/apache/hive/pull/787#discussion_r332358389
## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java
## @@ -935,8 +942,11 @@ else if(colTypeLowerCase.equals(serdeConstants.SMALLINT_TYPE_NAME)){
 cs.setNumTrues(Math.max(1, numRows/2));
 cs.setNumFalses(Math.max(1, numRows/2));
 cs.setAvgColLen(JavaDataModel.get().primitive1());
-} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME) ||
-    colTypeLowerCase.equals(serdeConstants.TIMESTAMPLOCALTZ_TYPE_NAME)) {
+} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
+  cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp());
+  // epoch, seconds since epoch
+  cs.setRange(0, 2177452799L);
Review comment: not 100% sure about this; but I think the min value should not be 0, or we may optimize away pre-1970 filter conditions... and why is it exactly 2177452799? :) I might be missing some context...
Issue Time Tracking --- Worklog Id: (was: 324948) Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324947&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324947 ] ASF GitHub Bot logged work on HIVE-22239: - Author: ASF GitHub Bot Created on: 08/Oct/19 08:47 Start Date: 08/Oct/19 08:47 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #787: HIVE-22239 URL: https://github.com/apache/hive/pull/787#discussion_r332373527 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/TimestampColumnStatsAggregator.java ## @@ -0,0 +1,358 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hadoop.hive.metastore.columnstats.aggr; + +import java.util.Collections; +import java.util.Comparator; +import java.util.HashMap; +import java.util.LinkedList; +import java.util.List; +import java.util.Map; +import org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimator; +import org.apache.hadoop.hive.common.ndv.NumDistinctValueEstimatorFactory; +import org.apache.hadoop.hive.metastore.api.ColumnStatisticsData; +import org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj; +import org.apache.hadoop.hive.metastore.api.Timestamp; +import org.apache.hadoop.hive.metastore.api.TimestampColumnStatsData; +import org.apache.hadoop.hive.metastore.api.MetaException; +import org.apache.hadoop.hive.metastore.columnstats.cache.TimestampColumnStatsDataInspector; +import org.apache.hadoop.hive.metastore.utils.MetaStoreServerUtils.ColStatsObjWithSourceInfo; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import static org.apache.hadoop.hive.metastore.columnstats.ColumnsStatsUtils.timestampInspectorFromStats; + +public class TimestampColumnStatsAggregator extends ColumnStatsAggregator implements +IExtrapolatePartStatus { + + private static final Logger LOG = LoggerFactory.getLogger(TimestampColumnStatsAggregator.class); + + @Override + public ColumnStatisticsObj aggregate(List colStatsWithSourceInfo, +List partNames, boolean areAllPartsFound) throws MetaException { +ColumnStatisticsObj statsObj = null; +String colType = null; +String colName = null; +// check if all the ColumnStatisticsObjs contain stats and all the ndv are +// bitvectors +boolean doAllPartitionContainStats = partNames.size() == colStatsWithSourceInfo.size(); +NumDistinctValueEstimator ndvEstimator = null; +for (ColStatsObjWithSourceInfo csp : colStatsWithSourceInfo) { + ColumnStatisticsObj cso = csp.getColStatsObj(); + if (statsObj == null) { +colName = cso.getColName(); +colType = cso.getColType(); +statsObj = 
ColumnStatsAggregatorFactory.newColumnStaticsObj(colName, colType, +cso.getStatsData().getSetField()); +LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName, doAllPartitionContainStats); + } + TimestampColumnStatsDataInspector timestampColumnStats = timestampInspectorFromStats(cso); + + if (timestampColumnStats.getNdvEstimator() == null) { +ndvEstimator = null; +break; + } else { +// check if all of the bit vectors can merge +NumDistinctValueEstimator estimator = timestampColumnStats.getNdvEstimator(); +if (ndvEstimator == null) { + ndvEstimator = estimator; +} else { + if (ndvEstimator.canMerge(estimator)) { +continue; + } else { +ndvEstimator = null; +break; + } +} + } +} +if (ndvEstimator != null) { + ndvEstimator = NumDistinctValueEstimatorFactory + .getEmptyNumDistinctValueEstimator(ndvEstimator); +} +LOG.debug("all of the bit vectors can merge for " + colName + " is " + (ndvEstimator != null)); +ColumnStatisticsData columnStatisticsData = new ColumnStatisticsData(); +if (doAllPartitionContainStats || colStatsWithSourceInfo.size() < 2) { + TimestampColumnStatsDataInspector aggregateData = null; + long lowerBound = 0; + lo
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=324914&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-324914 ]

ASF GitHub Bot logged work on HIVE-22239:
-
Author: ASF GitHub Bot
Created on: 08/Oct/19 08:03
Start Date: 08/Oct/19 08:03
Worklog Time Spent: 10m

Work Description: miklosgergely commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787#discussion_r332380055

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java
##
@@ -935,8 +942,11 @@ else if(colTypeLowerCase.equals(serdeConstants.SMALLINT_TYPE_NAME)){
   cs.setNumTrues(Math.max(1, numRows/2));
   cs.setNumFalses(Math.max(1, numRows/2));
   cs.setAvgColLen(JavaDataModel.get().primitive1());
-} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME) ||
-    colTypeLowerCase.equals(serdeConstants.TIMESTAMPLOCALTZ_TYPE_NAME)) {
+} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
+  cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp());
+  // epoch, seconds since epoch
+  cs.setRange(0, 2177452799L);

Review comment: Is this 2038-12-31 11:59:59 PM GMT? Why is it hard coded to that? If there is a reason for it, then I suggest introducing a more meaningful constant declaration, like this:

// Explanation why the timestamp range upper limit should be this date
private static final long TIMESTAMP_RANGE_UPPER_LIMIT =
    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss zzz").parse("2038-12-31 23:59:59 GMT").getTime() / 1000;

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 324914)
Time Spent: 20m (was: 10m)

> Scale data size using column value ranges
> -----------------------------------------
>
> Key: HIVE-22239
> URL: https://issues.apache.org/jira/browse/HIVE-22239
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: Jesus Camacho Rodriguez
> Assignee: Jesus Camacho Rodriguez
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-22239.01.patch, HIVE-22239.02.patch, HIVE-22239.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Currently, min/max values for columns are only used to determine whether a
> certain range filter falls out of range and thus filters all rows or none at
> all. If it does not, we just use a heuristic that the condition will filter
> 1/3 of the input rows. Instead of using that heuristic, we can use another
> one that assumes that data will be uniformly distributed across that range,
> and calculate the selectivity for the condition accordingly.
> This patch also includes the propagation of min/max column values from
> statistics to the optimizer for timestamp type.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
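The reviewer's question above can be checked with plain JDK classes. The sketch below (the class name is hypothetical) confirms that the hard-coded literal 2177452799L is the last second of 2038 in GMT, expressed as seconds since the epoch, which also agrees with the DATE upper limit of 25201 days (25201 * 86400 = 2177366400, plus 86399 seconds for 23:59:59):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class TimestampRangeLimit {

  // Seconds since the epoch for 2038-12-31 23:59:59 GMT, mirroring the
  // constant declaration suggested in the review.
  static long upperLimitSeconds() throws ParseException {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss zzz");
    return fmt.parse("2038-12-31 23:59:59 GMT").getTime() / 1000L;
  }

  public static void main(String[] args) throws ParseException {
    System.out.println(upperLimitSeconds()); // 2177452799
  }
}
```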
[jira] [Work logged] (HIVE-22239) Scale data size using column value ranges
[ https://issues.apache.org/jira/browse/HIVE-22239?focusedWorklogId=317871&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-317871 ]

ASF GitHub Bot logged work on HIVE-22239:
-
Author: ASF GitHub Bot
Created on: 24/Sep/19 21:47
Start Date: 24/Sep/19 21:47
Worklog Time Spent: 10m

Work Description: jcamachor commented on pull request #787: HIVE-22239
URL: https://github.com/apache/hive/pull/787

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 317871)
Remaining Estimate: 0h
Time Spent: 10m

> Scale data size using column value ranges
> -----------------------------------------
>
> Key: HIVE-22239
> URL: https://issues.apache.org/jira/browse/HIVE-22239
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Reporter: Jesus Camacho Rodriguez
> Assignee: Jesus Camacho Rodriguez
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-22239.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, min/max values for columns are only used to determine whether a
> certain range filter falls out of range and thus filters all rows or none at
> all. If it does not, we just use a heuristic that the condition will filter
> 1/3 of the input rows. Instead of using that heuristic, we can use another
> one that assumes that data will be uniformly distributed across that range,
> and calculate the selectivity for the condition accordingly.
> This patch also includes the propagation of min/max column values from
> statistics to the optimizer for timestamp type.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
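The uniform-distribution heuristic described in the issue can be sketched in a few lines. This is an illustration of the idea only, not Hive's actual estimator code; the class and method names are hypothetical:

```java
public class RangeSelectivity {

  /**
   * Estimated selectivity of the predicate "col <= value", assuming column
   * values are uniformly distributed over [min, max].
   */
  static double lessThanOrEqualSelectivity(double min, double max, double value) {
    if (value < min) {
      return 0.0;  // cut-off below the range: the filter removes every row
    }
    if (value >= max) {
      return 1.0;  // cut-off above the range: the filter removes nothing
    }
    // Uniform assumption: the kept fraction is the fraction of the
    // value range that lies at or below the cut-off.
    return (value - min) / (max - min);
  }

  public static void main(String[] args) {
    // Range [0, 100]: "col <= 25" keeps about a quarter of the rows,
    // instead of the fixed 1/3 the old heuristic would have assumed.
    System.out.println(lessThanOrEqualSelectivity(0, 100, 25)); // 0.25
  }
}
```

The two boundary branches correspond to the behavior Hive already had (all rows or none); the final expression is what this patch adds for cut-offs that fall inside the known min/max range.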