thomasrebele commented on code in PR #603:
URL: https://github.com/apache/hive/pull/603#discussion_r2630431180
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java:
##########
@@ -1524,6 +1531,77 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     return null;
   }
+
+  /**
+   * If possible, sets the min / max value for the column based on the
+   * aggregate function being calculated and its input.
+   */
+  private static void computeAggregateColumnMinMax(ColStatistics cs, HiveConf conf,
+      AggregationDesc agg, String aggType, Statistics parentStats) throws SemanticException {
+    if (agg.getParameters() != null && agg.getParameters().size() == 1) {
+      ColStatistics parentCS = StatsUtils.getColStatisticsFromExpression(
+          conf, parentStats, agg.getParameters().get(0));
+      if (parentCS != null && parentCS.getRange() != null &&
+          parentCS.getRange().minValue != null && parentCS.getRange().maxValue != null) {
+        long valuesCount = agg.getDistinct() ?
+            parentCS.getCountDistint() :
+            parentStats.getNumRows() - parentCS.getNumNulls();
+        Range range = parentCS.getRange();
+        // Get the aggregate function matching the name in the query.
+        GenericUDAFResolver udaf =
+            FunctionRegistry.getGenericUDAFResolver(agg.getGenericUDAFName());
+        if (udaf instanceof GenericUDAFCount) {
+          cs.setRange(new Range(0, valuesCount));
+        } else if (udaf instanceof GenericUDAFMax || udaf instanceof GenericUDAFMin) {
+          cs.setRange(new Range(range.minValue, range.maxValue));
+        } else if (udaf instanceof GenericUDAFSum) {
+          switch (aggType) {
+          case serdeConstants.TINYINT_TYPE_NAME:
+          case serdeConstants.SMALLINT_TYPE_NAME:
+          case serdeConstants.DATE_TYPE_NAME:
+          case serdeConstants.INT_TYPE_NAME:
+          case serdeConstants.BIGINT_TYPE_NAME:
+            long maxValueLong = range.maxValue.longValue();
+            long minValueLong = range.minValue.longValue();
+            // If min value is less or equal to max value (legal)
+            if (minValueLong <= maxValueLong && minValueLong >= 0) {
+              // min = minValue, max = (minValue + maxValue) * 0.5 * parentNumRows
+              cs.setRange(new Range(
+                  minValueLong,
+                  StatsUtils.safeMult(
+                      StatsUtils.safeMult(StatsUtils.safeAdd(minValueLong, maxValueLong), 0.5),
+                      valuesCount)));
+            }
+            break;
+          case serdeConstants.FLOAT_TYPE_NAME:
+          case serdeConstants.DOUBLE_TYPE_NAME:
+            double maxValueDouble = range.maxValue.doubleValue();
+            double minValueDouble = range.minValue.doubleValue();
+            // If min value is less or equal to max value (legal)
+            if (minValueDouble <= maxValueDouble && minValueDouble >= 0) {
+              // min = minValue, max = (minValue + maxValue) * 0.5 * parentNumRows
Review Comment:
What's the logic behind this? Wouldn't it be `max * parentNumRows`? Take for example the values [1, 1000, 1001, 1002, 1003]. If we sum them, we get 4007. However, the formula evaluates to (1+1003)*0.5*5 = 2510, which is smaller than 4007.
I've checked the [most recent version of the file](https://github.com/apache/hive/blob/ee7138bf7a1a1ee1de07fa2243fe947a627268cb/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L1809), and it is still the same.
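For illustration, the arithmetic above can be checked with a small standalone sketch (the class and method names here are hypothetical, not part of Hive):

```java
// Hypothetical standalone check of the two formulas discussed above; not Hive code.
public class SumEstimateCheck {

  // The formula in the patch: the max of SUM estimated as (min + max) * 0.5 * valuesCount.
  static double midpointEstimate(long min, long max, long count) {
    return (min + max) * 0.5 * count;
  }

  // The bound suggested in the comment: max * valuesCount, which SUM over
  // non-negative values can never exceed.
  static double maxTimesCount(long max, long count) {
    return (double) max * count;
  }

  public static void main(String[] args) {
    long[] values = {1, 1000, 1001, 1002, 1003};
    long actual = 0;
    for (long v : values) {
      actual += v;
    }
    System.out.println(actual);                        // 4007
    System.out.println(midpointEstimate(1, 1003, 5));  // 2510.0, below the actual sum
    System.out.println(maxTimesCount(1003, 5));        // 5015.0, a valid upper bound
  }
}
```

If the intent of the patch was an expected value under a uniform-distribution assumption, the midpoint formula is defensible; but as the *maximum* of a range it can undershoot the real sum, which is the point of the example.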
Wdyt, @jcamachor?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]