thomasrebele commented on code in PR #6293:
URL: https://github.com/apache/hive/pull/6293#discussion_r2841549249


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -184,91 +188,284 @@ public Double visitCall(RexCall call) {
     return selectivity;
   }
 
+  /**
+   * If the cast can be removed, just return its operand and adjust the 
boundaries if necessary.
+   *
+   * <p>
+   *   In Hive, if a value cannot be represented by the cast, the result of 
the cast is NULL,
+   *   and therefore cannot fulfill the predicate. So the possible range of 
the values
+   *   is limited by the range of possible values of the type.
+   * </p>
+   *
+   * <p>
+   *   Special care is taken to support the cast to DECIMAL(precision, scale):
+   *   The cast to DECIMAL rounds the value the same way as {@link 
RoundingMode#HALF_UP}.
+   *   The boundaries are adjusted accordingly, without changing the semantics 
of <code>inclusive</code>.
+   * </p>
+   *
+   * @param cast a RexCall of type {@link SqlKind#CAST}
+   * @param tableScan the table that provides the statistics
+   * @param boundaries indexes 0 and 1 are the boundaries of the range 
predicate;
+   *                   indexes 2 and 3, if they exist, will be set to the 
boundaries of the type range
+   * @param inclusive whether the respective boundary is inclusive or 
exclusive.
+   * @return the operand if the cast can be removed, otherwise the cast itself
+   */
+  private RexNode removeCastIfPossible(RexCall cast, HiveTableScan tableScan, 
float[] boundaries, boolean[] inclusive) {
+    RexNode op0 = cast.getOperands().getFirst();
+    if (!(op0 instanceof RexInputRef)) {
+      return cast;
+    }
+    int index = ((RexInputRef) op0).getIndex();
+    final List<ColStatistics> colStats = 
tableScan.getColStat(Collections.singletonList(index));
+    if (colStats.isEmpty()) {
+      return cast;
+    }
+
+    // we need to check that the possible values of the input to the cast are 
all within the type range of the cast
+    // otherwise the CAST introduces some modulo-like behavior (*)
+    ColStatistics colStat = colStats.getFirst();
+    ColStatistics.Range range = colStat.getRange();
+    if (range == null)
+      return cast;
+    if (range.minValue == null || Double.isNaN(range.minValue.doubleValue()))
+      return cast;
+    if (range.maxValue == null || Double.isNaN(range.maxValue.doubleValue()))
+      return cast;
+
+    String type = cast.getType().getSqlTypeName().getName();
+
+    double min;
+    double max;
+    switch (type.toLowerCase()) {
+    case serdeConstants.TINYINT_TYPE_NAME:
+      min = Byte.MIN_VALUE;
+      max = Byte.MAX_VALUE;
+      break;
+    case serdeConstants.SMALLINT_TYPE_NAME:
+      min = Short.MIN_VALUE;
+      max = Short.MAX_VALUE;
+      break;
+    case serdeConstants.INT_TYPE_NAME, "integer":
+      min = Integer.MIN_VALUE;
+      max = Integer.MAX_VALUE;
+      break;
+    case serdeConstants.BIGINT_TYPE_NAME, serdeConstants.TIMESTAMP_TYPE_NAME:
+      min = Long.MIN_VALUE;
+      max = Long.MAX_VALUE;
+      break;
+    case serdeConstants.FLOAT_TYPE_NAME:
+      min = -Float.MAX_VALUE;
+      max = Float.MAX_VALUE;
+      break;
+    case serdeConstants.DOUBLE_TYPE_NAME:
+      min = -Double.MAX_VALUE;
+      max = Double.MAX_VALUE;
+      break;
+    case serdeConstants.DECIMAL_TYPE_NAME:
+      min = -Double.MAX_VALUE;
+      max = Double.MAX_VALUE;
+      adjustBoundariesForDecimal(cast, boundaries, inclusive);
+      break;
+    default:
+      // unknown type, do not remove the cast
+      return cast;
+    }
+
+    // see (*)
+    if (range.minValue.doubleValue() < min)
+      return cast;
+    if (range.maxValue.doubleValue() > max)
+      return cast;
+
+    return op0;
+  }
+
+  /**
+   * Adjust the boundaries for a DECIMAL cast.
+   * <p>
+   * See {@link #removeCastIfPossible(RexCall, HiveTableScan, float[], 
boolean[])}
+   * for an explanation of the parameters.
+   */
+  private static void adjustBoundariesForDecimal(RexCall cast, float[] 
boundaries, boolean[] inclusive) {

Review Comment:
   That can indeed happen for some corner cases. It was handled by 
`FilterSelectivityEstimator#rangedSelectivity(KllFloatsSketch, float, float)`, 
which checks for the validity of the boundaries, and otherwise returns 0. If we 
want to continue with Guava's `Range`, we would need to extract the creation of 
the Range objects into a method that checks for valid boundaries, and otherwise 
creates an empty Range, e.g., `[0,0)`.
   
   When I did the refactoring, I decided against using Guava's Range, as a 
simple record class was enough for the purpose of the PR. @zabetak, if you want 
to continue with Guava's `Range`, let me know, I can make the necessary changes 
based on your branch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to