zabetak commented on code in PR #6293:
URL: https://github.com/apache/hive/pull/6293#discussion_r2918330225
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -184,91 +189,383 @@ public Double visitCall(RexCall call) {
return selectivity;
}
+ /**
+ * Return whether the expression is a removable cast based on stats and type bounds.
+ *
+ * <p>
+ * In Hive, if a value cannot be represented by the cast, the result of the cast is NULL,
+ * and therefore cannot fulfill the predicate. So the possible range of the values
+ * is limited by the range of possible values of the type.
+ * </p>
Review Comment:
Thanks for the additions in the Javadoc. I have some small follow-up questions regarding the `CAST` categories.
In the first category, the Javadoc says that "the result of the cast is `NULL`, and therefore cannot fulfill the predicate". However, the cast of a `TINYINT` to `SMALLINT` does not return `NULL`, and the same holds for most "widening" casts. How does this example relate to this category?
In addition, this method also deals with the case where the value can be represented by the cast (and the cast is thus removable), so maybe the Javadoc should list this category as well.
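To make the distinction concrete, the removability check implied by both categories could be sketched as follows. This is a hypothetical standalone helper (the class and method names are mine, not Hive's `FilterSelectivityEstimator` code): a cast never produces `NULL` when the column's value range, known from stats or from the source type's bounds, fits inside the target type's range.

```java
public class CastRemovability {

    // Hypothetical helper: returns true if [colMin, colMax] fits within
    // [typeMin, typeMax], i.e. the cast can never produce NULL and is
    // therefore removable for selectivity purposes.
    static boolean isRemovableCast(long colMin, long colMax,
                                   long typeMin, long typeMax) {
        return colMin >= typeMin && colMax <= typeMax;
    }

    public static void main(String[] args) {
        // Widening: a TINYINT column cast to SMALLINT is always removable.
        System.out.println(isRemovableCast(-128, 127, -32768, 32767));
        // Narrowing, but stats bound the values inside the target range:
        // also removable.
        System.out.println(isRemovableCast(0, 100, -128, 127));
        // Narrowing with possible overflow: the cast yields NULL for some
        // values, so it limits the value range instead of being removable.
        System.out.println(isRemovableCast(0, 1000, -128, 127));
    }
}
```

Under this framing, the widening `TINYINT` to `SMALLINT` example belongs to the removable category rather than the NULL-producing one.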
##########
ql/src/test/org/apache/hadoop/hive/ql/optimizer/calcite/stats/TestFilterSelectivityEstimator.java:
##########
@@ -511,6 +605,428 @@ public void testComputeRangePredicateSelectivityNotBetweenWithNULLS() {
doReturn(Collections.singletonList(stats)).when(tableMock).getColStat(Collections.singletonList(0));
RexNode filter = REX_BUILDER.makeCall(HiveBetween.INSTANCE, boolTrue, inputRef0, int1, int3);
FilterSelectivityEstimator estimator = new FilterSelectivityEstimator(scan, mq);
- Assert.assertEquals(0.55, estimator.estimateSelectivity(filter), DELTA);
+ // only the values 4, 5, 6, 7 fulfill the condition NOT BETWEEN 1 AND 3
+ // (the NULL values do not fulfill the condition)
+ Assert.assertEquals(0.2, estimator.estimateSelectivity(filter), DELTA);
+ }
+
+ @Test
+ public void testRangePredicateCastIntegerValuesInsideTypeRange() {
+ // use VALUES, whose values fit into every tested integer type,
+ // so the cast is removable and the selectivity is unaffected
+ useFieldWithValues("f_tinyint", VALUES, KLL);
+ checkSelectivity(3 / 13.f, ge(cast("f_tinyint", TINYINT), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_tinyint", SMALLINT), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_tinyint", INTEGER), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_tinyint", BIGINT), int5));
+
+ useFieldWithValues("f_smallint", VALUES, KLL);
+ checkSelectivity(3 / 13.f, ge(cast("f_smallint", TINYINT), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_smallint", SMALLINT), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_smallint", INTEGER), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_smallint", BIGINT), int5));
+
+ useFieldWithValues("f_integer", VALUES, KLL);
+ checkSelectivity(3 / 13.f, ge(cast("f_integer", TINYINT), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_integer", SMALLINT), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_integer", INTEGER), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_integer", BIGINT), int5));
+
+ useFieldWithValues("f_bigint", VALUES, KLL);
+ checkSelectivity(3 / 13.f, ge(cast("f_bigint", TINYINT), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_bigint", SMALLINT), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_bigint", INTEGER), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_bigint", BIGINT), int5));
+ }
+
+ @Test
+ public void testRangePredicateCastIntegerValuesOutsideTypeRange() {
+ // use VALUES2, even if the tested types cannot represent its values;
+ // we're only interested in whether the cast to a smaller integer type results in the default selectivity
+ useFieldWithValues("f_tinyint", VALUES2, KLL2);
+ checkSelectivity(16 / 28.f, ge(cast("f_tinyint", TINYINT), int5));
+ checkSelectivity(18 / 28.f, ge(cast("f_tinyint", SMALLINT), int5));
+ checkSelectivity(20 / 28.f, ge(cast("f_tinyint", INTEGER), int5));
+ checkSelectivity(20 / 28.f, ge(cast("f_tinyint", BIGINT), int5));
+
+ useFieldWithValues("f_smallint", VALUES2, KLL2);
+ checkSelectivity(1 / 3.f, ge(cast("f_smallint", TINYINT), int5));
+ checkSelectivity(18 / 28.f, ge(cast("f_smallint", SMALLINT), int5));
+ checkSelectivity(20 / 28.f, ge(cast("f_smallint", INTEGER), int5));
+ checkSelectivity(20 / 28.f, ge(cast("f_smallint", BIGINT), int5));
+
+ useFieldWithValues("f_integer", VALUES2, KLL2);
+ checkSelectivity(1 / 3.f, ge(cast("f_integer", TINYINT), int5));
+ checkSelectivity(1 / 3.f, ge(cast("f_integer", SMALLINT), int5));
+ checkSelectivity(20 / 28.f, ge(cast("f_integer", INTEGER), int5));
+ checkSelectivity(20 / 28.f, ge(cast("f_integer", BIGINT), int5));
+
+ useFieldWithValues("f_bigint", VALUES2, KLL2);
+ checkSelectivity(1 / 3.f, ge(cast("f_bigint", TINYINT), int5));
+ checkSelectivity(1 / 3.f, ge(cast("f_bigint", SMALLINT), int5));
+ checkSelectivity(1 / 3.f, ge(cast("f_bigint", INTEGER), int5));
+ checkSelectivity(20 / 28.f, ge(cast("f_bigint", BIGINT), int5));
+ }
+
+ @Test
+ public void testRangePredicateTypeMatrix() {
+ // checks many possible combinations of types
+ List<RelDataTypeField> fields = tableType.getFieldList();
+ for (var srcField : fields) {
+ if (isTemporal(srcField.getType())) {
+ continue;
+ }
+
+ useFieldWithValues(srcField.getName(), VALUES, KLL);
+
+ for (var tgt : fields) {
+ try {
+ if (isTemporal(tgt.getType())) {
+ continue;
+ }
+
+ RexNode expr = cast(srcField.getName(), tgt.getType());
+ checkBetweenSelectivity(3, VALUES.length, VALUES.length, expr, 5, 7);
+ } catch (AssertionError e) {
+ throw new AssertionError("Error when casting from " + srcField.getType() + " to " + tgt.getType(), e);
+ }
+ }
+ }
+ }
+
+ private boolean isTemporal(RelDataType type) {
+ return type.getSqlTypeName() == TIMESTAMP || type.getSqlTypeName() == SqlTypeName.DATE;
+ }
+
+ @Test
+ public void testRangePredicateWithCast() {
+ useFieldWithValues("f_numeric", VALUES, KLL);
+ checkSelectivity(3 / 13.f, ge(cast("f_numeric", TINYINT), int5));
+ checkSelectivity(10 / 13.f, lt(cast("f_numeric", TINYINT), int5));
+ checkSelectivity(2 / 13.f, gt(cast("f_numeric", TINYINT), int5));
+ checkSelectivity(11 / 13.f, le(cast("f_numeric", TINYINT), int5));
+
+ checkSelectivity(12 / 13f, ge(cast("f_numeric", TINYINT), int2));
+ checkSelectivity(1 / 13f, lt(cast("f_numeric", TINYINT), int2));
+ checkSelectivity(5 / 13f, gt(cast("f_numeric", TINYINT), int2));
+ checkSelectivity(8 / 13f, le(cast("f_numeric", TINYINT), int2));
+
+ // check some types
+ checkSelectivity(3 / 13.f, ge(cast("f_numeric", INTEGER), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_numeric", SMALLINT), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_numeric", BIGINT), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_numeric", SqlTypeName.FLOAT), int5));
+ checkSelectivity(3 / 13.f, ge(cast("f_numeric", SqlTypeName.DOUBLE), int5));
+ }
+
+ @Test
+ public void testRangePredicateWithCast2() {
+ useFieldWithValues("f_numeric", VALUES2, KLL2);
+ RelDataType decimal3s1 = decimalType(3, 1);
+ checkSelectivity(4 / 28.f, ge(cast("f_numeric", decimal3s1), literalFloat(1)));
+
+ // values from -99.94999 to 99.94999 (both inclusive)
+ checkSelectivity(7 / 28.f, lt(cast("f_numeric", decimal3s1), literalFloat(100)));
+ checkSelectivity(7 / 28.f, le(cast("f_numeric", decimal3s1), literalFloat(100)));
+ checkSelectivity(0 / 28.f, gt(cast("f_numeric", decimal3s1), literalFloat(100)));
+ checkSelectivity(0 / 28.f, ge(cast("f_numeric", decimal3s1), literalFloat(100)));
+
+ RelDataType decimal4s1 = decimalType(4, 1);
+ checkSelectivity(10 / 28.f, lt(cast("f_numeric", decimal4s1), literalFloat(100)));
+ checkSelectivity(20 / 28.f, le(cast("f_numeric", decimal4s1), literalFloat(100)));
+ checkSelectivity(3 / 28.f, gt(cast("f_numeric", decimal4s1), literalFloat(100)));
+ checkSelectivity(13 / 28.f, ge(cast("f_numeric", decimal4s1), literalFloat(100)));
+
+ RelDataType decimal2s1 = decimalType(2, 1);
+ checkSelectivity(2 / 28.f, lt(cast("f_numeric", decimal2s1), literalFloat(100)));
+ checkSelectivity(2 / 28.f, le(cast("f_numeric", decimal2s1), literalFloat(100)));
+ checkSelectivity(0 / 28.f, gt(cast("f_numeric", decimal2s1), literalFloat(100)));
+ checkSelectivity(0 / 28.f, ge(cast("f_numeric", decimal2s1), literalFloat(100)));
+
+ // expected: 100_000f
+ RelDataType decimal7s1 = decimalType(7, 1);
+ checkSelectivity(1 / 28.f, gt(cast("f_numeric", decimal7s1), literalFloat(10000)));
+
+ // expected: 10_000f, 100_000f, because CAST(1_000_000 AS DECIMAL(7,1)) = NULL, and similar for even larger values
+ checkSelectivity(2 / 28.f, ge(cast("f_numeric", decimal7s1), literalFloat(9999)));
+ checkSelectivity(2 / 28.f, ge(cast("f_numeric", decimal7s1), literalFloat(10000)));
+
+ // expected: 100_000f
+ checkSelectivity(1 / 28.f, gt(cast("f_numeric", decimal7s1), literalFloat(10000)));
+ checkSelectivity(1 / 28.f, gt(cast("f_numeric", decimal7s1), literalFloat(10001)));
+
+ // expected: 1f, 10f, 99.94998f, 99.94999f
+ checkSelectivity(4 / 28.f, ge(cast("f_numeric", decimal3s1), literalFloat(1)));
+ checkSelectivity(3 / 28.f, gt(cast("f_numeric", decimal3s1), literalFloat(1)));
+ // expected: -99.94999f, -99.94998f, 0f, 1f
+ checkSelectivity(4 / 28.f, le(cast("f_numeric", decimal3s1), literalFloat(1)));
+ checkSelectivity(3 / 28.f, lt(cast("f_numeric", decimal3s1), literalFloat(1)));
+ }
+
+ private void checkTimeFieldOnMidnightTimestamps(RexNode field) {
+ // note: use only values from VALUES_TIME that specify a date without hh:mm:ss!
+ checkSelectivity(7 / 7.f, ge(field, literalTimestamp("2020-11-01")));
+ checkSelectivity(5 / 7.f, ge(field, literalTimestamp("2020-11-03")));
+ checkSelectivity(1 / 7.f, ge(field, literalTimestamp("2020-11-07")));
+
+ checkSelectivity(6 / 7.f, gt(field, literalTimestamp("2020-11-01")));
+ checkSelectivity(4 / 7.f, gt(field, literalTimestamp("2020-11-03")));
+ checkSelectivity(0 / 7.f, gt(field, literalTimestamp("2020-11-07")));
+
+ checkSelectivity(1 / 7.f, le(field, literalTimestamp("2020-11-01")));
+ checkSelectivity(3 / 7.f, le(field, literalTimestamp("2020-11-03")));
+ checkSelectivity(7 / 7.f, le(field, literalTimestamp("2020-11-07")));
+
+ checkSelectivity(0 / 7.f, lt(field, literalTimestamp("2020-11-01")));
+ checkSelectivity(2 / 7.f, lt(field, literalTimestamp("2020-11-03")));
+ checkSelectivity(6 / 7.f, lt(field, literalTimestamp("2020-11-07")));
+ }
+
+ private void checkTimeFieldOnIntraDayTimestamps(RexNode field) {
+ checkSelectivity(3 / 7.f, ge(field, literalTimestamp("2020-11-05T11:23:45Z")));
+ checkSelectivity(2 / 7.f, gt(field, literalTimestamp("2020-11-05T11:23:45Z")));
+ checkSelectivity(5 / 7.f, le(field, literalTimestamp("2020-11-05T11:23:45Z")));
+ checkSelectivity(4 / 7.f, lt(field, literalTimestamp("2020-11-05T11:23:45Z")));
+ }
+
+ @Test
+ public void testRangePredicateOnTimestamp() {
+ useFieldWithValues("f_timestamp", VALUES_TIME, KLL_TIME);
+ checkTimeFieldOnMidnightTimestamps(currentInputRef);
+ checkTimeFieldOnIntraDayTimestamps(currentInputRef);
+ }
+
+ @Test
+ public void testRangePredicateOnTimestampWithCast() {
+ useFieldWithValues("f_timestamp", VALUES_TIME, KLL_TIME);
+ RexNode expr1 = cast("f_timestamp", SqlTypeName.DATE);
+ checkTimeFieldOnMidnightTimestamps(expr1);
+ checkTimeFieldOnIntraDayTimestamps(expr1);
+
+ RexNode expr2 = cast("f_timestamp", SqlTypeName.TIMESTAMP);
+ checkTimeFieldOnMidnightTimestamps(expr2);
+ checkTimeFieldOnIntraDayTimestamps(expr2);
+ }
+
+ @Test
+ public void testRangePredicateOnDate() {
+ useFieldWithValues("f_date", VALUES_TIME, KLL_TIME);
+ checkTimeFieldOnMidnightTimestamps(currentInputRef);
+
+ // it does not make sense to compare with "2020-11-05T11:23:45Z",
+ // as that value would not be stored as-is in a date column, but as "2020-11-05" instead
+ }
+
+ @Test
+ public void testRangePredicateOnDateWithCast() {
+ useFieldWithValues("f_date", VALUES_TIME, KLL_TIME);
+ checkTimeFieldOnMidnightTimestamps(cast("f_date", SqlTypeName.DATE));
+ checkTimeFieldOnMidnightTimestamps(cast("f_date", SqlTypeName.TIMESTAMP));
+
+ // it does not make sense to compare with "2020-11-05T11:23:45Z",
+ // as that value would not be stored as-is in a date column, but as "2020-11-05" instead
+ }
+
+ @Test
+ public void testBetweenWithCastToTinyIntCheckRounding() {
+ useFieldWithValues("f_numeric", VALUES3, KLL3);
+ float total = VALUES3.length;
+ float universe = 10; // the number of values that "survive" the cast
+ RexNode cast = cast("f_numeric", TINYINT);
+ // check rounding of positive numbers
+ checkBetweenSelectivity(3, universe, total, cast, 0, 10);
+ checkBetweenSelectivity(4, universe, total, cast, 0, 11);
+ checkBetweenSelectivity(4, universe, total, cast, 10, 20);
+ checkBetweenSelectivity(1, universe, total, cast, 11, 20);
+
+ // check rounding of negative numbers
+ checkBetweenSelectivity(4, universe, total, cast, -20, -10);
+ checkBetweenSelectivity(1, universe, total, cast, -20, -11);
+ checkBetweenSelectivity(3, universe, total, cast, -10, 0);
+ checkBetweenSelectivity(4, universe, total, cast, -11, 0);
+ }
+
+ @Test
+ public void testBetweenWithCastToTinyInt() {
+ useFieldWithValues("f_numeric", VALUES3, KLL3);
+ float total = VALUES3.length;
+ float universe = 10; // the number of values that "survive" the cast
+ RexNode cast = cast("f_numeric", TINYINT);
+ checkBetweenSelectivity(5, universe, total, cast, 0, 1e20f);
+ checkBetweenSelectivity(5, universe, total, cast, -1e20f, 0);
+ checkBetweenSelectivity(0, universe, total, cast, 100f, 0f);
+ }
+
+ @Test
+ public void testBetweenWithCastToSmallInt() {
+ useFieldWithValues("f_numeric", VALUES3, KLL3);
+ float total = VALUES3.length;
+ float universe = 14; // the number of values that "survive" the cast
+ RexNode cast = cast("f_numeric", SMALLINT);
+ checkBetweenSelectivity(7, universe, total, cast, 0, 1e20f);
+ checkBetweenSelectivity(7, universe, total, cast, -1e20f, 0);
+ checkBetweenSelectivity(0, universe, total, cast, 100f, 0f);
+ }
+
+ @Test
+ public void testBetweenWithCastToInteger() {
+ useFieldWithValues("f_numeric", VALUES3, KLL3);
+ float total = VALUES3.length;
+ float universe = 18; // the number of values that "survive" the cast
+ RexNode cast = cast("f_numeric", INTEGER);
+ checkBetweenSelectivity(9, universe, total, cast, 0, 1e20f);
+ checkBetweenSelectivity(9, universe, total, cast, -1e20f, 0);
+ checkBetweenSelectivity(0, universe, total, cast, 100f, 0f);
+ }
+
+ @Test
+ public void testBetweenWithCastToBigInt() {
Review Comment:
The main point that I wanted to highlight is the different coverage for between predicates and range predicates when using the `VALUES2` and `VALUES3` datasets.
Roughly, I would categorize the existing tests involving casts using the tuple `(DATASET, OPERATORS, COLUMN, CAST_TYPES)`. Based on this, we have the following categories:
```
A. (VALUES3, BETWEEN, f_numeric, [TINYINT, SMALLINT, INT, BIGINT])
B. (VALUES2, GT, [f_tinyint, f_smallint, f_int, f_bigint], [TINYINT, SMALLINT, INT, BIGINT])
C. (VALUES2, [GT, GE, LT, LE], f_numeric, [DECIMAL(2, 1), DECIMAL(3, 1), DECIMAL(4, 1), DECIMAL(7, 1)])
D. (VALUES2, BETWEEN, f_numeric, [DECIMAL(2, 1), DECIMAL(3, 1), DECIMAL(4, 1), DECIMAL(7, 1)])
E. (VALUES2, BETWEEN, f_numeric, TINYINT)
```
Basically, I was expecting that for each DATASET, COLUMN, and CAST_TYPES there exists a category with a BETWEEN operator and a range operator. This more or less holds for categories C and D, but I don't see a dual for A, B, and E. Maybe we don't really need a dual for A, B, and E, but it's not easy to understand why from the existing comments and naming scheme.
Anyway, I recognize that there are already many tests covering various edge cases, so the PR has sufficient coverage to get in. Naming and grouping are details that we can certainly ignore, especially if addressing them would lead to several more days of work.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]