Re: [PR] fix: Fix col-stats corruption from NaN values and stats-truncation [hudi]

via GitHub Wed, 20 May 2026 06:15:29 -0700


hudi-agent commented on code in PR #18792:
URL: https://github.com/apache/hudi/pull/18792#discussion_r3274197564



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala:
##########
@@ -494,7 +513,15 @@ object ColumnStatsExpressionUtils {
     val minValueExpr = targetExprBuilder.apply(genColMinValueExpr(colName))
     val maxValueExpr = targetExprBuilder.apply(genColMaxValueExpr(colName))
     // Only case when column C contains _only_ value V is when min(C) = V AND 
max(c) = V
-    And(EqualTo(minValueExpr, value), EqualTo(maxValueExpr, value))
+    // NOTE: this is used inside Not(...) predicates (e.g. `colA != V`, `colA 
NOT IN (...)`). For
+    //       those, treating "stats unavailable" as "value V is the only 
value" would PRUNE the
+    //       file (Not(true) = false). So we treat the inner expression as 
false (file is
+    //       guaranteed not to match the negation) when stats are missing, 
which means
+    //       Not(thisExpr) becomes Not(false) = true → file IS scanned. We 
achieve that by
+    //       returning false (And with false) when stats are unavailable, 
which Not then flips.
+    val ifStatsAvailable = And(EqualTo(minValueExpr, value), 
EqualTo(maxValueExpr, value))

Review Comment:
   🤖 The new `IsNotNull(min)` wrap here changes 
`genColumnOnlyValuesEqualToExpression` from a self-contained "only-value-V" 
predicate into one whose semantics depend on the caller wrapping it in 
`Not(...)` (per the NOTE). All three current callers do, so this is fine today, 
but the helper's name no longer matches what it computes. Could you rename it 
(e.g. `genNegatedOnlyValuesEqualToExpression`) or push the `Not` and the 
`IsNotNull` guard together into a single helper so the precondition is encoded 
in the name/signature rather than only in the comment?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala:
##########
@@ -479,13 +482,29 @@ object ColumnStatsExpressionUtils {
   @inline def genColNumNullsExpr(colName: String): Expression = 
sparkAdapter.getExpressionFromColumn(col(getNullCountColumnNameFor(colName)))
   @inline def genColValueCountExpr: Expression = 
sparkAdapter.getExpressionFromColumn(col(getValueCountColumnNameFor))
 
+  /**
+   * Wraps a min/max-based skipping predicate with a null-safety check. When 
parquet-mr reports
+   * stats as unreliable (e.g. any NaN in a FLOAT/DOUBLE column, or a 
column-chunk where stats
+   * were omitted due to value-size truncation), Hudi's writer persists 
min/max as null so the
+   * stats can't be trusted. A bare `colA_maxValue > X` predicate would 
evaluate to null in that
+   * case, which Spark treats as false in WHERE — silently pruning a file that 
may contain
+   * matching rows. Wrapping with `OR colA_minValue IS NULL` makes the file 
pass through and be
+   * scanned, preserving correctness at the cost of skipping an opportunity. 
min/max are written
+   * jointly (both null or both non-null), so checking min is sufficient.
+   */
+  @inline def withStatsAvailable(skipExpr: Expression, colName: String): 
Expression = {

Review Comment:
   🤖 nit: `withStatsAvailable` reads as "apply this only when stats are 
available", but the method actually does the opposite — it adds an OR-null 
guard so the file passes through *when stats are unavailable*. Something like 
`withNullStatsGuard` or `guardedByStatsReliability` would make the semantics 
clearer at every call site.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -253,6 +253,37 @@ private static class ColumnStats {
     }
   }
 
+  private static boolean isFloatingPointNaN(Object value) {
+    if (value instanceof Double) {
+      return Double.isNaN((Double) value);
+    }
+    if (value instanceof Float) {
+      return Float.isNaN((Float) value);
+    }
+    return false;
+  }
+
+  /**
+   * Returns true if the given min/max value is null in a way that indicates 
stats are
+   * unreliable (rather than the column being legitimately all-null). When a 
column is
+   * truly all-null, nullCount == valueCount and a null min/max is the correct 
answer.
+   * When stats are unreliable (NaN poisoning, value-size truncation), 
nullCount is
+   * lower than valueCount but min/max are still null — that's the case 
partition-stats
+   * aggregation must NOT silently drop.
+   */
+  private static boolean isStatsUnreliable(HoodieMetadataColumnStats stats, 
Comparable unwrappedValue) {
+    if (unwrappedValue != null) {
+      return false;
+    }
+    Long nullCount = stats.getNullCount();
+    Long valueCount = stats.getValueCount();
+    if (nullCount == null || valueCount == null) {
+      return true;
+    }
+    // All-null column: stats are legitimately empty for min/max.
+    return nullCount != valueCount;
+  }

Review Comment:
   🤖 I think `nullCount != valueCount` is doing a reference comparison on boxed 
`Long` objects here — 
`HoodieMetadataColumnStats.getNullCount()`/`getValueCount()` return `Long` (the 
avsc declares `["null", "long"]`), and Java's `Long.valueOf` only caches values 
in `[-128, 127]`. So for any nullCount/valueCount > 127 (i.e. virtually every 
real file), `Long(1000) != Long(1000)` evaluates to true even when the values 
are equal, and a legitimately all-null column would be misclassified as 
`unreliable`. Could you switch to `!nullCount.equals(valueCount)` (or 
`nullCount.longValue() != valueCount.longValue()`)? The sibling helper in 
`HoodieColumnRangeMetadata` works because the fields are primitive `long` there 
— worth aligning. @yihua could you verify this once?
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/stats/HoodieColumnRangeMetadata.java:
##########
@@ -143,6 +157,14 @@ public static <T extends Comparable<T>> 
HoodieColumnRangeMetadata<T> merge(
     return new HoodieColumnRangeMetadata<>(filePath, columnName, min, max, 
nullCount, valueCount, totalSize, totalUncompressedSize, 
left.getValueMetadata());
   }
 
+  private static <T extends Comparable<T>> boolean 
isStatsUnreliable(HoodieColumnRangeMetadata<T> range, T value) {
+    if (value != null) {
+      return false;
+    }
+    // All-null column: stats are legitimately empty for min/max — safe to 
ignore in merge.

Review Comment:
   🤖 nit: the comment describes the *false* case (all-null column → reliable) 
but sits directly before the return that evaluates to `true` when unreliable 
(`nullCount != valueCount`). Could you flip the framing so it describes what 
`true` means — e.g. `// Unreliable: min/max are null but the column is not 
all-null (nullCount < valueCount).`
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix: Fix col-stats corruption from NaN values and stats-truncation [hudi]

Reply via email to