linliu-code opened a new pull request, #18792: URL: https://github.com/apache/hudi/pull/18792
### Change Logs This PR fixes two related correctness bugs in the column-stats metadata table that cause silent zero-row query results when data-skipping is enabled: - **#18754** — A single `NaN` value in a `FLOAT`/`DOUBLE` column corrupts the persisted col-stats record for that file. parquet-mr clears the underlying parquet stats (`hasNonNullValue=false`) but leaves min/max at their default `0.0`. Hudi was reading those sentinel values and persisting `min=0.0, max=0.0` to the MDT. The 1.x read path then used those fabricated bounds, pruning the file at the partition-stats level and dropping rows from any predicate involving that partition — including predicates on completely unrelated columns of files that contain no NaN at all. - **#18755** — When a Parquet column omits stats entirely (e.g. one value exceeds parquet-mr's stats truncation threshold ~1KB for binary/string), the col-stats writer was mis-recording `nullCount = valueCount`, i.e. "all rows in this file are null". Any non-null predicate then silently skips that file. The existing comment at the bug site intended to handle "column contains only null values" but conflated it with the "stats absent" case. Both bugs cause silent data loss at query time — no error, no warning, just an incomplete result set. ### Root cause `hudi-hadoop-common/.../ParquetUtils.readColumnStatsFromMetadata` did two things that turned out to be unsafe in the presence of unreliable parquet stats: 1. It called `stats.genericGetMin()/getMax()` without checking `stats.hasNonNullValue()`. parquet-mr uses that flag specifically to signal "min/max are unreliable, do not trust them" — most commonly hit when any NaN appears in a `DOUBLE`/`FLOAT` column. 2. It did `stats.isEmpty() ? columnChunkMetaData.getValueCount() : stats.getNumNulls()`. `isEmpty()` is `true` both when the column is genuinely all-null and when parquet omitted stats entirely. The two cases need opposite handling. The error then propagated downstream: even after `ParquetUtils` was fixed to report null min/max for unreliable stats, the per-partition aggregator `HoodieColumnRangeMetadata.merge` silently dropped the null contribution from the unreliable file. This left `PartitionStatsIndexSupport.prunePartitions` with a partial bound that pruned the partition based on the *other* files' stats — which is the silent zero-row symptom in the #18754 reproducer. The read-side filter generator in `DataSkippingUtils` also needed updating: a predicate like `colA_maxValue > X` evaluates to null when the file has null max stat, and Spark treats null as false in `WHERE`, so the file gets pruned by the data-skipping index filter. Wrapping with `OR colA_minValue IS NULL` makes it scanned instead. ### Changes - `ParquetUtils.readColumnStatsFromMetadata` — gate min/max on `hasNonNullValue()`; gate `nullCount = valueCount` on `isNumNullsSet()`. - `HoodieTableMetadataUtil.collectColumnRangeMetadata` — skip NaN values in the streaming min/max accumulator (mirrors parquet-mr's `DoubleStatistics` policy). `Double.compare(NaN, x)` returns `+1` in Java, so an unguarded comparison would let a single NaN poison the running max forever. - `HoodieTableMetadataUtil.mergeColumnStatsRecords` — distinguish "all-null column" from "stats unreliable" when merging records during MDT compaction. - `HoodieColumnRangeMetadata.merge` — the per-file → per-partition merge helper used by `PartitionStatsIndexSupport`. Apply the same unreliability propagation here so partition-level aggregates reflect the worst-case unknown when any contributing file has unreliable stats. **This was the missing piece** that kept the read-side bug visible even after the other writer fixes were in place. - `DataSkippingUtils.scala` — wrap every min/max-using skip predicate (`<`, `>`, `<=`, `>=`, EqualTo via `genColumnValuesEqualToExpression`, In, InSet, StartsWith, the various `Not(...)` cases) with `OR colMin IS NULL` so a file with stats marked unreliable is always scanned rather than pruned by null-propagation. - `SparkBaseIndexSupport.scala` — in `getCandidateFiles`, augment the assembled index filter with a global "any column referenced in the query has null min stat" clause. Belt-and-suspenders against any predicate case the per-predicate wrap might miss. ### Impact / Risk The fixes are strictly more conservative than the prior behavior: files with unreliable stats are now scanned instead of silently pruned. Tables that don't contain NaN values and don't have stats-truncated columns are unaffected. Performance impact for those tables is zero (the new gates short-circuit when stats are reliable). For tables that do hit either condition, queries previously returning silently-wrong (incomplete) results now return correct results; pruning is less aggressive on the file(s) that triggered the unreliable-stats signal. ### Impact Describe any public API changes. No public API change. Internal write/read code only. ### Risk level Medium. The merge-helper change (`HoodieColumnRangeMetadata.merge`) is in a path shared by partition-stats aggregation and any other caller that merges per-file ranges; the semantics shift from "drop null" to "propagate null when stats are unreliable, drop null when column is genuinely all-null". The distinguishing check is `nullCount == valueCount`, which is the existing convention for the all-null sentinel — so no API change, only behavior alignment with the documented contract. ### Documentation Update No documentation changes required. The bugs were silent regressions against the existing contract that "data-skipping should never change query results". ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Change Logs and Impact were stated clearly - [x] Adequate tests were added if applicable - [x] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
