linliu-code commented on issue #18755:
URL: https://github.com/apache/hudi/issues/18755#issuecomment-4465130650
## Update — exact threshold confirmed via parquet-mr source + empirical
bisection
Following up on the root cause. I bisected the trigger condition end-to-end.
The exact threshold is:
> Parquet drops the column-chunk statistics (and therefore Hudi mis-records
`nullCount=valueCount`) when `min.length() + max.length() >= 4096 bytes`.
This is from `parquet-mr 1.12.3`:
```java
// org/apache/parquet/format/converter/ParquetMetadataConverter.java:149
public static final long MAX_STATS_SIZE = 4096; // limit stats to 4k
// ParquetMetadataConverter.toParquetStatistics:707
if (!stats.isEmpty() && withinLimit(stats, truncateLength)) {
formatStats.setNull_count(stats.getNumNulls()); // null_count ONLY set
inside this guard
...
}
return formatStats; // empty formatStats returned when guard fails
// org/apache/parquet/column/statistics/BinaryStatistics.java:106
public boolean isSmallerThan(long size) {
return !hasNonNullValue() || ((min.length() + max.length()) < size);
}
```
When `withinLimit` returns false, `formatStats` stays empty — including
`null_count`. On read, parquet-mr reconstructs a Java `Statistics<?>` where
`isEmpty()=true, isNumNullsSet()=false, hasNonNullValue()=false`. Hudi's line
320 then ternaries on `isEmpty()=true` and records `nullCount = valueCount`.
### Empirical confirmation of the exact boundary
Same test rig as the original repro, varying only `min` and `max` lengths:
| min_len | max_len | sum | Result |
|---|---|---|---|
| 1 | 2000 | 2001 | stats kept |
| 1 | 4094 | **4095** | **stats kept** (tightest pass) |
| 1 | **4095** | **4096** | **stats DROPPED** (boundary) |
| 1 | 4100 | 4101 | stats dropped |
| 1 | 4096 | 4097 | stats dropped (original repro) |
The `<` (strict-less-than) in `isSmallerThan` is what makes sum=4096 exactly
already trip the drop.
### Side-finding: configuration overrides don't propagate to Hudi's writer
`parquet-mr`'s default `ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH
= 2147483647` (Integer.MAX_VALUE), effectively disabling truncation, which
makes `isSmallerThanWithTruncation(4096, MAX_VALUE)` also fail for any value
over ~4 KB.
I attempted to mitigate from the Spark/user side by overriding
`spark.sql.parquet.statisticsTruncateLength` and
`parquet.statistics.truncate.length` (both via `spark.conf.set(...)` and via
Hadoop conf). Neither override propagated to Hudi's Parquet writer — stats were
still dropped at the same threshold. Users have no working knob to avoid this
on the write side until the line-320 fix lands. Worth surfacing in the same PR.
### Real-world impact characterization
A single row in a file whose target column has a value contributing `min+max
>= 4 KB` is enough to trip the bug. Common real-world cases that hit this:
- A `description`, `notes`, `comment`, `payload`, `error_message`, or
`stack_trace` column where any one row exceeds ~4 KB.
- A serialized-JSON column where one row is a larger document.
- A composite primary key (`tenant_id + entity_path + uuid + ...`) where one
tenant's key is long.
- A URL column or similar where any one value exceeds ~4 KB.
Once such a row is written into a file, every equality query against any
column of that file silently returns zero rows.
### Suggested patch (refined)
Both the line-320 and the corresponding NaN-side fix
(`convertToNativeJavaType` returning fabricated 0.0 — see #18754) come from the
same misinterpretation: parquet-mr's empty Statistics is treated as "all-null
column." The fix needs to distinguish "stats genuinely absent" via
`isNumNullsSet()`:
```java
// hudi-hadoop-common/.../ParquetUtils.java readColumnStatsFromMetadata,
around line 320
long nullCount;
Object minVal = null;
Object maxVal = null;
if (stats.isEmpty()) {
if (stats.isNumNullsSet()) {
// genuine all-null column: parquet recorded num_nulls and chose
// not to write min/max for an all-null column
nullCount = columnChunkMetaData.getValueCount();
// min/max remain null
} else {
// stats unavailable (e.g. min.length + max.length >= 4096 in
parquet-mr's
// ParquetMetadataConverter.toParquetStatistics). Record "stats
unknown" so
// the data-skipping reader treats this column's stats as
unavailable.
nullCount = -1L; // sentinel; reader must interpret as "no skip
info"
// min/max remain null
}
} else {
nullCount = stats.getNumNulls();
minVal = convertToNativeJavaType(columnChunkMetaData.getPrimitiveType(),
stats.genericGetMin(), valueMetadata);
maxVal = convertToNativeJavaType(columnChunkMetaData.getPrimitiveType(),
stats.genericGetMax(), valueMetadata);
}
return HoodieColumnRangeMetadata.create(
filePath,
columnChunkMetaData.getPath().toDotString(),
(Comparable) minVal,
(Comparable) maxVal,
nullCount,
columnChunkMetaData.getValueCount(),
columnChunkMetaData.getTotalSize(),
columnChunkMetaData.getTotalUncompressedSize(),
valueMetadata);
```
Plus reader-side: the data-skipping evaluator must check for the "unknown"
sentinel (or equivalent state) on `HoodieColumnRangeMetadata` and bypass
stats-based pruning for that file × column.
Reproduction is unchanged from the original report; the threshold info is
just useful for triage and for users to scope their existing tables (`SELECT
COUNT(*) FROM tbl WHERE LENGTH(target_col) >= 4095` is a rough first-pass
check).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]