linliu-code commented on issue #18755:
URL: https://github.com/apache/hudi/issues/18755#issuecomment-4465130650

   ## Update — exact threshold confirmed via parquet-mr source + empirical 
bisection
   
   Following up on the root cause. I bisected the trigger condition end-to-end. 
The exact threshold is:
   
   > Parquet drops the column-chunk statistics (and therefore Hudi mis-records 
`nullCount=valueCount`) when `min.length() + max.length() >= 4096 bytes`.
   
   This is from `parquet-mr 1.12.3`:
   
   ```java
   // org/apache/parquet/format/converter/ParquetMetadataConverter.java:149
   public static final long MAX_STATS_SIZE = 4096; // limit stats to 4k
   
   // ParquetMetadataConverter.toParquetStatistics:707
   if (!stats.isEmpty() && withinLimit(stats, truncateLength)) {
       formatStats.setNull_count(stats.getNumNulls());  // null_count ONLY set 
inside this guard
       ...
   }
   return formatStats;  // empty formatStats returned when guard fails
   
   // org/apache/parquet/column/statistics/BinaryStatistics.java:106
   public boolean isSmallerThan(long size) {
       return !hasNonNullValue() || ((min.length() + max.length()) < size);
   }
   ```
   
   When `withinLimit` returns false, `formatStats` stays empty — including 
`null_count`. On read, parquet-mr reconstructs a Java `Statistics<?>` where 
`isEmpty()=true, isNumNullsSet()=false, hasNonNullValue()=false`. Hudi's line 
320 then ternaries on `isEmpty()=true` and records `nullCount = valueCount`.
   
   ### Empirical confirmation of the exact boundary
   
   Same test rig as the original repro, varying only `min` and `max` lengths:
   
   | min_len | max_len | sum | Result |
   |---|---|---|---|
   | 1 | 2000 | 2001 | stats kept |
   | 1 | 4094 | **4095** | **stats kept** (tightest pass) |
   | 1 | **4095** | **4096** | **stats DROPPED** (boundary) |
   | 1 | 4100 | 4101 | stats dropped |
   | 1 | 4096 | 4097 | stats dropped (original repro) |
   
   The `<` (strict-less-than) in `isSmallerThan` is what makes sum=4096 exactly 
already trip the drop.
   
   ### Side-finding: configuration overrides don't propagate to Hudi's writer
   
   `parquet-mr`'s default `ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH 
= 2147483647` (Integer.MAX_VALUE), effectively disabling truncation, which 
makes `isSmallerThanWithTruncation(4096, MAX_VALUE)` also fail for any value 
over ~4 KB.
   
   I attempted to mitigate from the Spark/user side by overriding 
`spark.sql.parquet.statisticsTruncateLength` and 
`parquet.statistics.truncate.length` (both via `spark.conf.set(...)` and via 
Hadoop conf). Neither override propagated to Hudi's Parquet writer — stats were 
still dropped at the same threshold. Users have no working knob to avoid this 
on the write side until the line-320 fix lands. Worth surfacing in the same PR.
   
   ### Real-world impact characterization
   
   A single row in a file whose target column has a value contributing `min+max 
>= 4 KB` is enough to trip the bug. Common real-world cases that hit this:
   
   - A `description`, `notes`, `comment`, `payload`, `error_message`, or 
`stack_trace` column where any one row exceeds ~4 KB.
   - A serialized-JSON column where one row is a larger document.
   - A composite primary key (`tenant_id + entity_path + uuid + ...`) where one 
tenant's key is long.
   - A URL column or similar where any one value exceeds ~4 KB.
   
   Once such a row is written into a file, every equality query against any 
column of that file silently returns zero rows.
   
   ### Suggested patch (refined)
   
   Both the line-320 and the corresponding NaN-side fix 
(`convertToNativeJavaType` returning fabricated 0.0 — see #18754) come from the 
same misinterpretation: parquet-mr's empty Statistics is treated as "all-null 
column." The fix needs to distinguish "stats genuinely absent" via 
`isNumNullsSet()`:
   
   ```java
   // hudi-hadoop-common/.../ParquetUtils.java readColumnStatsFromMetadata, 
around line 320
   long nullCount;
   Object minVal = null;
   Object maxVal = null;
   if (stats.isEmpty()) {
       if (stats.isNumNullsSet()) {
           // genuine all-null column: parquet recorded num_nulls and chose
           // not to write min/max for an all-null column
           nullCount = columnChunkMetaData.getValueCount();
           // min/max remain null
       } else {
           // stats unavailable (e.g. min.length + max.length >= 4096 in 
parquet-mr's
           // ParquetMetadataConverter.toParquetStatistics). Record "stats 
unknown" so
           // the data-skipping reader treats this column's stats as 
unavailable.
           nullCount = -1L;   // sentinel; reader must interpret as "no skip 
info"
           // min/max remain null
       }
   } else {
       nullCount = stats.getNumNulls();
       minVal = convertToNativeJavaType(columnChunkMetaData.getPrimitiveType(),
                                        stats.genericGetMin(), valueMetadata);
       maxVal = convertToNativeJavaType(columnChunkMetaData.getPrimitiveType(),
                                        stats.genericGetMax(), valueMetadata);
   }
   return HoodieColumnRangeMetadata.create(
       filePath,
       columnChunkMetaData.getPath().toDotString(),
       (Comparable) minVal,
       (Comparable) maxVal,
       nullCount,
       columnChunkMetaData.getValueCount(),
       columnChunkMetaData.getTotalSize(),
       columnChunkMetaData.getTotalUncompressedSize(),
       valueMetadata);
   ```
   
   Plus reader-side: the data-skipping evaluator must check for the "unknown" 
sentinel (or equivalent state) on `HoodieColumnRangeMetadata` and bypass 
stats-based pruning for that file × column.
   
   Reproduction is unchanged from the original report; the threshold info is 
just useful for triage and for users to scope their existing tables (`SELECT 
COUNT(*) FROM tbl WHERE LENGTH(target_col) >= 4095` is a rough first-pass 
check).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to