Re: [I] SELECT count(*) on Hudi COW tables reads full file contents where vanilla Spark+Parquet reads only row counts from footers (Hudi 1.1.1, Spark 3.4) [hudi]

via GitHub Mon, 18 May 2026 01:58:11 -0700


linliu-code commented on issue #18769:
URL: https://github.com/apache/hudi/issues/18769#issuecomment-4476018537


   ## Update: fix implemented on master and validated locally
   
   Applied a fast-path patch to `HoodieFileGroupReaderBasedFileFormat` (which lives in the shared `hudi-spark-common` module, so the fix benefits the Spark 3.3 / 3.4 / 3.5 / 4.0 bundles when rebuilt). The patch:
   
   1. Adds an `if (isCount) readCountFromFooter(...)` branch before the 
existing per-file lambda's pattern match.
   2. `readCountFromFooter` reads only the parquet footer 
(`ParquetFileReader.readFooter(..., NO_FILTER)`), sums 
`BlockMetaData.getRowCount()` across row groups, and emits either 
`ColumnarBatch` (when the downstream is vectorized) or `InternalRow`. Partition 
columns are populated as constants from `file.partitionValues` so codegen that 
touches column[i] still sees valid data.
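
As a rough illustration of the two steps above, here is a simplified, language-neutral model in Python. It is not the actual Scala patch: `RowGroupMeta`, `CountBatch`, and `read_file` are hypothetical stand-ins, and the footer is represented as an already-parsed list of row-group metadata rather than read from a real Parquet file.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class RowGroupMeta:
    """Hypothetical stand-in for one row group's footer metadata."""
    row_count: int


@dataclass(frozen=True)
class CountBatch:
    """Hypothetical result: a row count plus constant partition values, no column data."""
    num_rows: int
    partition_values: Tuple[str, ...]


def read_count_from_footer(row_groups: Sequence[RowGroupMeta],
                           partition_values: Sequence[str]) -> CountBatch:
    # Sum the per-row-group counts; no data pages are touched.
    total = sum(rg.row_count for rg in row_groups)
    # Partition values are carried along as constants.
    return CountBatch(num_rows=total, partition_values=tuple(partition_values))


def read_file(is_count: bool,
              row_groups: Sequence[RowGroupMeta],
              partition_values: Sequence[str],
              full_read: Callable[[], object]) -> object:
    # Fast path: answer count queries from footer metadata alone.
    if is_count:
        return read_count_from_footer(row_groups, partition_values)
    # Otherwise fall through to the existing (full) read path.
    return full_read()
```

The point of the model is only the control flow: the count branch is taken before any per-file reading happens, and every other query shape reaches the unchanged path.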
   
   Patch is +84 / -2 lines: one method added, one branch in the lambda, and three imports (`HadoopFSUtils`, `ParquetMetadataConverter`, `ParquetFileReader`), plus two more (`ConstantColumnVector` and `ColumnVectorUtils`) for the columnar path.
   
   ### Validation against the same probe as the issue body
   
   Built `hudi-spark3.4-bundle_2.12-1.3.0-SNAPSHOT.jar` from patched master and 
ran the original count(*) probe at two scales:
   
   | Scale | partitions × rows/part | Hudi count | Raw count | Hudi wall | Raw wall | Wall ratio |
   |---|---|---|---|---|---|---|
   | S | 1000 × 10 | 10,000 ✓ | 10,000 ✓ | 313 ms | 296 ms | **1.06×** |
   | L | 100 × 10,000 | 1,000,000 ✓ | 1,000,000 ✓ | 73 ms | 59 ms | **1.24×** |
   
   Compare to pre-patch ratios from the issue body: 2.76× at S, 2.18× at L. The 
fix essentially closes the wall-clock gap.
   
   `bytesRead` is also approximately halved (882 MB → 441 MB at S; 88 MB → 44 MB at L), but the residual ~50% appears to come from Hudi's larger embedded footer (col-stats, bloom filter) plus driver-side MDT reads, neither of which is in this issue's scope.
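
For anyone re-deriving the numbers, the wall ratios and the `bytesRead` reduction follow directly from the figures quoted above. A minimal sketch of the arithmetic (the variable and function names are illustrative only):

```python
# (Hudi wall ms, raw Parquet wall ms) after the patch, from the table above.
post_patch_ms = {"S": (313, 296), "L": (73, 59)}

# Hudi bytesRead in MB: (before patch, after patch).
bytes_read_mb = {"S": (882, 441), "L": (88, 44)}


def wall_ratio(hudi_ms: float, raw_ms: float) -> float:
    """Hudi wall time relative to raw Parquet, rounded as in the table."""
    return round(hudi_ms / raw_ms, 2)


def bytes_fraction(before_mb: float, after_mb: float) -> float:
    """Fraction of the pre-patch bytes still read after the patch."""
    return after_mb / before_mb


for scale, (hudi_ms, raw_ms) in post_patch_ms.items():
    before_mb, after_mb = bytes_read_mb[scale]
    print(scale, wall_ratio(hudi_ms, raw_ms), bytes_fraction(before_mb, after_mb))
```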
   
   ### Sanity
   
   On a small 50-row table with the patched bundle:
   - `SELECT count(*) WHERE rk<10` → 10 (non-count path with filter, untouched 
by the patch)
   - `SELECT sum(val)` → 1225 (column-access aggregation)
   - `SELECT * LIMIT 5` → correct row values
   
   Non-count queries unaffected; the patch only adds an `if (isCount)` branch 
and falls through to the existing code path otherwise.
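
The expected values in the sanity list are consistent with a table where `rk` and `val` both run from 0 to 49. That population is an assumption (the comment does not say how the table was built), and the sketch below uses an in-memory SQLite table named `t` purely as a stand-in for the Hudi table, to show where 10 and 1225 come from.

```python
import sqlite3

# Assumption: 50 rows with rk = val = 0..49; the table name `t` is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (rk INTEGER, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i) for i in range(50)])

# Filtered count: rows with rk in 0..9.
count_filtered = conn.execute(
    "SELECT count(*) FROM t WHERE rk < 10"
).fetchone()[0]

# Column-access aggregation: 0 + 1 + ... + 49.
sum_val = conn.execute("SELECT sum(val) FROM t").fetchone()[0]

# Plain projection with a limit (ordered here so the result is deterministic).
first_five = conn.execute("SELECT * FROM t ORDER BY rk LIMIT 5").fetchall()

print(count_filtered, sum_val, first_five)
```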
   
   Happy to open a PR with this patch if it would be useful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
