Re: [PR] perf(spark): fast-path SELECT count(*) on COW tables via parquet footer row counts (#18769) [hudi]

via GitHub Mon, 18 May 2026 20:25:52 -0700


linliu-code commented on PR #18770:
URL: https://github.com/apache/hudi/pull/18770#issuecomment-4484130053


   Added a test class covering the fast path. **All five tests pass locally** (`tests=5 errors=0 failures=0`, 18.9s).
   
   
`hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCountStarFastPath.scala`:
   
   | Test | What it proves |
   |---|---|
   | `testCountStarPartitionedCOW` | Basic correctness on a partitioned COW table. |
   | `testCountStarUnpartitionedCOW` | Basic correctness on an unpartitioned table. |
   | `testCountStarOnSplittableFiles` | **Regression for the split-aware filter** (reviewer's catch). Forces splits via small `spark.sql.files.maxPartitionBytes` + small `hoodie.parquet.block.size`. Verifies the FileScan ends up with **more partitions than files** (so splits actually happened) and that `count(*)` is still exact. Without the row-group-range filter this would over-count by the split factor. |
   | `testCountStarWithFilterRoutesThroughSlowPath` | Gate correctness: a filter makes `requiredSchema` non-empty so the fast path doesn't fire; count is still correct via the regular read path. |
   | `testCountStarFastPathReadsLessThanFullScan` | **Direct proof the fast path is exercised.** A `SparkListener` tracks `inputMetrics.bytesRead` across all tasks of a `count(*)` and an equivalent `SELECT *` on the same table. Asserts `count(*).bytesRead < SELECT *.bytesRead / 2`. If the fast path were not taken (i.e., routing through `readBaseFile` + vectorized reader), the two queries would read comparable bytes. The actual ratio in practice is much larger than 2×; the threshold is loose to absorb MDT-setup noise in CI. |
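
   For readers following the `testCountStarOnSplittableFiles` row, here is a minimal pure-Python model of why the split-aware filter matters. It is a sketch under assumptions, not the code in this PR: `RowGroup`, `naive_count`, and `split_aware_count` are hypothetical names, the offsets are made up, and ownership is decided by the row group's byte midpoint, which is one common convention for range filters; the PR may use a different rule.

   ```python
   from dataclasses import dataclass
   from typing import List, Tuple


   @dataclass(frozen=True)
   class RowGroup:
       """Hypothetical stand-in for one row group's footer metadata."""
       start: int     # byte offset where the row group begins
       size: int      # size of the row group in bytes
       num_rows: int  # row count recorded in the footer


   Split = Tuple[int, int]  # half-open byte range [lo, hi) handled by one task


   def naive_count(row_groups: List[RowGroup], splits: List[Split]) -> int:
       """Each split sums every row group in the footer (the over-counting bug)."""
       return sum(rg.num_rows for _ in splits for rg in row_groups)


   def split_aware_count(row_groups: List[RowGroup], splits: List[Split]) -> int:
       """Each split counts only the row groups whose midpoint falls in its range."""
       total = 0
       for lo, hi in splits:
           for rg in row_groups:
               midpoint = rg.start + rg.size // 2
               if lo <= midpoint < hi:
                   total += rg.num_rows
       return total


   # One file with three row groups (25 rows total), read as three splits.
   row_groups = [RowGroup(4, 100, 10), RowGroup(104, 100, 10), RowGroup(204, 100, 5)]
   splits = [(0, 128), (128, 256), (256, 320)]

   print(split_aware_count(row_groups, splits))  # 25
   print(naive_count(row_groups, splits))        # 75
   ```

   In this model every row group is owned by exactly one split, so the per-split counts add up to the file's true row count, while the naive version is inflated by the number of splits (3× here).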
   
   Pushed as commit `63aa982`.


