linliu-code commented on PR #18770: URL: https://github.com/apache/hudi/pull/18770#issuecomment-4484130053
Added a test class covering the fast path. **All five tests pass locally** (`tests=5 errors=0 failures=0`, 18.9s).

`hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCountStarFastPath.scala`:

| Test | What it proves |
|---|---|
| `testCountStarPartitionedCOW` | Basic correctness on a partitioned COW table. |
| `testCountStarUnpartitionedCOW` | Basic correctness on an unpartitioned table. |
| `testCountStarOnSplittableFiles` | **Regression for the split-aware filter** (reviewer's catch). Forces splits via small `spark.sql.files.maxPartitionBytes` + small `hoodie.parquet.block.size`. Verifies the FileScan ends up with **more partitions than files** (so splits actually happened) and that `count(*)` is still exact. Without the row-group-range filter this would over-count by the split factor. |
| `testCountStarWithFilterRoutesThroughSlowPath` | Gate correctness: a filter makes `requiredSchema` non-empty so the fast path doesn't fire; count is still correct via the regular read path. |
| `testCountStarFastPathReadsLessThanFullScan` | **Direct proof the fast path is exercised.** A `SparkListener` tracks `inputMetrics.bytesRead` across all tasks of a `count(*)` and an equivalent `SELECT *` on the same table. Asserts `count(*).bytesRead < SELECT *.bytesRead / 2`. If the fast path were not taken (i.e., routing through `readBaseFile` + vectorized reader), the two queries would read comparable bytes. The actual ratio in practice is much larger than 2×; the threshold is loose to absorb MDT-setup noise in CI. |

Pushed as commit `63aa982`.
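
For readers following along, the over-count that `testCountStarOnSplittableFiles` guards against can be illustrated with a small, Spark-free model. This is only a sketch and not the code in this PR: `RowGroup`, `Split`, and `SplitAwareCount` are hypothetical names, and assigning each row group to the split that contains its midpoint is an assumed convention (the actual filter in the PR may use a different rule).

```scala
// Hypothetical model types for illustration; these are not Hudi or Parquet classes.
final case class RowGroup(startOffset: Long, compressedSize: Long, rowCount: Long) {
  // Byte offset halfway through the row group.
  def midpoint: Long = startOffset + compressedSize / 2
}

final case class Split(start: Long, length: Long) {
  // Half-open byte range [start, start + length).
  def contains(offset: Long): Boolean = offset >= start && offset < start + length
}

object SplitAwareCount {
  // Naive footer-only count: every split reports the whole file's row count,
  // so the sum over splits is inflated by the number of splits.
  def naiveCount(rowGroups: Seq[RowGroup], splits: Seq[Split]): Long =
    splits.map(_ => rowGroups.map(_.rowCount).sum).sum

  // Split-aware count: a split only counts the row groups whose midpoint
  // falls inside its byte range, so each row group is counted exactly once
  // as long as the splits tile the file without gaps or overlaps.
  def splitAwareCount(rowGroups: Seq[RowGroup], splits: Seq[Split]): Long =
    splits.map { s =>
      rowGroups.filter(rg => s.contains(rg.midpoint)).map(_.rowCount).sum
    }.sum
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Three row groups (60 rows total) in a file covered by three 128-byte splits.
    val rowGroups = Seq(RowGroup(4, 100, 10), RowGroup(104, 100, 20), RowGroup(204, 100, 30))
    val splits = Seq(Split(0, 128), Split(128, 128), Split(256, 128))

    val naive = SplitAwareCount.naiveCount(rowGroups, splits)
    val exact = SplitAwareCount.splitAwareCount(rowGroups, splits)

    assert(naive == 180L) // 3 splits x 60 rows: over-counted by the split factor
    assert(exact == 60L)  // each row group is owned by exactly one split
    println(s"naive=$naive exact=$exact")
  }
}
```

In this model the third split owns no row group and contributes zero, which is the case the test forces by making partitions outnumber files.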
