linliu-code commented on issue #18769: URL: https://github.com/apache/hudi/issues/18769#issuecomment-4475772337
## Update: cross-version replay, same issue in 0.15.0 and 0.15.1-rc1

Ran the same probe (100 partitions × 10K rows COW, count(*) with MDT + col-stats + data skipping enabled) against three Hudi bundles, all Spark 3.4.3 / Scala 2.12 / Java 11:

| Version | Files | On-disk | Wall (median) | bytesRead (median) | Amp vs disk |
|---|---|---|---|---|---|
| 0.15.0 | 100 | 51.0 MB | 177 ms | 84.3 MB | 1.65× |
| 0.15.1-rc1 | 100 | 51.1 MB | 168 ms | 84.3 MB | 1.65× |
| 1.1.1 | 100 | 51.0 MB | 209 ms | 84.3 MB | 1.65× |

Raw parquet baseline at this scale (from the body's measurements): bytesRead ≈ 376 KB, so the bytesRead-vs-raw ratio is ~224× for all three Hudi versions.

**Two takeaways:**

1. **Not a 1.x regression.** The missing count(*) fast-path goes back to at least 0.15.0. The implementation moved from `HoodieParquetFileFormat` (0.15.x) to `HoodieFileGroupReaderBasedFileFormat` (1.x), but neither implementation short-circuits on `requiredSchema.isEmpty`. If a backport is desired, the 0.15.x reader needs an analogous fix.
2. **1.1.1 has ~20% more wall time at the same bytesRead** vs 0.15.x (209 ms vs 168 to 177 ms). This is likely CPU overhead in the new file-group-reader wrapper path, not a bytesRead difference. Probably worth a separate look, but secondary to this issue.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
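
For anyone who wants to re-check the ratios quoted in the comment above, here is a minimal sketch that recomputes them from the table values. The helper names (`amp_vs_disk`, `amp_vs_raw`, `wall_overhead_pct`) are hypothetical and not part of Hudi. It also assumes decimal units (1 MB = 1000 KB), which is the convention that reproduces the quoted ~224× figure.

```python
# Figures copied from the table in the comment above.
# Assumption: decimal units (1 MB = 1000 KB); this reproduces the quoted ~224x.
RAW_PARQUET_BYTES_READ_KB = 376.0

# version -> (on-disk MB, median wall ms, median bytesRead MB)
RUNS = {
    "0.15.0": (51.0, 177, 84.3),
    "0.15.1-rc1": (51.1, 168, 84.3),
    "1.1.1": (51.0, 209, 84.3),
}


def amp_vs_disk(on_disk_mb, bytes_read_mb):
    """Read amplification relative to the table's on-disk size."""
    return bytes_read_mb / on_disk_mb


def amp_vs_raw(bytes_read_mb, raw_kb=RAW_PARQUET_BYTES_READ_KB):
    """Read amplification relative to the raw parquet baseline."""
    return bytes_read_mb * 1000.0 / raw_kb


def wall_overhead_pct(wall_ms, baseline_ms):
    """Extra wall time over a baseline run, in percent."""
    return (wall_ms / baseline_ms - 1.0) * 100.0


if __name__ == "__main__":
    for version, (disk_mb, wall_ms, read_mb) in RUNS.items():
        print(
            f"{version}: amp vs disk = {amp_vs_disk(disk_mb, read_mb):.2f}x, "
            f"amp vs raw = {amp_vs_raw(read_mb):.0f}x, wall = {wall_ms} ms"
        )
    # Wall-time overhead of 1.1.1 relative to each 0.15.x run.
    wall_111 = RUNS["1.1.1"][1]
    for base in ("0.15.0", "0.15.1-rc1"):
        pct = wall_overhead_pct(wall_111, RUNS[base][1])
        print(f"1.1.1 vs {base}: {pct:+.1f}% wall")
```

The "~20% more wall" figure sits between the two per-version overheads this computes, since the two 0.15.x runs differ slightly from each other.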
