linliu-code commented on issue #18769: URL: https://github.com/apache/hudi/issues/18769#issuecomment-4476018537
## Update: fix implemented on master and validated locally

Applied a fast-path patch to `HoodieFileGroupReaderBasedFileFormat`. The class lives in the shared `hudi-spark-common` module, so the fix benefits the Spark 3.3 / 3.4 / 3.5 / 4.0 bundles when rebuilt. The patch:

1. Adds an `if (isCount) readCountFromFooter(...)` branch before the pattern match in the existing per-file lambda.
2. `readCountFromFooter` reads only the parquet footer (`ParquetFileReader.readFooter(..., NO_FILTER)`), sums `BlockMetaData.getRowCount()` across row groups, and emits either a `ColumnarBatch` (when the downstream is vectorized) or `InternalRow`s. Partition columns are populated as constants from `file.partitionValues`, so codegen that touches column[i] still sees valid data.

Patch is +84 / -2 lines: one method added, one branch in the lambda, and three imports (`HadoopFSUtils`, `ParquetMetadataConverter`, `ParquetFileReader`), plus `ConstantColumnVector` and `ColumnVectorUtils` for the columnar path.

### Validation against the same probe as the issue body

Built `hudi-spark3.4-bundle_2.12-1.3.0-SNAPSHOT.jar` from patched master and ran the original count(*) probe at two scales:

| Scale | partitions × rows/part | Hudi count | Raw count | Hudi wall | Raw wall | Wall ratio |
|---|---|---|---|---|---|---|
| S | 1000 × 10 | 10,000 ✓ | 10,000 ✓ | 313 ms | 296 ms | **1.06×** |
| L | 100 × 10,000 | 1,000,000 ✓ | 1,000,000 ✓ | 73 ms | 59 ms | **1.24×** |

Compare to the pre-patch ratios from the issue body: 2.76× at S, 2.18× at L. The fix essentially closes the wall-clock gap.

bytesRead is also approximately halved (882 MB → 441 MB at S; 88 MB → 44 MB at L). The residual ~50% appears to come from Hudi's larger embedded footer (col-stats, bloom filter) plus driver-side MDT reads, neither of which is in this issue's scope.
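For reference, the footer-only count in step 2 of the patch can be sketched roughly as follows. This is a minimal standalone illustration, not the patch itself: the object name `FooterRowCount` and the method `footerRowCount` are hypothetical, and the sketch does not model the `isCount` check, the partition-column constants, or the `ColumnarBatch` / `InternalRow` emission. It assumes `parquet-hadoop` and the Hadoop client are on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

import scala.collection.JavaConverters._

// Hypothetical helper for illustration only; names are not taken from the patch.
object FooterRowCount {

  /** Sums the row counts stored in the row-group metadata of one parquet file. */
  def footerRowCount(conf: Configuration, path: Path): Long = {
    // NO_FILTER keeps the metadata of every row group; only the footer is read,
    // no column chunks or data pages.
    val footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER)
    footer.getBlocks.asScala.map(_.getRowCount).sum
  }

  def main(args: Array[String]): Unit = {
    // Usage: pass the path of a parquet file as the first argument.
    println(footerRowCount(new Configuration(), new Path(args(0))))
  }
}
```

`readFooter(Configuration, Path, MetadataFilter)` is deprecated in recent parquet-hadoop releases but still available; it is used here only because it is the call named in the patch description.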
### Sanity

On a small 50-row table with the patched bundle:

- `SELECT count(*) WHERE rk<10` → 10 (non-count path with filter, untouched by the patch)
- `SELECT sum(val)` → 1225 (column-access aggregation)
- `SELECT * LIMIT 5` → correct row values

Non-count queries are unaffected; the patch only adds an `if (isCount)` branch and falls through to the existing code path otherwise.

Happy to open a PR with this patch if it would be useful.
