[PR] feat(spark): add input records & bytes metrics [hudi]

via GitHub Fri, 29 May 2026 04:44:39 -0700


fhan688 opened a new pull request, #18882:
URL: https://github.com/apache/hudi/pull/18882


   ### Describe the issue this Pull Request addresses
   
   Spark UI does not report input records or input bytes for Hudi Merge-On-Read 
reads that are executed through Hudi's custom `HoodieMergeOnReadRDDV2` path. 
The data is read correctly, but Spark task input metrics remain incomplete 
because this custom RDD path bypasses Spark's standard file scan metric 
accounting.
   
   This makes it harder to inspect Hudi MOR read jobs from Spark UI and event 
logs, especially for metadata-table reads and other paths that still use 
`HoodieMergeOnReadRDDV2` directly.
   
   No GitHub issue is filed for this change.
   
   ### Summary and Changelog
   
   Add Spark input metric reporting for `HoodieMergeOnReadRDDV2` so Spark UI 
can show input records and input bytes for Hudi's custom MOR RDD read path.
   
   Changes:
   
   - Add `HoodieSparkInputMetricsUtils` under the `org.apache.spark` package to 
access Spark's input metrics and filesystem bytes-read callback.
   - Wrap the final `HoodieMergeOnReadRDDV2.compute()` result iterator so each 
successfully returned row increments `taskMetrics.inputMetrics.recordsRead`.
   - Capture Spark's `getFSBytesReadOnThreadCallback()` at task start and, when 
the task completes, add the bytes read that the callback reports to 
`taskMetrics.inputMetrics.bytesRead`.
   - Preserve the existing close lifecycle by closing the original underlying 
iterator on task completion, including when the returned iterator is a filtered 
wrapper.
   - Keep the change scoped to the custom RDD path. The 
`HoodieFileGroupReaderBasedFileFormat` path is not changed, since Spark's 
standard file scan execution owns metric accounting there.
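   
   The wrapping and completion steps above can be sketched roughly as follows. 
This is a minimal, Spark-free illustration and not the code in this PR: 
`InputMetricsSketch`, `InputMetrics`, `ClosableRows`, `countRecords`, 
`completionHook`, and `runDemo` are hypothetical names, and the byte counter is 
a simulated stand-in for the value Spark's filesystem callback would report.
   
   ```java
   import java.util.Arrays;
   import java.util.Iterator;
   import java.util.concurrent.atomic.AtomicLong;
   import java.util.function.LongSupplier;

   // Illustrative only: every type here is a hypothetical stand-in, not Spark or Hudi API.
   final class InputMetricsSketch {

     /** Stand-in for a task's input metrics (records and bytes read). */
     static final class InputMetrics {
       long recordsRead;
       long bytesRead;
     }

     /** Stand-in for a reader iterator that owns resources and must be closed. */
     static final class ClosableRows implements Iterator<String>, AutoCloseable {
       private final Iterator<String> rows;
       boolean closed;

       ClosableRows(String... rows) { this.rows = Arrays.asList(rows).iterator(); }
       @Override public boolean hasNext() { return rows.hasNext(); }
       @Override public String next() { return rows.next(); }
       @Override public void close() { closed = true; }
     }

     /** Wraps an iterator so each successfully returned row bumps recordsRead. */
     static <T> Iterator<T> countRecords(Iterator<T> underlying, InputMetrics metrics) {
       return new Iterator<T>() {
         @Override public boolean hasNext() { return underlying.hasNext(); }
         @Override public T next() {
           T row = underlying.next(); // if this throws, nothing is counted
           metrics.recordsRead++;
           return row;
         }
       };
     }

     /** Builds the task-completion hook: one bytes update, then close the original. */
     static Runnable completionHook(LongSupplier bytesSinceStart, InputMetrics metrics,
                                    ClosableRows original) {
       return () -> {
         metrics.bytesRead += bytesSinceStart.getAsLong();
         original.close(); // close the underlying iterator, not the wrapper
       };
     }

     /** Simulates one task: capture the byte baseline, read all rows, then complete. */
     static InputMetrics runDemo() {
       InputMetrics metrics = new InputMetrics();
       AtomicLong fsBytes = new AtomicLong(1000); // pretend thread-level filesystem counter
       long baseline = fsBytes.get();             // captured at task start
       LongSupplier bytesSinceStart = () -> fsBytes.get() - baseline;

       ClosableRows original = new ClosableRows("r1", "r2", "r3");
       Iterator<String> returned = countRecords(original, metrics);
       Runnable onComplete = completionHook(bytesSinceStart, metrics, original);

       while (returned.hasNext()) {
         returned.next();
         fsBytes.addAndGet(64); // pretend each row cost 64 bytes of file reads
       }
       onComplete.run();
       if (!original.closed) throw new IllegalStateException("original iterator was not closed");
       return metrics;
     }

     public static void main(String[] args) {
       InputMetrics m = runDemo();
       System.out.println("recordsRead=" + m.recordsRead + ", bytesRead=" + m.bytesRead);
     }
   }
   ```
   
   In this sketch the completion hook holds a reference to the original 
iterator rather than to the wrapper that is returned to the caller, so the 
underlying reader is still closed even if further wrappers are layered on top.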
   
   ### Impact
   
   User-facing behavior:
   
   - Spark UI and Spark event logs can report input records and input bytes for 
Hudi MOR reads that run through `HoodieMergeOnReadRDDV2`.
   - No query result semantics change.
   - No new configuration, public API, storage format, or table metadata change.
   
   Performance:
   
   - Low overhead: one metric increment per returned row and one bytes-read 
metric update on task completion.
   
   ### Risk Level
   
   low
   
   The implementation is limited to Hudi's custom `HoodieMergeOnReadRDDV2` 
read path. It does not add manual metrics to Spark `FileFormat` scan paths, 
which avoids double-counting where Spark already owns input metric updates.
   
   Verification performed:
   
   ```bash
   mvn -pl hudi-spark-datasource/hudi-spark-common -am \
     -DskipTests -DskipITs -Drat.skip=true -Dcheckstyle.skip=true test-compile
   ```
   
   The build completed successfully, including Scala compilation and scalastyle 
for `hudi-spark-common_2.12`.
   
   ### Documentation Update
   
   none
   
   No new user-facing option is introduced and no documented configuration 
behavior changes. This only fills Spark task input metrics for an existing read 
path.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


