fhan688 opened a new pull request, #18882:
URL: https://github.com/apache/hudi/pull/18882
### Describe the issue this Pull Request addresses
Spark UI does not report input records or input bytes for Hudi Merge-On-Read
reads that are executed through Hudi's custom `HoodieMergeOnReadRDDV2` path.
The data is read correctly, but Spark task input metrics remain incomplete
because this custom RDD path bypasses Spark's standard file scan metric
accounting.
This makes it harder to inspect Hudi MOR read jobs from Spark UI and event
logs, especially for metadata-table reads and other paths that still use
`HoodieMergeOnReadRDDV2` directly.
No GitHub issue has been filed for this change.
### Summary and Changelog
Add Spark input metric reporting for `HoodieMergeOnReadRDDV2` so Spark UI
can show input records and input bytes for Hudi's custom MOR RDD read path.
Changes:
- Add `HoodieSparkInputMetricsUtils` under the `org.apache.spark` package to
access Spark's input metrics and filesystem bytes-read callback.
- Wrap the final `HoodieMergeOnReadRDDV2.compute()` result iterator so each
successfully returned row increments `taskMetrics.inputMetrics.recordsRead`.
- Capture Spark's `getFSBytesReadOnThreadCallback()` at task start and
increment `taskMetrics.inputMetrics.bytesRead` when the task completes.
- Preserve the existing close lifecycle by closing the original underlying
iterator on task completion, including when the returned iterator is a filtered
wrapper.
- Keep the change scoped to the custom RDD path. The
`HoodieFileGroupReaderBasedFileFormat` path is not changed, since Spark's
standard file scan execution owns metric accounting there.
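
The wrapping described above can be sketched without Spark on the classpath. This is a minimal illustration, not the code in this PR: `InputMetricsStub`, `CountingIterator`, `MetricsSketch.instrument`, and `DemoSource` are hypothetical names, and the bytes-read callback is assumed to return the bytes read on the current thread since it was captured. In the real change the counters would be Spark's task input metrics and the completion hook would be registered as a task completion listener.

```scala
import java.io.Closeable

// Hypothetical stand-in for Spark's task-level input metrics (not a Spark class).
final class InputMetricsStub {
  var recordsRead: Long = 0L
  var bytesRead: Long = 0L
}

// Wraps an iterator so that every row successfully handed to the caller
// increments recordsRead. If next() throws, the counter is left untouched.
final class CountingIterator[T](underlying: Iterator[T], metrics: InputMetricsStub)
    extends Iterator[T] {
  override def hasNext: Boolean = underlying.hasNext
  override def next(): T = {
    val row = underlying.next()
    metrics.recordsRead += 1
    row
  }
}

object MetricsSketch {
  // `returned` is the iterator handed back to the caller (possibly a filtered
  // wrapper); `resource` is the original underlying reader that must be closed.
  // `bytesReadCallback` is assumed to have been captured at task start.
  // The second tuple element is the hook to run once when the task completes.
  def instrument[T](
      returned: Iterator[T],
      resource: Closeable,
      metrics: InputMetricsStub,
      bytesReadCallback: () => Long): (Iterator[T], () => Unit) = {
    val onTaskCompletion = () => {
      metrics.bytesRead += bytesReadCallback() // single update per task
      resource.close()                         // close the original, not the wrapper
    }
    (new CountingIterator(returned, metrics), onTaskCompletion)
  }
}

// Hypothetical closeable row source used only to exercise the sketch.
final class DemoSource(rows: Seq[String]) extends Iterator[String] with Closeable {
  private val it = rows.iterator
  var closed = false
  override def hasNext: Boolean = it.hasNext
  override def next(): String = it.next()
  override def close(): Unit = closed = true
}
```

Passing the returned iterator and the closeable resource separately mirrors the filtered-wrapper case: only rows that survive the filter are counted, while the completion hook still closes the original reader.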
### Impact
User-facing behavior:
- Spark UI and Spark event logs can report input records and input bytes for
Hudi MOR reads that run through `HoodieMergeOnReadRDDV2`.
- No query result semantics change.
- No new configuration, public API, storage format, or table metadata change.
Performance:
- Low overhead: one metric increment per returned row and one bytes-read
metric update on task completion.
### Risk Level
low
The implementation is limited to Hudi's custom `HoodieMergeOnReadRDDV2`
read path. It does not add manual metrics to Spark `FileFormat` scan paths,
which avoids double-counting where Spark already owns input metric updates.
Verification performed:
```bash
mvn -pl hudi-spark-datasource/hudi-spark-common -am \
-DskipTests -DskipITs -Drat.skip=true -Dcheckstyle.skip=true test-compile
```
The build completed successfully, including Scala compilation and scalastyle
for `hudi-spark-common_2.12`.
### Documentation Update
none
No new user-facing option is introduced and no documented configuration
behavior changes. This only fills Spark task input metrics for an existing read
path.
### Contributor's checklist
- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable