hudi-bot opened a new issue, #17200:
URL: https://github.com/apache/hudi/issues/17200
h2. Summary
Currently, {{PrometheusReporter}} stops the shared HTTP server whenever any
table stops, which breaks metrics for the other tables in multi-table
scenarios. This enhancement adds reference counting so that the server is
stopped only when no table is still using it.
h2. Current Behavior
Multiple tables can share the same Prometheus server during startup (thanks
to HUDI-7083), but when any table stops, the entire server is stopped and
removed from the static maps. The remaining tables can then no longer expose
their existing metrics or emit new ones. This works for
single-table-per-job scenarios but breaks multi-table use cases.
h2. Problem
Users running multiple Hudi tables in a single Spark job with pause/resume
functionality experience broken metrics when individual tables are paused or
stopped.
h2. Current Behavior (Problematic Flow)
# Table A starts → Creates Prometheus server on port 9091
# Table B starts → Reuses same server (works thanks to HUDI-7083)
# *Table A pauses* → Calls {{PrometheusReporter.stop()}} → *Stops entire server*
# Table B continues → No server available → *Metrics broken*
h2. Proposed Enhancement
Add reference counting to track how many tables are using each server port:
* Add {{PORT_TO_REFERENCE_COUNT}} and {{PORT_TO_EXPORTS}} maps
* Increment the count when a table starts using the server
* Decrement the count when a table stops, and stop the server only when the count reaches zero
* Add utility methods for debugging ({{isServerRunning}}, {{getReferenceCount}})
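The counting logic listed above could look roughly like the following. This is a minimal sketch, not the actual Hudi change: the class name {{PortReferenceCounter}} and the methods {{acquire}} and {{release}} are hypothetical, {{PORT_TO_EXPORTS}} is not modeled, and {{isServerRunning}} is simply derived from the count here rather than from a real server handle.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the per-port bookkeeping; not the actual Hudi code.
class PortReferenceCounter {
  // Mirrors the proposed PORT_TO_REFERENCE_COUNT map: port -> number of tables using it.
  private static final Map<Integer, Integer> PORT_TO_REFERENCE_COUNT = new ConcurrentHashMap<>();

  // Called when a table starts using the server on this port; returns the new count.
  static int acquire(int port) {
    return PORT_TO_REFERENCE_COUNT.merge(port, 1, Integer::sum);
  }

  // Called when a table stops. Returns true only when the last reference was
  // released, i.e. when the caller should actually stop the server.
  static boolean release(int port) {
    boolean[] last = {false};
    // computeIfPresent runs atomically per key; returning null removes the entry.
    PORT_TO_REFERENCE_COUNT.computeIfPresent(port, (p, count) -> {
      if (count <= 1) {
        last[0] = true;
        return null;
      }
      return count - 1;
    });
    return last[0];
  }

  static int getReferenceCount(int port) {
    return PORT_TO_REFERENCE_COUNT.getOrDefault(port, 0);
  }

  // Simplification for this sketch: a port counts as "running" while it has references.
  static boolean isServerRunning(int port) {
    return getReferenceCount(port) > 0;
  }
}
```

With two tables on the same port, the first {{release}} returns false and the server stays up; only the second one reports that the server should be stopped.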
h2. Future Behavior
# Table A starts → Server starts, reference count = 1
# Table B starts → Server reused, reference count = 2
# *Table A pauses* → Reference count = 1, *server stays alive*
# Table B continues → *Metrics continue working*
# Table B stops → Reference count = 0, server stops gracefully
h2. Implementation Details
The implementation involves modifying the {{PrometheusReporter}} constructor
to increment reference counts and register exports in a thread-safe manner. The
{{stop()}} method would be updated to decrement reference counts and only perform
cleanup when no references remain. We would also add utility methods for
debugging and monitoring server status. All changes would be backward
compatible, ensuring single-table usage continues to work exactly as before.
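
As an illustration of the constructor and {{stop()}} changes described above, here is a hedged sketch that uses a fake server in place of the real Prometheus HTTP server. The names {{FakeMetricsServer}}, {{RefCountedReporterSketch}}, {{PORT_TO_SERVER}}, and {{LOCK}} are hypothetical, and the per-instance {{stopped}} flag is an extra assumption (so a repeated {{stop()}} does not decrement twice) that the proposal itself does not mention.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the embedded HTTP server.
class FakeMetricsServer {
  final int port;
  boolean running = true;

  FakeMetricsServer(int port) {
    this.port = port;
  }

  void close() {
    running = false;
  }
}

// Sketch of a reporter whose constructor and stop() share static, lock-guarded state.
class RefCountedReporterSketch {
  private static final Object LOCK = new Object();
  private static final Map<Integer, FakeMetricsServer> PORT_TO_SERVER = new HashMap<>();
  private static final Map<Integer, Integer> PORT_TO_REFERENCE_COUNT = new HashMap<>();

  private final int port;
  // Assumption, not part of the proposal: guards against double-decrement.
  private boolean stopped = false;

  RefCountedReporterSketch(int port) {
    this.port = port;
    synchronized (LOCK) {
      // Start the server only for the first table on this port, then count the reference.
      PORT_TO_SERVER.computeIfAbsent(port, FakeMetricsServer::new);
      PORT_TO_REFERENCE_COUNT.merge(port, 1, Integer::sum);
    }
  }

  void stop() {
    synchronized (LOCK) {
      if (stopped) {
        return;
      }
      stopped = true;
      int remaining = PORT_TO_REFERENCE_COUNT.merge(port, -1, Integer::sum);
      if (remaining <= 0) {
        // Last reference released: clean up the maps and stop the server.
        PORT_TO_REFERENCE_COUNT.remove(port);
        FakeMetricsServer server = PORT_TO_SERVER.remove(port);
        if (server != null) {
          server.close();
        }
      }
    }
  }

  static boolean isServerRunning(int port) {
    synchronized (LOCK) {
      FakeMetricsServer server = PORT_TO_SERVER.get(port);
      return server != null && server.running;
    }
  }

  static int getReferenceCount(int port) {
    synchronized (LOCK) {
      return PORT_TO_REFERENCE_COUNT.getOrDefault(port, 0);
    }
  }
}
```

A single coarse lock keeps the server map and the count map consistent with each other, at the cost of serializing reporter creation and shutdown. For a single table the count goes 1→0 on {{stop()}}, so the server is closed exactly as before.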
h2. Backward Compatibility
* Single table usage unchanged - works exactly as before
* No breaking changes
* Reference count goes 1→0 for single table, server stops as before
h2. Use Cases
* Multi-table Spark jobs with pause/resume functionality
* Shared Prometheus server across multiple Hudi tables
* Better resource management for Prometheus servers
h2. Testing
* *Multi-table reference counting*: Basic 2-table scenario
* *Concurrent access*: 10 threads creating reporters simultaneously
* *Port isolation*: Different ports work independently
* *Partial failure*: One table fails while others continue
* *Thread safety*: All operations properly synchronized
* *Backward compatibility*: Single table behavior unchanged
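
The concurrent-access and port-isolation cases above might be exercised along these lines. This is only a sketch under stated assumptions: {{ConcurrentAccessSketch}}, {{COUNTS}}, and {{registerConcurrently}} are hypothetical names, and a plain per-port counter stands in for constructing real reporters.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ConcurrentAccessSketch {
  // Hypothetical per-port counter standing in for the reporter's static state.
  static final Map<Integer, Integer> COUNTS = new ConcurrentHashMap<>();

  // Starts `threads` workers that all register on the same port at once and
  // returns the reference count for that port after they have finished.
  static int registerConcurrently(int port, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch startGate = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(threads);
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        try {
          startGate.await(); // hold every worker until all are queued
          COUNTS.merge(port, 1, Integer::sum); // atomic increment per port
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          done.countDown();
        }
      });
    }
    startGate.countDown(); // release all workers together
    try {
      if (!done.await(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    } finally {
      pool.shutdown();
    }
    return COUNTS.getOrDefault(port, 0);
  }
}
```

If the increment were not atomic, the returned count could fall below the number of threads; running the same helper against a second port checks that the counts stay independent.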
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-9793
- Type: Improvement