[I] Improve PrometheusReporter to support multi-table scenarios with reference counting [hudi]

via GitHub Sun, 30 Nov 2025 05:11:47 -0800


hudi-bot opened a new issue, #17200:
URL: https://github.com/apache/hudi/issues/17200


   h2. Summary
   
   Currently, {{PrometheusReporter}} stops the HTTP server when any table 
stops, breaking metrics for other tables in multi-table scenarios. This 
enhancement adds reference counting so that the server stops only when no 
table is still using it.
   h2. Current Behavior
   
   Multiple tables can share the same Prometheus server during startup (thanks 
to HUDI-7083), but when any table stops, the entire server is stopped and 
removed from the static maps. The remaining tables can then no longer expose 
or emit metrics. This works for single-table-per-job scenarios but breaks 
multi-table use cases.
   h2. Problem
   
   Users running multiple Hudi tables in a single Spark job with pause/resume 
functionality experience broken metrics when individual tables are paused or 
stopped.
   h2. Problematic Flow
    # Table A starts → Creates Prometheus server on port 9091
    # Table B starts → Reuses same server (works thanks to HUDI-7083)
    # *Table A pauses* → Calls {{PrometheusReporter.stop()}} → *Stops entire 
server*
    # Table B continues → No server available → *Metrics broken*
   
   h2. Proposed Enhancement
   
   Add reference counting to track how many tables are using each server port:
    * Add {{PORT_TO_REFERENCE_COUNT}} and {{PORT_TO_EXPORTS}} maps
    * Increment the count when a table starts using the server
    * Decrement the count when a table stops; stop the server only when the count reaches zero
    * Add utility methods for debugging ({{isServerRunning}}, {{getReferenceCount}})
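
A minimal sketch of the per-port counting described in the list above may help. It is not the actual Hudi change: the class name {{PortReferenceCounter}} and the {{acquire}}/{{release}} methods are hypothetical, the value type of the map is an assumption, and no real HTTP server is started. Only {{PORT_TO_REFERENCE_COUNT}}, {{isServerRunning}} and {{getReferenceCount}} are names taken from the proposal.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of per-port reference counting.
 * The real reporter would start and stop an HTTP server at the marked points.
 */
final class PortReferenceCounter {
  // Map name from the proposal; Integer keys and values are an assumption.
  private static final Map<Integer, Integer> PORT_TO_REFERENCE_COUNT = new HashMap<>();

  private PortReferenceCounter() {
  }

  /** Registers one more user of the port and returns the new count. */
  static synchronized int acquire(int port) {
    // A result of 1 means this caller is the first user and would start the server.
    return PORT_TO_REFERENCE_COUNT.merge(port, 1, Integer::sum);
  }

  /** Returns true only when the last reference was released, i.e. the server should stop. */
  static synchronized boolean release(int port) {
    Integer count = PORT_TO_REFERENCE_COUNT.get(port);
    if (count == null) {
      return false; // nothing registered on this port
    }
    if (count <= 1) {
      PORT_TO_REFERENCE_COUNT.remove(port);
      return true;
    }
    PORT_TO_REFERENCE_COUNT.put(port, count - 1);
    return false;
  }

  static synchronized int getReferenceCount(int port) {
    return PORT_TO_REFERENCE_COUNT.getOrDefault(port, 0);
  }

  /** In this sketch a server counts as running while at least one reference is held. */
  static synchronized boolean isServerRunning(int port) {
    return PORT_TO_REFERENCE_COUNT.containsKey(port);
  }
}
```

With two tables on port 9091, {{acquire}} returns 1 and then 2, the first {{release}} returns false (the server stays up), and the second returns true.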
   
   h2. Future Behavior
    # Table A starts → Server starts, reference count = 1
    # Table B starts → Server reused, reference count = 2
    # *Table A pauses* → Reference count = 1, *server stays alive*
    # Table B continues → *Metrics continue working*
    # Table B stops → Reference count = 0, server stops gracefully
   
   h2. Implementation Details
   
   The {{PrometheusReporter}} constructor would increment the reference count 
for its port and register exports in a thread-safe manner. The {{stop()}} 
method would decrement the count and perform cleanup only when no references 
remain. Utility methods would be added for debugging and monitoring server 
status. All changes would be backward compatible, so single-table usage 
continues to work exactly as before.
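
The constructor and {{stop()}} wiring could look roughly like the sketch below. Everything here is illustrative: {{SketchServer}} and {{SketchReporter}} are hypothetical stand-ins (no port is bound and no exports are registered), and the {{stopped}} flag that makes a repeated {{stop()}} harmless is an assumption that the proposal does not mention.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the shared HTTP server; the real one would bind a port. */
final class SketchServer {
  private boolean running = true;

  boolean isRunning() {
    return running;
  }

  void close() {
    running = false;
  }
}

/** Hypothetical reporter: one instance per table, one shared server per port. */
final class SketchReporter {
  private static final Object LOCK = new Object();
  private static final Map<Integer, SketchServer> PORT_TO_SERVER = new HashMap<>();
  private static final Map<Integer, Integer> PORT_TO_REFERENCE_COUNT = new HashMap<>();

  private final int port;
  private final SketchServer server;
  // Assumption: guards against a second stop() on the same reporter decrementing twice.
  private boolean stopped = false;

  SketchReporter(int port) {
    this.port = port;
    synchronized (LOCK) {
      // Reuse the server for this port if one exists, otherwise create it.
      this.server = PORT_TO_SERVER.computeIfAbsent(port, p -> new SketchServer());
      PORT_TO_REFERENCE_COUNT.merge(port, 1, Integer::sum);
    }
  }

  void stop() {
    synchronized (LOCK) {
      if (stopped) {
        return;
      }
      stopped = true;
      int remaining = PORT_TO_REFERENCE_COUNT.merge(port, -1, Integer::sum);
      if (remaining == 0) {
        // Last user of the port: drop the bookkeeping and shut the server down.
        PORT_TO_REFERENCE_COUNT.remove(port);
        PORT_TO_SERVER.remove(port).close();
      }
    }
  }

  boolean isServerRunning() {
    synchronized (LOCK) {
      return server.isRunning();
    }
  }
}
```

Holding one lock around both maps keeps the "look up server, bump count" and "drop count, maybe close" steps atomic with respect to each other, which is the property the thread-safety requirement is after.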
   h2. Backward Compatibility
    * Single table usage unchanged - works exactly as before
    * No breaking changes
    * Reference count goes 1→0 for single table, server stops as before
   
   h2. Use Cases
    * Multi-table Spark jobs with pause/resume functionality
    * Shared Prometheus server across multiple Hudi tables
    * Better resource management for Prometheus servers
   
   h2. Testing
    * {*}Multi-table reference counting{*}: Basic 2-table scenario
    * {*}Concurrent access{*}: 10 threads creating reporters simultaneously
    * {*}Port isolation{*}: Different ports work independently
    * {*}Partial failure{*}: One table fails while others continue
    * {*}Thread safety{*}: All operations properly synchronized
    * {*}Backward compatibility{*}: Single table behavior unchanged
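
The concurrent-access case could be exercised along these lines. This is a sketch under stated assumptions, not the test in the actual patch: {{ConcurrentRefCountCheck}} and its methods are hypothetical, and it uses {{ConcurrentHashMap}} atomic operations in place of whatever synchronization the real reporter uses.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical harness: many threads acquire a reference on the same port at once. */
final class ConcurrentRefCountCheck {
  private static final ConcurrentHashMap<Integer, Integer> COUNTS = new ConcurrentHashMap<>();

  private ConcurrentRefCountCheck() {
  }

  static int acquire(int port) {
    return COUNTS.merge(port, 1, Integer::sum); // atomic increment
  }

  /** Returns the remaining count; the mapping is removed when it reaches zero. */
  static int release(int port) {
    Integer left = COUNTS.computeIfPresent(port, (p, c) -> c > 1 ? c - 1 : null);
    return left == null ? 0 : left;
  }

  static int count(int port) {
    return COUNTS.getOrDefault(port, 0);
  }

  /** Starts the given number of threads, releases them together, and returns the final count. */
  static int acquireConcurrently(int port, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(threads);
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        try {
          start.await(); // line all workers up so the acquires overlap
          acquire(port);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          done.countDown();
        }
      });
    }
    start.countDown();
    try {
      if (!done.await(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    } finally {
      pool.shutdownNow();
    }
    return count(port);
  }
}
```

Calling {{acquireConcurrently(port, 10)}} on an unused port should leave a count of 10 if no increments are lost; ten {{release}} calls then bring it back to zero.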
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-9793
   - Type: Improvement


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
