swaminathanmanish opened a new pull request, #18517:
URL: https://github.com/apache/pinot/pull/18517

   ## Summary
   
   The `MinionSubTaskHighWaitTime` alert does not self-resolve after the
   minion queue drains because it is based on
   `ControllerTimer.SUBTASK_WAITING_TIME`, a histogram. The histogram's `_Max`
   retains its peak value across emission cycles and does not decay once no
   subtasks are waiting, so the alert keeps firing indefinitely.
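   
   The difference can be illustrated with a toy model. This is only a sketch
   of the behavior described above, not the metrics library's real histogram
   implementation; `PeakOnlyMax` and `OverwrittenGauge` are hypothetical names
   introduced for illustration.

   ```java
   public class PeakVersusGauge {
     /** Toy metric that only remembers the largest value it has seen (hypothetical). */
     static final class PeakOnlyMax {
       private long maxMs;
       void update(long valueMs) { maxMs = Math.max(maxMs, valueMs); }
       long max() { return maxMs; }
     }

     /** Toy gauge whose value is overwritten on every emit cycle (hypothetical). */
     static final class OverwrittenGauge {
       private long valueMs;
       void set(long newValueMs) { valueMs = newValueMs; }
       long value() { return valueMs; }
     }

     public static void main(String[] args) {
       PeakOnlyMax peak = new PeakOnlyMax();
       OverwrittenGauge gauge = new OverwrittenGauge();

       // Cycle 1: one subtask has been waiting for 5 hours.
       long fiveHoursMs = 5L * 60 * 60 * 1000;
       peak.update(fiveHoursMs);
       gauge.set(fiveHoursMs);

       // Cycle 2: the queue has drained. Nothing is recorded in the peak-only
       // metric, while the gauge is explicitly written as 0.
       gauge.set(0L);

       long thresholdMs = 14_400_000L; // 4 hours, the threshold used by the alert rule
       System.out.println("peak-only above threshold: " + (peak.max() > thresholdMs));
       System.out.println("gauge above threshold: " + (gauge.value() > thresholdMs));
     }
   }
   ```

   In this model the peak-only value stays above the threshold after the
   queue drains, while the overwritten gauge falls back to `0`.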
   
   - Add a new `MAX_SUBTASK_WAIT_TIME_MS` gauge in `ControllerGauge`
     (per-table, non-global)
   - In `TaskMetricsEmitter`, replace the timer emission with
     per-`(table, taskType)` gauge emission: the max wait time across all
     waiting subtasks, or `0` when none are waiting
   - Clean up the gauge in `removeTableTaskTypeMetrics` when a task type/table
     is retired
   - Update `TaskMetricsEmitterTest` to reflect the new metric and assert
     correct gauge values
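   
   A minimal sketch of the per-`(table, taskType)` computation, assuming a
   simplified view of subtasks. The `Subtask` class, the
   `maxWaitByTableAndType` helper, and the `table.taskType` key format are
   hypothetical and exist only for illustration; they are not the actual
   `TaskMetricsEmitter` code.

   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;

   public class MaxWaitGaugeSketch {
     /** Hypothetical, simplified view of one subtask. */
     static final class Subtask {
       final String table;
       final String taskType;
       final boolean waiting;
       final long enqueuedAtMs;

       Subtask(String table, String taskType, boolean waiting, long enqueuedAtMs) {
         this.table = table;
         this.taskType = taskType;
         this.waiting = waiting;
         this.enqueuedAtMs = enqueuedAtMs;
       }
     }

     /**
      * Returns one gauge value per (table, taskType): the longest wait among
      * waiting subtasks, or 0 when none of that pair's subtasks are waiting.
      */
     static Map<String, Long> maxWaitByTableAndType(List<Subtask> subtasks, long nowMs) {
       Map<String, Long> gauges = new LinkedHashMap<>();
       for (Subtask s : subtasks) {
         String key = s.table + "." + s.taskType; // hypothetical key format
         // Non-waiting subtasks contribute 0 so the key is still emitted.
         long waitMs = s.waiting ? Math.max(0L, nowMs - s.enqueuedAtMs) : 0L;
         gauges.merge(key, waitMs, Math::max);
       }
       return gauges;
     }

     public static void main(String[] args) {
       long nowMs = 10_000L;
       List<Subtask> subtasks = List.of(
           new Subtask("tableA", "TaskTypeX", true, 7_000L),   // waiting for 3000 ms
           new Subtask("tableA", "TaskTypeX", true, 9_000L),   // waiting for 1000 ms
           new Subtask("tableB", "TaskTypeX", false, 2_000L)); // not waiting
       maxWaitByTableAndType(subtasks, nowMs)
           .forEach((key, waitMs) -> System.out.println(key + " -> " + waitMs));
     }
   }
   ```

   Writing `0` for pairs with no waiting subtasks, rather than skipping them,
   is what lets a threshold-based alert clear on the next emit cycle.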
   
   The gauge is rewritten on every emit cycle, so it drops to `0` when the
   queue clears and the alert self-resolves. Alert rule update (separate
   config repo), firing when the wait exceeds 4 hours (14400000 ms):
   ```
   expr: max(pinot_controller_MaxSubtaskWaitTimeMs) by (exported_table, taskType) > 14400000
   ```
   
   ## Test plan
   
   - [ ] `TaskMetricsEmitterTest` updated with assertions for 
`maxSubtaskWaitTimeMs` gauge values per table (3000ms for waiting table, 0ms 
for non-waiting table)
   - [ ] Run
     `./mvnw -pl pinot-controller -am -Dtest=TaskMetricsEmitterTest test`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

