aho135 commented on PR #18598:
URL: https://github.com/apache/druid/pull/18598#issuecomment-3374783855

   > Thanks for the fix, @aho135 ! I have left some minor suggestions.
   > 
   > Could you share some screenshots where we can see stale metrics being 
reported?
   
   Thanks for the review @kfaraz!
   
   This is the `ingest/kafka/partitionLag` metric being emitted by 2 
Coordinators. The active one is emitting the proper metric (0), but the previous 
Coordinator is emitting a stale metric that doesn't get reset until we manually 
restart it. The scenario we run into is this: if a leader change occurs while 
there is lag on a topic, the old Coordinator continues to emit that stale 
lag metric. We have some alerting set up for lag, so the stale value ends up 
triggering false alarms.
   
   <img width="1645" height="422" alt="Screenshot 2025-10-06 at 1 59 05 PM" 
src="https://github.com/user-attachments/assets/25ed5bf2-0986-478c-880d-8075c0978739"
 />
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

