Re: [PR] KAFKA-16446 Log slow controller events [kafka]

via GitHub Fri, 13 Dec 2024 18:21:02 -0800


mumrah commented on PR #15622:
URL: https://github.com/apache/kafka/pull/15622#issuecomment-2542669787


   @cmccabe Thanks for taking a look!
   
   > any event that lasted longer than that was so bad, so egregious, that it should always be logged
   
   I thought about this for a while and couldn’t come up with a good fixed 
threshold. Looking at our CCloud data, some clusters average 10ms per event, so 
a 200ms event would be interesting to observe. In other clusters we see average 
event times of 100ms, so a 200ms event isn’t so interesting. That’s what led me 
to take a statistical approach.
   
   However, we could definitely add an “always log above this threshold” check 
as a separate thing (with a unique log line).
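
To make the two ideas concrete, here is a minimal sketch, not the code in this PR. It assumes a fixed-size window of recent durations and a nearest-rank p99; the class name `SlowEventSketch`, the log wording, and the window-based percentile are all hypothetical stand-ins for whatever histogram the real implementation uses.

```java
import java.util.Arrays;

/** Hypothetical sketch only; names and log text are assumptions, not the PR's code. */
public class SlowEventSketch {
    private final long[] window;     // ring buffer of recent event durations (ms)
    private final long alwaysLogMs;  // fixed "always log above this" threshold
    private int count = 0;           // number of valid samples in the window
    private int next = 0;            // next slot to overwrite

    public SlowEventSketch(int windowSize, long alwaysLogMs) {
        this.window = new long[windowSize];
        this.alwaysLogMs = alwaysLogMs;
    }

    /** Nearest-rank p99 of the recorded samples; Long.MAX_VALUE if there are none. */
    public long p99() {
        if (count == 0) return Long.MAX_VALUE;
        long[] sorted = Arrays.copyOf(window, count);
        Arrays.sort(sorted);
        int rank = (99 * count + 99) / 100;  // ceil(0.99 * count) in integer math
        return sorted[rank - 1];
    }

    /** Returns a log line if the event counts as slow, else null; then records it. */
    public String observe(String name, long durationMs) {
        String line = null;
        if (durationMs >= alwaysLogMs) {
            // Unconditional threshold, with its own distinct log line.
            line = "Exceptionally slow controller event " + name + " took " + durationMs + " ms";
        } else if (count == window.length && durationMs > p99()) {
            // Statistical threshold, applied only once the window is full.
            line = "Slow controller event " + name + " took " + durationMs + " ms";
        }
        window[next] = durationMs;
        next = (next + 1) % window.length;
        count = Math.min(count + 1, window.length);
        return line;
    }
}
```

With a window full of 10ms events, a 200ms event exceeds p99 and is reported. With a window spread from 5ms to 500ms, the same 200ms event is not, but anything at or above the fixed threshold is still reported with the separate line.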
   
   > Question, though: why can’t we set the logging interval to 60 seconds and just log the longest event unconditionally?
   
   We could, though that could make finding some rare event a bit more 
difficult. Also, if we had a burst of slow events, we would only log one 
instead of all that were above p99 (rare, but possible due to the histogram 
behavior).
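
For comparison, the "log the longest event per interval" alternative can be sketched as below. This is an illustration under assumptions, not proposed code: `IntervalMaxSketch`, `flush()`, and the log wording are hypothetical, and the caller is assumed to invoke `flush()` once per logging interval (for example every 60 seconds).

```java
/** Hypothetical sketch of the interval-max alternative; not code from this PR. */
public class IntervalMaxSketch {
    private String slowestName = null;  // slowest event seen in the current interval
    private long slowestMs = -1;

    /** Records an event, keeping only the slowest one seen so far. */
    public void observe(String name, long durationMs) {
        if (durationMs > slowestMs) {
            slowestMs = durationMs;
            slowestName = name;
        }
    }

    /** Returns one log line for the interval (null if no events), then resets. */
    public String flush() {
        if (slowestName == null) return null;
        String line = "Slowest controller event in interval: "
                + slowestName + " took " + slowestMs + " ms";
        slowestName = null;
        slowestMs = -1;
        return line;
    }
}
```

If three slow events of 900ms, 950ms, and 920ms arrive in one interval, `flush()` reports only the 950ms one, which is the burst case described above.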
   
   > Perhaps call it EventPerformanceMonitor?
   
   Seems fine to me. Like you said, we could evolve this to capture more stuff 
in the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
