mumrah commented on PR #15622: URL: https://github.com/apache/kafka/pull/15622#issuecomment-2542669787
@cmccabe Thanks for taking a look!

> any event that lasted longer than that was so bad, so egregious, that it should always be logged

I thought about this for a while and couldn’t come up with a good threshold. Looking at our CCloud data, some clusters run at 10ms average event times, so an event of 200ms would be interesting to observe. On other clusters we are seeing 100ms average event times, so 200ms isn’t so interesting. That’s what led me to take a statistical approach. However, we could definitely add an “always log above this threshold” check as a separate thing (with a unique log line).

> Question, though: why can’t we set the logging interval to 60 seconds and just log the longest event unconditionally?

We could, though that could make finding some rare event a bit more difficult. Also, if we had a burst of slow events, we would log only one of them instead of all that were above p99 (rare, but possible due to the histogram behavior).

> Perhaps call it EventPerformanceMonitor?

Seems fine to me. Like you said, we could evolve this to capture more stuff in the future.
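To make the trade-off discussed in the reply concrete, here is a minimal Java sketch. It is not the code from PR #15622. The class `SlowEventSketch` and its methods `percentile`, `above`, and `longestEvent` are hypothetical. The comment mentions histogram behavior; this sketch instead computes an exact nearest-rank p99 over one interval's samples, which is a simplification. It compares three options on a burst: log everything above p99, log everything above a fixed “always log” threshold, or log only the longest event per interval.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the implementation from PR #15622.
final class SlowEventSketch {

    // Nearest-rank percentile: the smallest sample value such that at least
    // pct% of samples are <= it. Uses integer math to avoid float rounding.
    static long percentile(long[] durationsMs, int pct) {
        if (durationsMs.length == 0 || pct <= 0 || pct > 100) {
            throw new IllegalArgumentException("need samples and 0 < pct <= 100");
        }
        long[] sorted = durationsMs.clone();
        Arrays.sort(sorted);
        // rank = ceil(pct * n / 100), 1-based
        long rank = ((long) pct * sorted.length + 99) / 100;
        return sorted[(int) rank - 1];
    }

    // Every duration strictly above the threshold. Used both for the p99 check
    // and for a separate fixed "always log above X ms" check.
    static List<Long> above(long[] durationsMs, long thresholdMs) {
        List<Long> out = new ArrayList<>();
        for (long d : durationsMs) {
            if (d > thresholdMs) {
                out.add(d);
            }
        }
        return out;
    }

    // The alternative raised in review: report only the longest event per interval.
    static long longestEvent(long[] durationsMs) {
        return Arrays.stream(durationsMs).max().getAsLong();
    }

    public static void main(String[] args) {
        // Hypothetical 60 s interval: 995 events at 10 ms plus a burst of 5 at 250 ms.
        long[] interval = new long[1000];
        Arrays.fill(interval, 10L);
        Arrays.fill(interval, 995, 1000, 250L);

        long p99 = percentile(interval, 99);
        System.out.println("p99 = " + p99 + " ms");                                       // p99 = 10 ms
        System.out.println("above p99: " + above(interval, p99).size() + " events");        // above p99: 5 events
        System.out.println("longest only: " + longestEvent(interval) + " ms");             // longest only: 250 ms
        System.out.println("above 200 ms: " + above(interval, 200L).size() + " events");   // above 200 ms: 5 events
    }
}
```

In this made-up interval, the p99 check reports all five burst events, while the longest-event approach reports a single 250 ms event. A fixed 200 ms cutoff also catches the burst here, but the same cutoff would flag unremarkable events on a cluster that averages 100 ms. That is the reason given above for preferring a percentile, with the absolute threshold kept as a separate, differently worded log line.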
