Jiao Zhang created KAFKA-9616:
---------------------------------
Summary: Add new metrics to get total response time with throttle
time subtracted
Key: KAFKA-9616
URL: https://issues.apache.org/jira/browse/KAFKA-9616
Project: Kafka
Issue Type: Improvement
Components: core
Affects Versions: 1.1.0
Reporter: Jiao Zhang
We are using these RequestMetrics for our cluster monitoring
[https://github.com/apache/kafka/blob/fb5bd9eb7cdfdae8ed1ea8f68e9be5687f610b28/core/src/main/scala/kafka/network/RequestChannel.scala#L364]
and have configured our AlertManager to fire alerts if the 99th percentile of
'TotalTimeMs' exceeds the threshold value. This alert is very important because
it notifies cluster administrators of bad situations, for example when a server
drops out of the cluster or loses leadership.
But we sometimes suffer from false alerts. We set quotas such as
'producer_byte_rate' for some clients, so when requests from these clients are
throttled, 'ThrottleTimeMs' is long. Sometimes the throttling alone pushes
'TotalTimeMs' over the threshold value and the alert is triggered. As a result
we have to spend time checking the details of false alerts as well.
So this ticket proposes adding a new metric, 'ProcessTimeMs', whose value is
the total response time with throttle time subtracted. This metric is more
accurate and would help us notice only the truly unexpected situations.
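
To make the proposal concrete, here is a minimal sketch (in Python, not broker code) of the per-request calculation and its effect on a 99th-percentile alert. The class name RequestTiming, the helper percentile, the 500 ms threshold and the sample timings are all made up for illustration; only the formula 'ProcessTimeMs' = 'TotalTimeMs' - 'ThrottleTimeMs' comes from this ticket.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Per-request timings in milliseconds (hypothetical container)."""
    total_time_ms: float
    throttle_time_ms: float

    @property
    def process_time_ms(self) -> float:
        # Proposed metric: total response time with throttle time subtracted.
        return self.total_time_ms - self.throttle_time_ms


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Invented sample: 98 normal requests plus 2 requests that were throttled
# for 1000 ms because of a quota such as 'producer_byte_rate'.
requests = [RequestTiming(10.0, 0.0)] * 98 + [RequestTiming(1010.0, 1000.0)] * 2

THRESHOLD_MS = 500.0  # assumed alert threshold

p99_total = percentile([r.total_time_ms for r in requests], 99)
p99_process = percentile([r.process_time_ms for r in requests], 99)

print("alert on TotalTimeMs:  ", p99_total > THRESHOLD_MS)
print("alert on ProcessTimeMs:", p99_process > THRESHOLD_MS)
```

In this sample the alert on 'TotalTimeMs' fires purely because of throttling, while the same alert on the subtracted value stays quiet.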
Btw, we tried to achieve this with PromQL against the existing metrics, e.g.
Total - Throttle. But it does not work, as the two metrics seem to be
inconsistent in time. So it is better to expose a new metric from the broker side.
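
One possible contributor to the mismatch (an assumption on our side, not something we have confirmed in the broker) is that percentiles cannot simply be subtracted: the 99th percentile of 'TotalTimeMs' and the 99th percentile of 'ThrottleTimeMs' may come from different requests. The sketch below uses invented timings and a hypothetical percentile helper to show the difference between subtracting percentiles and taking the percentile of the per-request difference.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Invented (total_ms, throttle_ms) pairs:
#   96 normal requests, 2 genuinely slow requests, 2 throttled requests.
samples = [(10, 0)] * 96 + [(800, 0)] * 2 + [(810, 800)] * 2

totals = [total for total, _ in samples]
throttles = [throttle for _, throttle in samples]

# What a "Total - Throttle" query over exported percentiles approximates.
naive = percentile(totals, 99) - percentile(throttles, 99)

# What a broker-side metric would report: subtract per request, then aggregate.
exact = percentile([total - throttle for total, throttle in samples], 99)

print("p99(Total) - p99(Throttle):", naive)
print("p99(Total - Throttle):     ", exact)
```

Here the subtraction of percentiles hides the two genuinely slow requests, while the per-request value still exposes them, which is why computing the value on the broker side looks safer.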