Jiao Zhang created KAFKA-9616:
---------------------------------

             Summary: Add new metrics to get total response time with throttle 
time subtracted
                 Key: KAFKA-9616
                 URL: https://issues.apache.org/jira/browse/KAFKA-9616
             Project: Kafka
          Issue Type: Improvement
          Components: core
    Affects Versions: 1.1.0
            Reporter: Jiao Zhang


We use these RequestMetrics for our cluster monitoring:
[https://github.com/apache/kafka/blob/fb5bd9eb7cdfdae8ed1ea8f68e9be5687f610b28/core/src/main/scala/kafka/network/RequestChannel.scala#L364]

We have configured our AlertManager to fire an alert when the 99th percentile 
of 'TotalTimeMs' exceeds a threshold value. This alert is very important 
because it notifies cluster administrators of bad situations, for example when 
a server has dropped out of the cluster or lost leadership.

However, we sometimes get false alerts. We set quotas such as 
'producer_byte_rate' for some clients. When requests from these clients are 
throttled, 'ThrottleTimeMs' becomes long, and the throttle delay alone can push 
'TotalTimeMs' over the threshold and trigger the alert. As a result, we also 
have to spend time investigating these false alerts.

So this ticket proposes adding a new metric, 'ProcessTimeMs', whose value is 
the total response time with the throttle time subtracted. This metric is more 
accurate for our purpose and would let us alert only on genuinely unexpected 
situations.
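
A minimal sketch of the proposed value, computed per request. The 
RequestTiming class and its field names are hypothetical and only illustrate 
the arithmetic; they are not Kafka's classes, and clamping at zero is an 
assumption of this sketch, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTiming:
    """Hypothetical per-request timing record (not a Kafka class)."""

    total_time_ms: float
    throttle_time_ms: float

    @property
    def process_time_ms(self) -> float:
        # Proposed value: total response time with throttle time subtracted.
        # Clamping at zero is an assumption to guard against rounding.
        return max(0.0, self.total_time_ms - self.throttle_time_ms)


# A throttled request looks slow by total time but not by process time.
throttled = RequestTiming(total_time_ms=1250.0, throttle_time_ms=1200.0)
unthrottled = RequestTiming(total_time_ms=40.0, throttle_time_ms=0.0)
```

With an alert threshold of, say, 1000 ms, the throttled request above would 
exceed it on total time but stay well below it on process time.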

Btw, we tried to achieve this with PromQL against the existing metrics, e.g. 
Total - Throttle. But it does not work, as these two metrics seem to be 
inconsistent in time. So it would be better to expose a new metric from the 
broker side.
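
Apart from timing, one possible reason (an assumption, not something verified 
in this ticket) is that percentiles are not additive: subtracting the 99th 
percentile of throttle time from the 99th percentile of total time is not the 
99th percentile of the per-request difference. A small sketch with made-up 
numbers and a simple nearest-rank percentile:

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct*n/100
    return ordered[max(rank, 1) - 1]


# Made-up sample of 100 requests as (total_ms, throttle_ms):
# 93 fast, 2 genuinely slow and unthrottled, 5 fast but throttled.
requests = (
    [(20.0, 0.0)] * 93
    + [(500.0, 0.0)] * 2
    + [(1020.0, 1000.0)] * 5
)

totals = [total for total, _ in requests]
throttles = [throttle for _, throttle in requests]
process = [total - throttle for total, throttle in requests]

# Subtracting the two percentiles, as a dashboard query would.
p99_difference = percentile(totals, 99) - percentile(throttles, 99)

# Percentile of the per-request difference, as a broker-side metric would.
p99_process = percentile(process, 99)
```

In this sample the subtracted percentiles give 20 ms, which hides the two 
genuinely slow requests, while the per-request value gives 500 ms.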



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
