[ https://issues.apache.org/jira/browse/KAFKA-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346777#comment-16346777 ]

Per Steffensen commented on KAFKA-6505:
---------------------------------------

This ticket was labeled "needs-kip". I guess adding metrics alone does not 
require a KIP, but a change in strategy from "expose advanced metrics" to 
"expose simple raw metrics" does?

Should I open such a KIP?

If we just make this ticket about introducing the following metrics, will it 
still require a KIP?
 * offset-commit-attempts (number of offset-commits attempted (successful or 
failing) since startup)
 * offset-commit-failures (number of offset-commits that failed since startup)
 * offset-commit-duration (total time in ms spent on offset-commits 
(successful or failing) since startup)
 * offset-commit-failures-duration (total time in ms spent on failing 
offset-commits since startup)

Note that the number of successful offset-commits can be calculated as 
offset-commit-attempts minus offset-commit-failures, and that the time spent on 
those can be calculated as offset-commit-duration minus 
offset-commit-failures-duration.
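To make the derivation concrete: assuming the metrics are scraped into 
Prometheus under the underscored names used in the PromQL examples further 
down, those derived values would simply be
{code:java}
# successful offset-commits since startup
offset_commit_attempts - offset_commit_failures

# total time in ms spent on successful offset-commits since startup
offset_commit_duration - offset_commit_failures_duration{code}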

For the KIP:

I believe that what people are doing today is setting up a continuous poll of 
metrics (e.g. using Prometheus), and that "advanced" metrics will be derived 
outside Kafka/Kafka-Connect/Kafka-clients based on the collected "raw" metrics.

E.g. set up Prometheus to pull metrics from a Kafka-Connect instance every 30 
seconds. Example of what has been polled (stored in Prometheus):
||Poll timestamp||offset-commit-attempts||offset-commit-failures||offset-commit-duration||offset-commit-failures-duration||Note||
|1/1-2018 10:00:00|10|1|1000|900|At this point the application has been running for a while, doing 10 offset-commits, of which 9 succeeded and 1 failed. The 9 successful ones took a total of 100 ms, while the failing one took 900 ms|
|1/1-2018 10:00:30|11|1|1050|900|Within the 30 secs one successful offset-commit was run, spending 50 ms|
|1/1-2018 10:01:00|13|2|1500|1300|Within the 30 secs two offset-commits were run - one successful (50 ms) and one failing (400 ms)|
|1/1-2018 10:01:30|13|2|1500|1300|Within the 30 secs no offset-commits were run|
|1/1-2018 10:02:00|14|2|1530|1300|Within the 30 secs one successful offset-commit was run, spending 30 ms|

Now if you use PromQL, e.g. through Grafana (which makes very nice 
presentations of the data), you can do some of the more advanced stuff:
{code:java}
idelta(offset_commit_failures[1m]){code}
Will show a graph (or whatever) of the number of failing offset-commits "here 
and now" over time: 0 at 10:00:30, 1 at 10:01:00, 0 at 10:01:30 and 0 at 
10:02:00.
{code:java}
100 * (delta(offset_commit_failures[1m]) / 
delta(offset_commit_attempts[1m])){code}
Will show a graph (or whatever) of the percentage of offset-commits that failed 
over the "last" (relative to the time in the graph) minute. E.g. at 10:01:30 it 
will show 100 * ((2-1) / (13-11)) = 100 * (1 / 2) = 50. You can easily make it 
"over the last 5 minutes" or whatever.
{code:java}
delta(offset_commit_duration[5m]) / 1000{code}
Will show a graph (or whatever) of the number of seconds spent doing 
offset-commits (failing or succeeding) over the last 5 minutes (300 secs).
{code:java}
delta(offset_commit_duration[5m]) / delta(offset_commit_attempts[5m]){code}
Will show a graph (or whatever) of the average number of milliseconds used 
doing offset-commits (failing or succeeding) over the last 5 minutes.
{code:java}
100 * (delta(offset_commit_failures_duration[5m]) / 
delta(offset_commit_duration[5m])){code}
Will show a graph (or whatever) of the percentage of the time spent doing 
offset-commits that was spent on failing offset-commits, within the last 5 
minutes.

You can do a lot of such stuff at presentation-time based on the raw numbers - 
I just showed a very small fraction of what you can express in PromQL. If you 
do the advanced calculations in the metrics themselves, you will be limiting 
the flexibility of the end administrator, by taking decisions for him about 
what he wants to see.

> Add simple raw "offset-commit-failures", "offset-commits" and 
> "offset-commit-successes" count metric
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6505
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6505
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions: 1.0.0
>            Reporter: Per Steffensen
>            Priority: Minor
>              Labels: needs-kip
>
> MBean 
> "kafka.connect:type=connector-task-metrics,connector=<connector-name>,task=x" 
> has several attributes. Most of them seem to be avg/max/pct over the entire 
> lifetime of the process. They are not very useful when monitoring a system, 
> where you typically want to see when there have been problems and if there 
> are problems right now.
> E.g. I would like to expose to an administrator when offset-commits have been 
> failing (e.g. timing out), including whether they are failing right now. It 
> is really hard to do that properly using just the 
> "offset-commit-failure-percentage" attribute. You can expose a number telling 
> how much the percentage has changed between two consecutive polls of the 
> metric - if it changed to the positive side, we saw offset-commit failures, 
> and if it changed to the negative side (or is stable at 0) we saw 
> offset-commit successes - at least as long as the system has not been running 
> for so long that a single failing offset-commit does not even change the 
> percentage. But it is really odd to do it this way.
> *I would just like to see an attribute "offset-commit-failures" counting how 
> many offset-commits have failed, as an ever-increasing number. Maybe also 
> attributes "offset-commits" and "offset-commit-successes". Then I can do a 
> delta between the last two metric-polls to show how many 
> offset-commit-attempts have failed "very recently". Let this ticket be about 
> that particular added attribute (or the three added attributes).*
> Just a note on metrics IMHO (should probably be posted somewhere else):
> In general consider getting rid of stuff like avg, max, pct over the entire 
> lifetime of the process - current state is what interests people, especially 
> when it comes to failure-related metrics (failure-pct over the lifetime of 
> the process is not very useful). And people will continuously be polling and 
> storing the metrics, so we will have a history of "current state" somewhere 
> else (e.g. in Prometheus). Just give us the raw counts. Modern monitoring 
> tools can do all the avg, max, pct for you based on a time-series of 
> metrics-poll-results - and they can do it for periods of your choice (e.g. 
> average over the last minute or 5 minutes) - have a look at Prometheus PromQL 
> (e.g. used through Grafana). Just expose the raw numbers and let the 
> average/max/min/pct calculation be done on the collect/presentation side. 
> Only do "advanced" stuff for cases that are very interesting and where it 
> cannot be done based on simple raw numbers (e.g. percentiles), and consider 
> whether doing it for fairly short intervals is better than for the entire 
> lifetime of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
