Steven Blumenthal created KAFKA-8936:
----------------------------------------
Summary: Connect metrics have a chance to disappear on rebalance
Key: KAFKA-8936
URL: https://issues.apache.org/jira/browse/KAFKA-8936
Project: Kafka
Issue Type: Bug
Components: KafkaConnect, metrics
Affects Versions: 2.0.0
Reporter: Steven Blumenthal
We encountered an interesting problem with our Connect cluster. At times,
seemingly at random, some Connect sink task metrics would disappear from
Datadog (which is where we send these metrics). After some investigation, I
noticed that the metrics in question weren't being reported by the Connect
servers themselves.
After some more investigation, I noticed that the metrics stopped reporting
after a rebalance was triggered. Our logs were filled with "Graceful stop of
task ... failed". I dug into the code to understand what happens in this case,
and this error appears to mean that stopping the tasks timed out for whatever
reason, and the Connect cluster will no longer wait for them to stop. They will
still stop eventually, but in the meantime new tasks can be spun up.
([Worker.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L587],
which calls
[WorkerTask.java:cancel()|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L120])
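To make the timeout behaviour described above concrete, here is a minimal sketch in plain Java. It is not the Connect worker code: the class name GracefulStopTimeout, the method slowTaskStopsInTime, and the use of a CountDownLatch are all assumptions made for illustration. It only shows the general pattern of waiting a bounded time for a task to stop and then moving on while the task finishes in the background.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GracefulStopTimeout {

    // Hypothetical helper: returns true if the simulated task finished its
    // shutdown within timeoutMs, false if the caller gave up waiting.
    static boolean slowTaskStopsInTime(long shutdownWorkMs, long timeoutMs) {
        CountDownLatch stopped = new CountDownLatch(1);
        Thread task = new Thread(() -> {
            try {
                Thread.sleep(shutdownWorkMs); // simulated slow shutdown work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                stopped.countDown(); // the task has now really stopped
            }
        });
        task.start();
        try {
            // Bounded wait: after timeoutMs the caller stops waiting, but the
            // task thread keeps running until its shutdown work completes.
            boolean graceful = stopped.await(timeoutMs, TimeUnit.MILLISECONDS);
            task.join(); // only so that this demo exits cleanly
            return graceful;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("fast task graceful: " + slowTaskStopsInTime(0, 5000));
        System.out.println("slow task graceful: " + slowTaskStopsInTime(300, 20));
    }
}
```

In the slow case the wait gives up first, which corresponds to the window in which a replacement task could already be started while the old one is still shutting down.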
So, new tasks are being spun up, and begin consuming records and doing work.
Then, at some point, the old task is removed, and the very last thing that
happens when the old task is removed is that the metric group associated with
that task is removed.
([WorkerTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L232]
which, in this case, calls
[WorkerSinkTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L179])
The issue with this is that task-based metrics are registered under a set of
tags that one would expect not to change at runtime, so the old and the new
instance of a task use the same tags. This means that, when the old task IS
EVENTUALLY REMOVED, it removes the metric group that the new task is using (if
the new task came up on the same Connect node that the old task was running
on).
([WorkerSinkTask.java|https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L721])
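The collision described above can be reproduced with a small, self-contained model. This is only a sketch under stated assumptions: Registry and Task are hypothetical stand-ins rather than Kafka classes, and the tag names "connector" and "task" are chosen for illustration. The one property it borrows from the report is that a metric group is identified solely by its tags.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MetricGroupCollision {

    // Hypothetical registry: a metric group is looked up by its tags alone.
    static final class Registry {
        private final Map<Map<String, String>, Map<String, Double>> groups = new HashMap<>();

        Map<String, Double> group(Map<String, String> tags) {
            return groups.computeIfAbsent(new TreeMap<>(tags), k -> new HashMap<>());
        }

        void removeGroup(Map<String, String> tags) {
            groups.remove(new TreeMap<>(tags));
        }

        boolean hasGroup(Map<String, String> tags) {
            return groups.containsKey(new TreeMap<>(tags));
        }
    }

    // Hypothetical task: registers its group on creation, removes it on close.
    static final class Task {
        private final Registry registry;
        private final Map<String, String> tags = new TreeMap<>();

        Task(Registry registry, String connector, int taskId) {
            this.registry = registry;
            tags.put("connector", connector);
            tags.put("task", Integer.toString(taskId));
            registry.group(tags).put("records-read", 0.0);
        }

        void close() {
            registry.removeGroup(tags); // removes whichever group has these tags
        }

        boolean metricsVisible() {
            return registry.hasGroup(tags);
        }
    }

    // Replays the sequence from the report: the old task's shutdown is slow,
    // the new task starts with identical tags, then the old task closes.
    static boolean newTaskMetricsSurviveLateClose() {
        Registry registry = new Registry();
        Task oldTask = new Task(registry, "my-sink", 0);
        Task newTask = new Task(registry, "my-sink", 0); // same node, same tags
        oldTask.close();                                  // late cleanup
        return newTask.metricsVisible();
    }

    public static void main(String[] args) {
        System.out.println("new task metrics visible: " + newTaskMetricsSurviveLateClose());
    }
}
```

In this model the new task's metrics are gone after the old task closes, because both tasks resolve to the same group. One possible way to avoid that in a model like this would be for close() to remove only a group that the closing task itself created; whether that fits the actual Connect code is not something this sketch can show.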
I tried increasing the "task.shutdown.graceful.timeout.ms" config by a factor
of 3; however, that did not completely remove the problem. Even if it had, it
wouldn't change the fact that a minor network blip on my Connect cluster could
force us to redeploy the code simply because metrics went missing when task
shutdowns took longer than intended.