[ 
https://issues.apache.org/jira/browse/KAFKA-8936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Blumenthal updated KAFKA-8936:
-------------------------------------
    Description: 
We encountered an interesting problem with our Connect cluster. At seemingly 
random times, some Connect sink task metrics would disappear from Datadog 
(which is where we send these metrics). After some investigation, I noticed 
that the metrics in question were no longer being reported by the Connect 
workers themselves.

After some more investigation, I noticed that the metrics stopped being 
reported after a rebalance was triggered. Our logs were filled with "Graceful 
stop of task ... failed". Digging into what happens in the code in this case, 
it appears that this error means that stopping the tasks timed out for 
whatever reason and the Connect worker stops waiting for them. They will still 
stop eventually, but in the meantime new tasks can be spun up. 
(https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L587,
 which calls 
https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L120)
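
To illustrate the behavior I mean, here is a rough, self-contained sketch in 
Java. The names (GracefulStopSketch, awaitStop, cancel, stopAndAwait) are my 
own simplified stand-ins for the Worker/WorkerTask code linked above, not the 
actual implementation:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified stand-in for the stop/cancel flow: the worker waits up to the
// graceful timeout for the task to confirm it stopped; if the wait times out,
// it logs the failure, "cancels" the task, and moves on. The old task thread
// keeps running until it eventually finishes on its own.
class GracefulStopSketch {
    static class Task {
        final CountDownLatch stopped = new CountDownLatch(1);
        volatile boolean cancelled = false;

        boolean awaitStop(long timeoutMs) throws InterruptedException {
            return stopped.await(timeoutMs, TimeUnit.MILLISECONDS);
        }

        void cancel() { cancelled = true; } // stop waiting; the thread is not interrupted here
    }

    static void stopAndAwait(Task task, long gracefulTimeoutMs) throws InterruptedException {
        if (!task.awaitStop(gracefulTimeoutMs)) {
            System.err.println("Graceful stop of task failed.");
            task.cancel();
            // The rebalance continues from here; a replacement task can be
            // started on this worker while the old task thread is still alive.
        }
    }
}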

So new tasks are spun up and begin consuming records and doing work. Then, at 
some point, the old task is finally removed, and the very last thing that 
happens during its removal is that the metric group associated with that task 
is removed. 
(https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L232
 which, in this case, calls 
https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L179)
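
As a rough sketch of that ordering (again with simplified, assumed names 
rather than the actual WorkerTask/WorkerSinkTask code), the metric cleanup 
sits at the very end of the task thread's run method, so it only executes once 
the old task finally exits its loop:

// Simplified stand-in for the task run loop: cleanup runs in a finally block,
// and removing the per-task metric group is the last step, which may happen
// long after the replacement task has already started on the same worker.
abstract class TaskRunSketch implements Runnable {
    abstract void execute();           // main poll/put loop
    abstract void closeResources();    // e.g. close the sink task's consumer
    abstract void removeTaskMetrics(); // remove the metric group registered for this task

    @Override
    public void run() {
        try {
            execute();
        } finally {
            closeResources();
            removeTaskMetrics(); // last thing the old task does before its thread exits
        }
    }
}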

The issue with this is that task-based metrics are registered under a set of 
tags that one would expect not to change at runtime. This means that when the 
old task IS EVENTUALLY REMOVED, it removes the metric group that the new task 
is using (if the new task came up on the same Connect worker that the old task 
was running on). 
(https://github.com/apache/kafka/blob/2.0.0/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L721)
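
A toy example (not Kafka code) of why that delayed cleanup is destructive: 
since metric groups are looked up purely by tags such as the connector name 
and task id, the old and new incarnations of the same task resolve to the same 
group, and whichever one removes it last removes it for both:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class MetricGroupCollision {
    // Metric groups keyed only by their tags (connector name + task id).
    static final Map<String, Object> metricGroups = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        String tags = "connector=my-connector,task=0";

        metricGroups.put(tags, new Object()); // old task registered its metric group
        metricGroups.put(tags, new Object()); // new task on the same worker registers under the same tags

        metricGroups.remove(tags);            // old task's delayed cleanup finally runs

        // The new task's group is gone as well, so its metrics stop being reported.
        System.out.println("group still registered? " + metricGroups.containsKey(tags)); // prints false
    }
}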

I tried increasing the "task.shutdown.graceful.timeout.ms" config to 3 times 
its previous value, but that did not completely remove the problem. And even 
if it had, it wouldn't change the fact that a minor network blip in our 
Connect cluster could result in us needing to redeploy simply because metrics 
went missing due to task shutdowns taking longer than intended.

> Connect metrics have a chance to disappear on rebalance
> -------------------------------------------------------
>
>                 Key: KAFKA-8936
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8936
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect, metrics
>    Affects Versions: 2.0.0
>            Reporter: Steven Blumenthal
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
