[
https://issues.apache.org/jira/browse/KAFKA-12726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17336550#comment-17336550
]
Ryanne Dolan commented on KAFKA-12726:
--------------------------------------
[~ChrisEgerton] Yeah, the problem is that we were seeing, say, 100 tasks on
each Worker, then a rebalance, then 200 tasks per Worker (as reported by the
task-count metric), with nothing to do but restart each Worker -- which, of
course, would cause further rebalances!

I'm not proposing we interrupt any threads here. I agree with you that it's
reasonable to just leak a thread if a Task impl is stuck indefinitely. But we
can leak a stuck thread while still cleaning up everything around it. I'm
proposing we continue with the WorkerTask shutdown after the grace period,
which includes removing the WorkerTask from the list of current tasks (and
thus from the task-count metric).
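
As a rough illustration of the proposal above (a minimal sketch, not Connect's
actual Worker code): wait for the task up to the grace period, then drop it
from the set of current tasks whether or not it finished, leaving any stuck
thread alone. GracefulStopSketch, RunningTask, and stopAndRemove are
hypothetical names made up for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GracefulStopSketch {
    /** Hypothetical stand-in for a running task; not Connect's WorkerTask. */
    static final class RunningTask {
        final CountDownLatch stopped = new CountDownLatch(1);

        RunningTask(Runnable body) {
            Thread thread = new Thread(() -> {
                try {
                    body.run();
                } finally {
                    stopped.countDown(); // signal that the task body has returned
                }
            });
            thread.setDaemon(true); // a leaked thread must not keep the JVM alive
            thread.start();
        }
    }

    // The "list of current tasks"; its size plays the role of the task count.
    private final Map<String, RunningTask> tasks = new ConcurrentHashMap<>();

    void start(String id, Runnable body) {
        tasks.put(id, new RunningTask(body));
    }

    int taskCount() {
        return tasks.size();
    }

    /** Returns true if the task finished within the grace period. */
    boolean stopAndRemove(String id, long graceMs) {
        RunningTask task = tasks.get(id);
        if (task == null) {
            return true;
        }
        boolean clean;
        try {
            clean = task.stopped.await(graceMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            clean = false;
        }
        // Continue with cleanup even if the task is stuck: the thread may leak,
        // but the bookkeeping (and thus the reported count) stays accurate.
        tasks.remove(id);
        return clean;
    }
}
```

The point of the sketch is only the ordering: the removal happens after the
timed wait regardless of its outcome, and no thread is ever interrupted.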
> misbehaving Task.stop() can prevent other Tasks from stopping
> -------------------------------------------------------------
>
> Key: KAFKA-12726
> URL: https://issues.apache.org/jira/browse/KAFKA-12726
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Affects Versions: 2.8.0
> Reporter: Ryanne Dolan
> Assignee: Ryanne Dolan
> Priority: Minor
>
> We've observed a misbehaving Task fail to stop in a timely manner (e.g. stuck
> in a retry loop). Despite Connect supporting a property
> task.shutdown.graceful.timeout.ms, this is currently not enforced -- tasks
> can take as long as they want to stop, and the only consequence is an error
> message.
>
> Unfortunately, Workers stop Tasks sequentially, meaning that a stuck Task can
> prevent any further Tasks from stopping. Moreover, after a rebalance, these
> lingering tasks can persist along with their replacements. For example, we've
> seen a Worker's "task-count" metric double following a rebalance.
>
> While the Connector implementation is ultimately to blame here -- a Task
> probably shouldn't loop forever in stop() -- we believe the Connect runtime
> should handle this situation more gracefully.
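
One way to keep a single stuck stop() from blocking the others, sketched under
assumptions: run every stop hook on its own thread and wait for all of them
against one shared deadline, rather than stopping tasks one after another.
This is not how the Connect runtime is implemented; ParallelStopSketch and
stopAll are hypothetical names, and graceMs merely stands in for
task.shutdown.graceful.timeout.ms.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParallelStopSketch {
    /** Runs all stop hooks concurrently; returns ids that missed the deadline. */
    static List<String> stopAll(Map<String, Runnable> stopHooks, long graceMs) {
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "task-stopper");
            t.setDaemon(true); // a stuck stop() must not keep the JVM alive
            return t;
        });

        // Kick off every stop() before waiting on any of them.
        Map<String, Future<?>> pending = new LinkedHashMap<>();
        for (Map.Entry<String, Runnable> e : stopHooks.entrySet()) {
            pending.put(e.getKey(), pool.submit(e.getValue()));
        }

        // One deadline shared by all tasks, not one grace period per task.
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(graceMs);
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, Future<?>> e : pending.entrySet()) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                e.getValue().get(remaining, TimeUnit.NANOSECONDS);
            } catch (TimeoutException ex) {
                stuck.add(e.getKey()); // still running; left alone, not cancelled
            } catch (ExecutionException ex) {
                // stop() threw: the hook has returned, so it is not stuck
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                stuck.add(e.getKey());
            }
        }
        pool.shutdown(); // accepts no new work; does not interrupt running hooks
        return stuck;
    }
}
```

With a hook that never returns listed first and a well-behaved hook listed
second, the second still gets stopped, and only the first is reported as
stuck once the shared grace period runs out.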