[ 
https://issues.apache.org/jira/browse/KAFKA-12726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17336550#comment-17336550
 ] 

Ryanne Dolan commented on KAFKA-12726:
--------------------------------------

[~ChrisEgerton] Yeah, the problem is that we were seeing, say, 100 tasks on 
each Worker, then a rebalance, then 200 tasks per Worker (as reported by the 
task-count metric), with no remedy other than restarting each Worker -- which, 
of course, would cause further rebalances!

I'm not proposing we interrupt any threads here. I agree with you that it's 
reasonable to just leak a thread if a Task impl is stuck indefinitely. But we 
can leak a stuck thread while cleaning up everything around it. I'm proposing 
we continue with the WorkerTask shutdown after the grace period, which includes 
removing the WorkerTask from the list of current tasks (and thus from the 
task-count metric).
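
To make the idea concrete, here is a rough sketch of "wait for the grace period, then clean up regardless". It is not the actual Worker/WorkerTask code: the class GracefulStopSketch, its methods, and the use of a plain Runnable as a stand-in for a task's stop() are all made up for illustration. The stuck thread is never interrupted; it is simply left behind while the bookkeeping moves on.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only: a hypothetical registry, not Connect's Worker or WorkerTask.
class GracefulStopSketch {
    // Task id -> stand-in for that task's stop() call (hypothetical).
    private final Map<String, Runnable> tasks = new ConcurrentHashMap<>();

    public void add(String id, Runnable stopAction) {
        tasks.put(id, stopAction);
    }

    // Stand-in for whatever backs the task-count metric.
    public int taskCount() {
        return tasks.size();
    }

    // Runs the stop action on its own daemon thread and waits at most graceMs.
    // Returns true if the stop action finished in time. Either way, the task is
    // removed from the registry; a stuck thread is leaked, not interrupted.
    public boolean stopAndRemove(String id, long graceMs) {
        Runnable stopAction = tasks.get(id);
        if (stopAction == null) {
            return false;
        }
        CountDownLatch stopped = new CountDownLatch(1);
        Thread stopper = new Thread(() -> {
            stopAction.run();
            stopped.countDown();
        }, "stop-" + id);
        stopper.setDaemon(true); // a leaked thread should not block JVM exit
        stopper.start();

        boolean clean = false;
        try {
            clean = stopped.await(graceMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
        } finally {
            tasks.remove(id); // cleanup proceeds even if stop() never returned
        }
        return clean;
    }
}
```

With something along these lines, a task whose stop() never returns would cost one leaked thread but would no longer be counted, and the caller could move on to the next task once the grace period expires.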

> misbehaving Task.stop() can prevent other Tasks from stopping
> -------------------------------------------------------------
>
>                 Key: KAFKA-12726
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12726
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.8.0
>            Reporter: Ryanne Dolan
>            Assignee: Ryanne Dolan
>            Priority: Minor
>
> We've observed a misbehaving Task fail to stop in a timely manner (e.g. stuck 
> in a retry loop). Despite Connect supporting a property 
> task.shutdown.graceful.timeout.ms, this is currently not enforced -- tasks 
> can take as long as they want to stop, and the only consequence is an error 
> message.
> Unfortunately, Workers stop Tasks sequentially, meaning that a stuck Task can 
> prevent any further Tasks from stopping. Moreover, after a rebalance, these 
> lingering tasks can persist along with their replacements. For example, we've 
> seen a Worker's "task-count" metric double following a rebalance.
> While the Connector implementation is ultimately to blame here -- a Task 
> probably shouldn't loop forever in stop() -- we believe the Connect runtime 
> should handle this situation more gracefully.


