[ 
https://issues.apache.org/jira/browse/KAFKA-12726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335389#comment-17335389
 ] 

Chris Egerton commented on KAFKA-12726:
---------------------------------------

[~ryannedolan] can you confirm that this has been observed with 2.8.0? This 
sounds similar to https://issues.apache.org/jira/browse/KAFKA-10792, which 
should be fixed in 2.8.0.

I'm also not sure about the statement that "Workers stop Tasks sequentially". My 
understanding is that workers _trigger_ task stops sequentially (see 
[Worker::stopTasks|https://github.com/apache/kafka/blob/f9de25f046452b2a6d916e6bca41e31d49bbdecf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L849-L855],
 which invokes 
[Worker::stopTask|https://github.com/apache/kafka/blob/f9de25f046452b2a6d916e6bca41e31d49bbdecf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L827-L847]
 consecutively for each to-be-stopped task), but we can see under the hood in 
[WorkerTask::stop and 
WorkerTask::triggerStop|https://github.com/apache/kafka/blob/f9de25f046452b2a6d916e6bca41e31d49bbdecf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerTask.java#L105-L120]
 that this should complete almost immediately, and neither 
[WorkerSinkTask|https://github.com/apache/kafka/blob/f9de25f046452b2a6d916e6bca41e31d49bbdecf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L157-L162]
 nor 
[WorkerSourceTask|https://github.com/apache/kafka/blob/f9de25f046452b2a6d916e6bca41e31d49bbdecf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L215-L219]
 override that method with anything that should block. The graceful 
shutdown period is then enforced en masse in 
[Worker::awaitStopTasks|https://github.com/apache/kafka/blob/f9de25f046452b2a6d916e6bca41e31d49bbdecf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L880-L887].
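
To make the trigger-then-await pattern described above concrete, here is a minimal sketch in plain Java. It is not the Connect runtime code: SketchTask, StopSketch, the task ids, and the 500 ms timeout are all hypothetical stand-ins, and the sketch assumes that triggering a stop only sets a flag while the graceful timeout is applied as one shared deadline across all tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a worker-managed task; not Kafka Connect's WorkerTask.
class SketchTask implements Runnable {
    final String id;
    private final boolean stuck;                 // simulates a task that ignores stop requests
    private volatile boolean stopping = false;
    private final CountDownLatch shutdownLatch = new CountDownLatch(1);

    SketchTask(String id, boolean stuck) {
        this.id = id;
        this.stuck = stuck;
    }

    // Non-blocking: only flips a flag, so calling it for many tasks in a row is cheap.
    void triggerStop() {
        stopping = true;
    }

    // Waits until the task's run loop has exited or the timeout elapses.
    boolean awaitStop(long timeoutMs) {
        try {
            return shutdownLatch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    @Override
    public void run() {
        try {
            while (stuck || !stopping) {
                Thread.sleep(10);                // pretend to do work
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            shutdownLatch.countDown();           // signal that the run loop has exited
        }
    }
}

class StopSketch {
    // Returns the ids of tasks that did not stop before the shared deadline.
    static List<String> stopAll(List<SketchTask> tasks, long gracefulTimeoutMs) {
        // Phase 1: trigger every stop first; none of these calls block.
        for (SketchTask task : tasks) {
            task.triggerStop();
        }
        // Phase 2: wait for all tasks against a single deadline, so one stuck
        // task uses up the shared budget but does not delay the other triggers.
        long deadline = System.currentTimeMillis() + gracefulTimeoutMs;
        List<String> lingering = new ArrayList<>();
        for (SketchTask task : tasks) {
            long remaining = Math.max(0, deadline - System.currentTimeMillis());
            if (!task.awaitStop(remaining)) {
                lingering.add(task.id);
            }
        }
        return lingering;
    }

    static List<String> demo() {
        List<SketchTask> tasks = Arrays.asList(
                new SketchTask("task-0", true),  // never honors the stop request
                new SketchTask("task-1", false),
                new SketchTask("task-2", false));
        for (SketchTask task : tasks) {
            Thread thread = new Thread(task);
            thread.setDaemon(true);              // let the JVM exit despite the stuck task
            thread.start();
        }
        return stopAll(tasks, 500);
    }

    public static void main(String[] args) {
        System.out.println("Tasks still running after timeout: " + demo());
    }
}
```

In this sketch the stuck task is listed first on purpose: if stops were awaited one at a time before the next was triggered, it would hold up the other two, whereas here only the stuck task should be reported as lingering.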

Given this, I'd be very curious to see a reproduction case, in case my 
understanding is incomplete or there is an edge case that is not currently 
accounted for. Perhaps we could leverage the existing 
[BlockingConnectorTest|https://github.com/apache/kafka/blob/f9de25f046452b2a6d916e6bca41e31d49bbdecf/connect/runtime/src/test/java/org/apache/kafka/connect/integration/BlockingConnectorTest.java]
 integration test to try to demonstrate where things break down?

> misbehaving Task.stop() can prevent other Tasks from stopping
> -------------------------------------------------------------
>
>                 Key: KAFKA-12726
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12726
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.8.0
>            Reporter: Ryanne Dolan
>            Assignee: Ryanne Dolan
>            Priority: Minor
>
> We've observed a misbehaving Task fail to stop in a timely manner (e.g. stuck 
> in a retry loop). Despite Connect supporting a property 
> task.shutdown.graceful.timeout.ms, this is currently not enforced -- tasks 
> can take as long as they want to stop, and the only consequence is an error 
> message.
> Unfortunately, Workers stop Tasks sequentially, meaning that a stuck Task can 
> prevent any further Tasks from stopping. Moreover, after a rebalance, these 
> lingering tasks can persist along with their replacements. For example, we've 
> seen a Worker's "task-count" metric double following a rebalance.
> While the Connector implementation is ultimately to blame here -- a Task 
> probably shouldn't loop forever in stop() -- we believe the Connect runtime 
> should handle this situation more gracefully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
