Chris Egerton created KAFKA-9051:
------------------------------------

             Summary: Source task source offset reads can block graceful shutdown
                 Key: KAFKA-9051
                 URL: https://issues.apache.org/jira/browse/KAFKA-9051
             Project: Kafka
          Issue Type: Bug
          Components: KafkaConnect
    Affects Versions: 2.2.1, 2.3.0, 2.1.1, 2.0.1, 1.1.1, 1.0.2, 2.4.0, 2.5.0
            Reporter: Chris Egerton
            Assignee: Chris Egerton


When source tasks request source offsets from the framework, this results in a 
call to 
[Future.get()|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/OffsetStorageReaderImpl.java#L79]
 with no timeout. In distributed workers, the future is blocked on a successful 
[read to the 
end|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaOffsetBackingStore.java#L136]
 of the source offsets topic, which in turn will [poll that topic 
indefinitely|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L287]
 until the latest messages for every partition of that topic have been consumed.

This normally completes in a reasonable amount of time. However, if 
connectivity between the Connect worker and the Kafka cluster is degraded or 
dropped in the middle of one of these reads, the read will block until 
connectivity is restored and the request completes successfully.
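
To illustrate the blocking behavior described above, here is a minimal, self-contained sketch contrasting an untimed wait with a bounded one. It is not the Connect code and not a proposed fix: the class name BoundedOffsetRead and the helper completesWithin are hypothetical, and a never-completed CompletableFuture stands in for an offset read that cannot reach the cluster.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Kafka Connect.
class BoundedOffsetRead {

    /**
     * Waits for the future for at most timeoutMs milliseconds.
     * Returns true if it completed in time, false if the wait timed out.
     */
    static boolean completesWithin(Future<?> future, long timeoutMs)
            throws InterruptedException, ExecutionException {
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // An untimed future.get() would still be blocked at this point.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a read that never finishes (e.g. the cluster is unreachable).
        Future<String> stuckRead = new CompletableFuture<>();
        System.out.println(completesWithin(stuckRead, 100)); // prints "false"

        // Stand-in for a read that has already finished.
        Future<String> finishedRead = CompletableFuture.completedFuture("offsets");
        System.out.println(completesWithin(finishedRead, 100)); // prints "true"
    }
}
```

Calling stuckRead.get() with no arguments in the sketch above would never return, which mirrors the situation a task is in while the offsets topic cannot be read to the end.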

If a task is stopped (due to a manual restart via the REST API, a rebalance, 
worker shutdown, etc.) while blocked on a read of source offsets during its 
{{start}} method, not only will it fail to gracefully stop, but the framework 
[will not even invoke its stop 
method|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L183]
 until its {{start}} method (and, as a result, the source offset read request) 
[has 
completed|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L202-L206].
 This prevents the task from being able to clean up any resources it has 
allocated and can lead to OOM errors, excessive thread creation, and other 
problems.
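
The ordering problem described above can be sketched with a simplified model. This is an approximation for illustration, not the actual WorkerSourceTask code: the Task interface, the Runner class, and the flag names are hypothetical. It assumes only what the description states, namely that a stop request arriving before start has returned is deferred until start completes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model; not the Connect runtime classes.
class BlockedStartSketch {

    /** Stand-in for a source task with start/stop lifecycle methods. */
    interface Task {
        void start() throws InterruptedException;
        void stop();
    }

    /** Defers task.stop() until task.start() has returned. */
    static class Runner {
        private final Task task;
        private boolean finishedStart = false;
        private boolean stopRequestedBeforeStartFinished = false;

        Runner(Task task) { this.task = task; }

        void execute() throws InterruptedException {
            task.start(); // may block, e.g. on a source offset read
            synchronized (this) {
                if (stopRequestedBeforeStartFinished) {
                    task.stop();
                    return;
                }
                finishedStart = true;
            }
        }

        synchronized void stop() {
            if (finishedStart) {
                task.stop();
            } else {
                // start() is still running, so the task's stop() is postponed.
                stopRequestedBeforeStartFinished = true;
            }
        }
    }

    /**
     * Returns {stop invoked while start was blocked, stop invoked after start was released}.
     */
    static boolean[] run(long waitMs) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch stopCalled = new CountDownLatch(1);
        Task task = new Task() {
            public void start() throws InterruptedException { release.await(); }
            public void stop() { stopCalled.countDown(); }
        };
        Runner runner = new Runner(task);
        Thread worker = new Thread(() -> {
            try {
                runner.execute();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();

        runner.stop(); // stop requested while start() is blocked
        boolean duringBlock = stopCalled.await(waitMs, TimeUnit.MILLISECONDS);

        release.countDown(); // simulate connectivity being restored
        worker.join();
        boolean afterRelease = stopCalled.getCount() == 0;
        return new boolean[] {duringBlock, afterRelease};
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = run(200);
        System.out.println("stop during blocked start: " + result[0]);
        System.out.println("stop after start returned: " + result[1]);
    }
}
```

In this model the task's stop method runs only once the latch is released, so any resources allocated before the blocking call stay allocated for as long as the block lasts.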

I've confirmed that this affects every release of Connect back through 1.0 at 
least; I've tagged the most recent bug fix release of every major/minor version 
from then on in the {{Affects Version/s}} field to avoid just putting every 
version in that field.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
