[ 
https://issues.apache.org/jira/browse/KAFKA-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Hauch resolved KAFKA-9051.
----------------------------------
    Resolution: Fixed

> Source task source offset reads can block graceful shutdown
> -----------------------------------------------------------
>
>                 Key: KAFKA-9051
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9051
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 1.0.2, 1.1.1, 2.0.1, 2.1.1, 2.3.0, 2.2.1, 2.4.0, 2.5.0
>            Reporter: Chris Egerton
>            Assignee: Chris Egerton
>            Priority: Major
>             Fix For: 2.0.2, 2.1.2, 2.2.3, 2.5.0, 2.3.2, 2.4.1
>
>
> When source tasks request source offsets from the framework, this results in 
> a call to 
> [Future.get()|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/OffsetStorageReaderImpl.java#L79]
>  with no timeout. In distributed workers, the future is blocked on a 
> successful [read to the 
> end|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaOffsetBackingStore.java#L136]
>  of the source offsets topic, which in turn will [poll that topic 
> indefinitely|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L287]
>  until the latest messages for every partition of that topic have been 
> consumed.
> This normally completes in a reasonable amount of time. However, if the 
> connectivity between the Connect worker and the Kafka cluster is degraded or 
> dropped in the middle of one of these reads, the read will block until 
> connectivity is restored and the request completes successfully.
> If a task is stopped (due to a manual restart via the REST API, a rebalance, 
> worker shutdown, etc.) while blocked on a read of source offsets during its 
> {{start}} method, not only will it fail to gracefully stop, but the framework 
> [will not even invoke its stop 
> method|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L183]
>  until its {{start}} method (and, as a result, the source offset read 
> request) [has 
> completed|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L202-L206].
>  This prevents the task from being able to clean up any resources it has 
> allocated and can lead to OOM errors, excessive thread creation, and other 
> problems.
>  
> I've confirmed that this affects every release of Connect back through 1.0 at 
> least; rather than listing every version in the {{Affects Version/s}} field, 
> I've tagged only the most recent bug fix release of every major/minor 
> version from then on.
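
For illustration only, the blocking behavior described above can be reproduced in isolation with a plain java.util.concurrent future. This is a minimal sketch and not the Connect runtime code or the committed fix: the class OffsetReadSketch and its methods are hypothetical names, and a CompletableFuture that is never completed stands in for an offset read stalled by lost connectivity. It contrasts two generic ways to avoid waiting forever: a bounded get, and an untimed get that another thread releases by cancelling the future.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names come from the Kafka code base.
class OffsetReadSketch {

    // Stand-in for an offset read that never completes (e.g. broker unreachable).
    static CompletableFuture<String> stalledRead() {
        return new CompletableFuture<>();
    }

    // Bounded wait: gives up after the timeout instead of blocking indefinitely.
    static String readWithTimeout(CompletableFuture<String> read, long millis) {
        try {
            return read.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "timeout";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed";
        }
    }

    // Untimed wait on a separate thread, released by cancelling the future
    // from the calling thread (as a stop request might do).
    static String readUntilCancelled(CompletableFuture<String> read) {
        String[] outcome = new String[1];
        Thread task = new Thread(() -> {
            try {
                outcome[0] = read.get(); // untimed get, same shape as in the report
            } catch (CancellationException e) {
                outcome[0] = "cancelled";
            } catch (InterruptedException | ExecutionException e) {
                outcome[0] = "failed";
            }
        });
        task.start();
        try {
            Thread.sleep(100);  // let the task thread reach the blocking get
            read.cancel(true);  // wakes the blocked thread with CancellationException
            task.join(5000);    // join makes the write to outcome[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
        return outcome[0];
    }

    public static void main(String[] args) {
        System.out.println(readWithTimeout(stalledRead(), 50));
        System.out.println(readUntilCancelled(stalledRead()));
    }
}
```

In this sketch the timeout variant bounds how long a caller can be stuck, while the cancellation variant keeps the untimed wait but gives a stopping thread a way to release it. Which of these (if either) suits the framework is a design question this example does not settle.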



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
