[
https://issues.apache.org/jira/browse/KAFKA-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Randall Hauch resolved KAFKA-9051.
----------------------------------
Resolution: Fixed
> Source task source offset reads can block graceful shutdown
> -----------------------------------------------------------
>
> Key: KAFKA-9051
> URL: https://issues.apache.org/jira/browse/KAFKA-9051
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Affects Versions: 1.0.2, 1.1.1, 2.0.1, 2.1.1, 2.3.0, 2.2.1, 2.4.0, 2.5.0
> Reporter: Chris Egerton
> Assignee: Chris Egerton
> Priority: Major
> Fix For: 2.0.2, 2.1.2, 2.2.3, 2.5.0, 2.3.2, 2.4.1
>
>
> When source tasks request source offsets from the framework, this results in
> a call to
> [Future.get()|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/OffsetStorageReaderImpl.java#L79]
> with no timeout. In distributed workers, the future is blocked on a
> successful [read to the
> end|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaOffsetBackingStore.java#L136]
> of the source offsets topic, which in turn will [poll that topic
> indefinitely|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L287]
> until the latest messages for every partition of that topic have been
> consumed.
> This normally completes in a reasonable amount of time. However, if
> connectivity between the Connect worker and the Kafka cluster is degraded or
> dropped in the middle of one of these reads, the read will block until
> connectivity is restored and the request completes successfully.
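To illustrate the difference between an untimed and a bounded wait on a future, here is a minimal sketch in plain Java. It is not the Connect code and not the fix that was merged for this issue; the class name `TimedOffsetRead` and the helper `readWithTimeout` are hypothetical, and a `CompletableFuture` that never completes stands in for an offset read stuck on a degraded connection.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedOffsetRead {

    // Hypothetical helper (not part of Kafka Connect): waits at most timeoutMs
    // for the read to finish. On timeout it cancels the future and returns null
    // instead of blocking the caller indefinitely, as an untimed get() would.
    static String readWithTimeout(Future<String> offsetRead, long timeoutMs)
            throws InterruptedException, ExecutionException {
        try {
            return offsetRead.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            offsetRead.cancel(true);
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a read that cannot complete while connectivity is down.
        Future<String> stuckRead = new CompletableFuture<>();
        System.out.println("stuck read result: " + readWithTimeout(stuckRead, 100));
        System.out.println("stuck read cancelled: " + stuckRead.isCancelled());

        // Stand-in for a read that completes normally.
        Future<String> healthyRead = CompletableFuture.completedFuture("offset=42");
        System.out.println("healthy read result: " + readWithTimeout(healthyRead, 100));
    }
}
```

Calling `stuckRead.get()` with no arguments in place of `readWithTimeout` would hang the calling thread for as long as the future stays incomplete, which is the behavior described above.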
> If a task is stopped (due to a manual restart via the REST API, a rebalance,
> worker shutdown, etc.) while blocked on a read of source offsets during its
> {{start}} method, not only will it fail to gracefully stop, but the framework
> [will not even invoke its stop
> method|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L183]
> until its {{start}} method (and, as a result, the source offset read
> request) [has
> completed|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L202-L206].
> This prevents the task from cleaning up any resources it has allocated,
> which can lead to OOM errors, excessive thread creation, and other
> problems.
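One general way to unblock a thread that is waiting on a future is to cancel that future from another thread. The sketch below shows that pattern in plain Java under stated assumptions: `CancellableStart`, its `start`/`stop` methods, and the `demo` helper are hypothetical names for illustration only, not Connect classes, and this is not presented as the change made for this issue.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class CancellableStart {
    // Stand-in for an offset read that never completes on its own.
    private final CompletableFuture<String> offsetRead = new CompletableFuture<>();
    private final CountDownLatch startReturned = new CountDownLatch(1);
    private volatile boolean readCancelled = false;

    // Hypothetical task start: blocks on the read with no timeout.
    void start() {
        try {
            offsetRead.get();
        } catch (CancellationException e) {
            // The pending read was cancelled, so start() can return.
            readCancelled = true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            // The read failed; nothing further to do in this sketch.
        } finally {
            startReturned.countDown();
        }
    }

    // Hypothetical stop: cancels the pending read so a blocked start() wakes up.
    void stop() {
        offsetRead.cancel(true);
    }

    // Runs start() on its own thread, stops the task, and reports whether
    // start() returned because the read was cancelled.
    static boolean demo() throws InterruptedException {
        CancellableStart task = new CancellableStart();
        Thread taskThread = new Thread(task::start);
        taskThread.start();
        task.stop();
        boolean returned = task.startReturned.await(5, TimeUnit.SECONDS);
        return returned && task.readCancelled;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("start() unblocked by cancellation: " + demo());
    }
}
```

Without the call to `stop()`, the thread running `start()` would stay parked in `get()` and any cleanup that depends on `start()` returning would never run.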
>
> I've confirmed that this affects every release of Connect back through 1.0 at
> least; I've tagged the most recent bug fix release of every major/minor
> version from then on in the {{Affects Version/s}} field to avoid just putting
> every version in that field.