[ https://issues.apache.org/jira/browse/KAFKA-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Randall Hauch resolved KAFKA-9051. ---------------------------------- Resolution: Fixed > Source task source offset reads can block graceful shutdown > ----------------------------------------------------------- > > Key: KAFKA-9051 > URL: https://issues.apache.org/jira/browse/KAFKA-9051 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect > Affects Versions: 1.0.2, 1.1.1, 2.0.1, 2.1.1, 2.3.0, 2.2.1, 2.4.0, 2.5.0 > Reporter: Chris Egerton > Assignee: Chris Egerton > Priority: Major > Fix For: 2.0.2, 2.1.2, 2.2.3, 2.5.0, 2.3.2, 2.4.1 > > > When source tasks request source offsets from the framework, this results in > a call to > [Future.get()|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/OffsetStorageReaderImpl.java#L79] > with no timeout. In distributed workers, the future is blocked on a > successful [read to the > end|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaOffsetBackingStore.java#L136] > of the source offsets topic, which in turn will [poll that topic > indefinitely|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L287] > until the latest messages for every partition of that topic have been > consumed. > This normally completes in a reasonable amount of time. However, if the > connectivity between the Connect worker and the Kafka cluster is degraded or > dropped in the middle of one of these reads, it will block until connectivity > is restored and the request completes successfully. > If a task is stopped (due to a manual restart via the REST API, a rebalance, > worker shutdown, etc.) while blocked on a read of source offsets during its > {{start}} method, not only will it fail to gracefully stop, but the framework > [will not even invoke its stop > method|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L183] > until its {{start}} method (and, as a result, the source offset read > request) [has > completed|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L202-L206]. > This prevents the task from being able to clean up any resources it has > allocated and can lead to OOM errors, excessive thread creation, and other > problems. > > I've confirmed that this affects every release of Connect back through 1.0 at > least; I've tagged the most recent bug fix release of every major/minor > version from then on in the {{Affects Version/s}} field to avoid just putting > every version in that field. -- This message was sent by Atlassian Jira (v8.3.4#803005)