[ https://issues.apache.org/jira/browse/KAFKA-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905506#comment-16905506 ]
Qinghui Xu commented on KAFKA-8790:
-----------------------------------

[~pgwhalen] Thanks for the hint, I'll have a look at your PR.

> [kafka-connect] KafkaBaseLog.WorkThread not recoverable
> -------------------------------------------------------
>
>                 Key: KAFKA-8790
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8790
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Qinghui Xu
>            Priority: Major
>
> We have a Kafka (source) connector that copies data from one Kafka
> cluster to a target cluster. The connector is deployed to a group of
> workers running on Mesos, so the lifecycle of the workers is managed by
> Mesos. Workers should be recovered by Mesos in case of failure, and the
> source tasks then rely on Kafka Connect's KafkaOffsetBackingStore to recover
> their offsets and proceed.
> Recently we witnessed an unrecoverable situation, though: a worker stops doing
> anything after a network reset on the host where it is running.
> More specifically, the Kafka Connect tasks on that worker stop
> polling the source Kafka cluster, because the consumers are stuck in a
> rebalance state.
> After some digging, we found that the thread that handles source task offset
> recovery is dead, which leaves all rebalancing tasks stuck in the state of
> reading back the offsets.
> The log we saw in our connect task:
> {code:java}
> 2019-08-12 14:29:28,089 ERROR Unexpected exception in Thread[KafkaBasedLog
> Work Thread - kc_replicator_offsets,5,main]
> (org.apache.kafka.connect.util.KafkaBasedLog)
> org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by
> times in 30001ms{code}
> As far as I can see
> ([https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L339]),
> the thread dies on error while the worker stays alive. The worker is then
> left without the thread that recovers offsets, so none of the tasks on
> that worker can recover, and they get stuck in case of failure.
>
> Ideally, the fix would be one of the following:
>  * Make the KafkaBasedLog Work Thread recoverable from errors
>  * Or make the death of the KafkaBasedLog Work Thread cause the worker to exit
> (a finally clause that calls System.exit), so that the worker lifecycle
> management (Mesos, in our case) restarts the worker elsewhere

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
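
For illustration, the two options proposed in the quoted description (recover from the error, or exit so that the worker gets restarted) could be combined roughly as below. This is only a minimal sketch and not the actual KafkaBasedLog code: the class `RecoverableWorkLoop`, the method `runWithRetry`, and its parameters are hypothetical names, and the retry count and backoff are assumptions.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch; does not reflect the real KafkaBasedLog implementation. */
public class RecoverableWorkLoop {

    /**
     * Runs one unit of work and retries on failure instead of letting the
     * thread die. If every attempt fails, onFatal is invoked so the caller
     * can decide to stop the whole process.
     * Returns null if all attempts fail or the thread is interrupted.
     */
    public static <T> T runWithRetry(Callable<T> work, int maxAttempts,
                                     long backoffMs, Runnable onFatal) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.call();
            } catch (InterruptedException e) {
                // Treat interruption as a shutdown request: restore the flag and stop.
                Thread.currentThread().interrupt();
                return null;
            } catch (Exception e) {
                // Option 1: log the error and keep the thread alive for another attempt.
                System.err.println("Attempt " + attempt + " failed: " + e);
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
        }
        // Option 2: give up and let the caller terminate the worker.
        onFatal.run();
        return null;
    }
}
```

A caller could pass something like `() -> System.exit(1)` as `onFatal`, so that an external lifecycle manager such as Mesos restarts the worker once retries are exhausted.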