[ https://issues.apache.org/jira/browse/KAFKA-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899851#comment-16899851 ]
Ivan Yurchenko commented on KAFKA-7941:
---------------------------------------

We were hit by this issue as well. In our case it leaves the Connect cluster completely inoperable: a {{WorkerCoordinator}} can't create assignments because it can't read the latest connector config from Kafka, and the coordinators get stuck in an infinite loop of

{noformat}
INFO [Worker clientId=connect-1, groupId=connect] Was selected to perform assignments, but do not have latest config found in sync request. Returning an empty configuration to trigger re-sync. (org.apache.kafka.connect.runtime.distributed.WorkerCoordinator:208)
INFO [GroupCoordinator 3]: Assignment received from leader for group connect for generation 436 (kafka.coordinator.group.GroupCoordinator)
INFO [Worker clientId=connect-1, groupId=connect] Successfully joined group with generation 436 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:455)
INFO Joined group and got assignment: Assignment{error=1, leader='connect-1-caf0b504-cb29-4456-a28d-3172cdf67d73', leaderUrl='http://test-xps7h6wknyd-3.aiven.local:8083/', offset=1, connectorIds=[], taskIds=[]} (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1216)
INFO [Worker clientId=connect-1, groupId=connect] (Re-)joining group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:491)
INFO [GroupCoordinator 3]: Preparing to rebalance group connect in state PreparingRebalance with old generation 436 (__consumer_offsets-30) (reason: Updating metadata for member connect-1-caf0b504-cb29-4456-a28d-3172cdf67d73) (kafka.coordinator.group.GroupCoordinator)
INFO [GroupCoordinator 3]: Stabilized group connect generation 437 (__consumer_offsets-30) (kafka.coordinator.group.GroupCoordinator)
{noformat}

Thank you for reporting and fixing, [~pgwhalen].

> Connect KafkaBasedLog work thread terminates when getting offsets fails
> because broker is unavailable
> -----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7941
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7941
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Paul Whalen
>            Assignee: Paul Whalen
>            Priority: Minor
>
> My team has run into this Connect bug regularly in the last six months while
> doing infrastructure maintenance that causes intermittent broker availability
> issues. I'm a little surprised it exists given how routinely it affects us,
> so perhaps someone in the know can point out if our setup is somehow just
> incorrect. My team is running 2.0.0 on both the broker and client, though
> from what I can tell from reading the code, the issue continues to exist
> through 2.2; at least, I was able to write a failing unit test that I believe
> reproduces it.
>
> When a {{KafkaBasedLog}} worker thread in the Connect runtime calls
> {{readLogToEnd}} and brokers are unavailable, the {{TimeoutException}} from
> the consumer {{endOffsets}} call is uncaught all the way up to the top level
> {{catch (Throwable t)}}, effectively killing the thread until Connect is
> restarted. The result is that Connect stops functioning entirely, with no
> indication except for that log line - tasks still show as running.
>
> The proposed fix is to simply catch and log the {{TimeoutException}},
> allowing the worker thread to retry forever.
>
> Alternatively, perhaps there is not an expectation that Connect should be
> able to recover following broker unavailability, though that would be
> disappointing. I would at least hope for a louder failure than the
> single {{ERROR}} log.
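
For readers who want to see the shape of the fix proposed in the issue (catch and log the {{TimeoutException}}, then let the worker thread retry), here is a minimal, self-contained sketch. It is not the actual patch and does not use the Kafka client: the nested {{TimeoutException}} is a stand-in for {{org.apache.kafka.common.errors.TimeoutException}}, and {{readToEndWithRetry}}, the {{Supplier}} argument, and the fixed backoff are hypothetical names and choices made only for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class ReadToEndRetrySketch {

    /** Stand-in for org.apache.kafka.common.errors.TimeoutException (assumption: unchecked, as in Kafka). */
    static class TimeoutException extends RuntimeException {
        TimeoutException(String message) {
            super(message);
        }
    }

    /**
     * Hypothetical helper: keeps calling readToEnd until it succeeds.
     * A timeout is caught and logged instead of propagating up to a
     * top-level catch (Throwable t) that would end the worker thread.
     * The fixed backoff between attempts is an assumption of this sketch.
     */
    static <T> T readToEndWithRetry(Supplier<T> readToEnd, long backoffMs) throws InterruptedException {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                return readToEnd.get();
            } catch (TimeoutException e) {
                System.err.println("Timed out reading log to end (attempt " + attempt
                        + "), will retry: " + e.getMessage());
                Thread.sleep(backoffMs);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated endOffsets call: fails twice as if brokers were down, then succeeds.
        AtomicInteger calls = new AtomicInteger();
        Supplier<Long> flakyEndOffsets = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new TimeoutException("broker unavailable");
            }
            return 42L;
        };

        long endOffset = readToEndWithRetry(flakyEndOffsets, 10);
        System.out.println("end offset " + endOffset + " after " + calls.get() + " attempts");
    }
}
```

Only the timeout is caught here; any other exception still propagates, so unrelated failures stay as visible as they were before.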