Paul Whalen created KAFKA-7941:
----------------------------------
Summary: Connect KafkaBasedLog work thread terminates when getting
offsets fails because broker is unavailable
Key: KAFKA-7941
URL: https://issues.apache.org/jira/browse/KAFKA-7941
Project: Kafka
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Paul Whalen
Assignee: Paul Whalen
My team has run into this Connect bug regularly in the last six months while
doing infrastructure maintenance that causes intermittent broker availability
issues. I'm a little surprised it exists given how routinely it affects us, so
perhaps someone in the know can point out if our setup is somehow just
incorrect. My team is running 2.0.0 on both the broker and client, though from
what I can tell from reading the code, the issue continues to exist through
2.2; at least, I was able to write a failing unit test that I believe
reproduces it.
When a {{KafkaBasedLog}} work thread in the Connect runtime calls
{{readToLogEnd}} and brokers are unavailable, the {{TimeoutException}} from the
consumer {{endOffsets}} call goes uncaught all the way up to the top-level
{{catch (Throwable t)}}, effectively killing the thread until Connect is
restarted. The result is that Connect stops functioning entirely, with no
indication other than the single log line from that catch block; tasks still
show as running.
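To illustrate the failure mode, here is a minimal standalone sketch. It does not use the Kafka classes; {{DeadWorkThreadSketch}}, {{iterationsBeforeThreadDies}}, and the log text are made up for illustration. It only shows how a loop guarded by a single outer {{catch (Throwable t)}} ends for good on the first exception:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a background work thread; not the actual Connect code.
class DeadWorkThreadSketch {

    /**
     * Runs a work loop whose only error handling is one outer catch (Throwable).
     * Returns how many requests were handled before the thread exited.
     */
    static int iterationsBeforeThreadDies(int failOnIteration, int totalRequests) {
        AtomicInteger completed = new AtomicInteger();
        Thread worker = new Thread(() -> {
            try {
                for (int i = 1; i <= totalRequests; i++) {
                    if (i == failOnIteration) {
                        // Simulates the timeout thrown while fetching end offsets.
                        throw new RuntimeException("simulated timeout fetching end offsets");
                    }
                    completed.incrementAndGet();
                }
            } catch (Throwable t) {
                // The exception is logged once, then the loop is gone:
                // nothing re-enters it, so remaining requests are never served.
                System.err.println("ERROR unexpected exception in work thread: " + t.getMessage());
            }
        }, "work-thread-sketch");
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("handled " + iterationsBeforeThreadDies(3, 10) + " of 10 requests");
    }
}
```

In the sketch the caller can at least observe that the thread has exited; in Connect nothing surfaces that fact, which is why the tasks keep reporting as running.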
The proposed fix is to simply catch and log the {{TimeoutException}}, allowing
the worker thread to retry forever.
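A rough sketch of what I have in mind, written against stand-in types so it runs without Kafka on the classpath. {{BrokerTimeoutException}} stands in for the consumer's {{TimeoutException}}, and {{readToEndWithRetry}} and the backoff parameter are hypothetical names, not existing Connect code. The actual patch may well look different:

```java
import java.util.function.Supplier;

// Stand-in for the consumer's TimeoutException (assumption: unchecked, as in this sketch).
class BrokerTimeoutException extends RuntimeException {
    BrokerTimeoutException(String message) {
        super(message);
    }
}

class WorkThreadRetrySketch {

    /**
     * Hypothetical helper: keeps calling readToEnd until it succeeds,
     * logging each timeout and pausing backoffMs before the next attempt.
     */
    static long readToEndWithRetry(Supplier<Long> readToEnd, long backoffMs) {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                return readToEnd.get();
            } catch (BrokerTimeoutException e) {
                // Catch and log instead of letting the exception reach the outer catch (Throwable).
                System.err.println("Timeout reading to end of log (attempt " + attempt
                        + "), will retry: " + e.getMessage());
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt so a shutdown request is not swallowed.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated broker outage: the first two calls time out, the third succeeds.
        long endOffset = readToEndWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new BrokerTimeoutException("broker unavailable");
            }
            return 42L;
        }, 10);
        System.out.println("end offset " + endOffset + " after " + calls[0] + " calls");
    }
}
```

Only the timeout is caught here; any other exception still propagates, so unrelated failures are not silently retried. Whether to retry forever or give up after some bound is open for discussion.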
Alternatively, perhaps there is not an expectation that Connect should be able
to recover following broker unavailability, though that would be disappointing.
I would at least hope for a louder failure than the single {{ERROR}} log.