José Armando García Sancio created KAFKA-15844:
--------------------------------------------------
Summary: Broker does not re-register
Key: KAFKA-15844
URL: https://issues.apache.org/jira/browse/KAFKA-15844
Project: Kafka
Issue Type: Bug
Affects Versions: 3.1.2
Reporter: José Armando García Sancio
We experienced a case where a Kafka broker lost its connection to the ZK cluster
and was not able to recreate its registration znode. Only after the broker was
restarted did the registration znode get created.
My impression is that the following code is not correct. It marks the ZK client
as connected right after constructing the ZooKeeper client object. It doesn't
wait for the session state to be marked as connected.
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)

  inWriteLock(initializationLock) {
    if (!connectionState.isAlive) {
      zooKeeper.close()
      info(s"Initializing a new session to $connectString.")
      // retry forever until ZooKeeper can be instantiated
      var connected = false
      while (!connected) {
        try {
          zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher, clientConfig)
          connected = true
        } catch {
          case e: Exception =>
            info("Error when recreating ZooKeeper, retrying after a short sleep", e)
            Thread.sleep(RetryBackoffMs)
        }
      }
    }
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
By contrast, during broker startup (that is, during construction of the
{{ZooKeeperClient}}), the client blocks until the connection state is marked as
connected.
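To illustrate the difference between the two paths, here is a minimal, self-contained sketch. It does not use the real ZooKeeper or Kafka classes: {{FakeSession}}, {{ReinitializeSketch}}, and both method names are hypothetical stand-ins, and the latch only simulates the asynchronous session establishment. It is not meant as the fix, only as a way to show why "the object was constructed" and "the session is connected" are different conditions.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Hypothetical stand-in for a ZooKeeper handle: construction returns
// immediately, and the session becomes usable only after markConnected().
final class FakeSession {
  private val connectedLatch = new CountDownLatch(1)

  // Called by the (simulated) event thread once the session is established.
  def markConnected(): Unit = connectedLatch.countDown()

  def isConnected: Boolean = connectedLatch.getCount == 0

  // Blocks until the session is connected or the timeout elapses.
  def awaitConnected(timeoutMs: Long): Boolean =
    connectedLatch.await(timeoutMs, TimeUnit.MILLISECONDS)
}

object ReinitializeSketch {
  // Mirrors the suspected behavior: treats "constructed" as "connected"
  // and reports the real session state at that moment.
  def reinitializeWithoutWaiting(create: () => FakeSession): (FakeSession, Boolean) = {
    val session = create()
    (session, session.isConnected)
  }

  // Alternative: hold back post-initialization work until the session
  // actually reports connected (or the timeout expires).
  def reinitializeAndWait(create: () => FakeSession, timeoutMs: Long): (FakeSession, Boolean) = {
    val session = create()
    (session, session.awaitConnected(timeoutMs))
  }
}
```

With a session that connects shortly after construction, the first method typically returns while the session is still disconnected, whereas the second returns only once it is connected.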
The controller sees the broker go offline:
{code:java}
INFO [Controller id=32] Newly added brokers: , deleted brokers: 37, bounced brokers: , all live brokers: ...{code}
ZK session is lost in broker 37:
{code:java}
[Broker=37] WARN Client session timed out, have not heard from server in 3026ms for sessionid 0x504b9c08b5e0025
...
INFO [ZooKeeperClient ACL authorizer] Session expired.
...
INFO [ZooKeeperClient ACL authorizer] Initializing a new session to ...
...
[Broker=37] INFO Session establishment complete on server ..., sessionid = 0x604dd0ad7180045, negotiated timeout = 18000{code}
Unfortunately, we never see the broker recreate the broker registration znode.
We never see the following line in the logs:
{code:java}
Creating $path (is it secure? $isSecure){code}
My best guess is that some of the Kafka threads (for example, the controller
threads) are blocked on the ZK client. Unfortunately, I don't have a thread dump
of the process at the time of the issue.
Restarting broker 37 resolved the issue.