José Armando García Sancio created KAFKA-15844:
--------------------------------------------------

Summary: Broker doesn't re-register
Key: KAFKA-15844
URL: https://issues.apache.org/jira/browse/KAFKA-15844
Project: Kafka
Issue Type: Bug
Affects Versions: 3.1.2
Reporter: José Armando García Sancio

We experienced a case where a Kafka broker lost its connection to the ZK cluster and was not able to recreate the registration znode. Only after the broker was restarted did the registration znode get created.

My impression is that the following code is not correct. This code marks the ZK client as connected right after creating the ZooKeeper client. It doesn't wait for the session state to be marked as connected.
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)

  inWriteLock(initializationLock) {
    if (!connectionState.isAlive) {
      zooKeeper.close()
      info(s"Initializing a new session to $connectString.")
      // retry forever until ZooKeeper can be instantiated
      var connected = false
      while (!connected) {
        try {
          zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher, clientConfig)
          connected = true
        } catch {
          case e: Exception =>
            info("Error when recreating ZooKeeper, retrying after a short sleep", e)
            Thread.sleep(RetryBackoffMs)
        }
      }
    }
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
During broker startup, or construction of the {{ZooKeeperClient}}, the client blocks waiting for the connection state to be marked as connected.

The controller sees the broker go offline:
{code:java}
INFO [Controller id=32] Newly added brokers: , deleted brokers: 37, bounced brokers: , all live brokers: ...
{code}
ZK session is lost in broker 37:
{code:java}
[Broker=37] WARN Client session timed out, have not heard from server in 3026ms for sessionid 0x504b9c08b5e0025
...
INFO [ZooKeeperClient ACL authorizer] Session expired.
...
INFO [ZooKeeperClient ACL authorizer] Initializing a new session to ...
...
[Broker=37] INFO Session establishment complete on server ..., sessionid = 0x604dd0ad7180045, negotiated timeout = 18000
{code}
Unfortunately, we never see the broker recreate the broker registration znode. We never see the following line in the logs:
{code:java}
Creating $path (is it secure? $isSecure)
{code}
My best guess is that some of the Kafka threads (for example, the controller threads) are blocked on the ZK client. Unfortunately, I don't have a thread dump of the process at the time of the issue.

Restarting broker 37 resolved the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
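
To illustrate the concern raised about {{reinitialize()}}, here is a minimal, self-contained sketch of what "waiting for the session state to be marked as connected" could look like. It is not Kafka code and not a proposed patch: {{SessionGate}}, its {{State}} enum, and all method names are hypothetical, and the sketch uses only the JDK so that it runs without a ZooKeeper dependency. The assumption is that the client's watcher callback would forward session state changes to {{onStateChange}}.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Kafka): records when a new session reports "connected". */
final class SessionGate {
    /** Simplified stand-in for the session states a watcher might observe. */
    enum State { CONNECTING, CONNECTED, EXPIRED }

    private volatile CountDownLatch connectedLatch = new CountDownLatch(1);

    /** Call before instantiating a new client so that no state change is missed. */
    void reset() {
        connectedLatch = new CountDownLatch(1);
    }

    /** Assumed to be invoked from the client's watcher on every session state change. */
    void onStateChange(State state) {
        if (state == State.CONNECTED) {
            connectedLatch.countDown();
        }
    }

    /**
     * Blocks until CONNECTED has been observed or the timeout elapses.
     * Returns true only if the connected state was actually seen.
     */
    boolean awaitConnected(long timeout, TimeUnit unit) {
        try {
            return connectedLatch.await(timeout, unit);
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller and report "not connected".
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Under these assumptions, a retry loop would call {{reset()}}, create the client, and treat the attempt as successful only if {{awaitConnected(...)}} returns true, rather than as soon as the constructor returns. Whether that is the right fix for this issue is an open question.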