José Armando García Sancio created KAFKA-15844:
--------------------------------------------------

             Summary: Broker does not re-register
                 Key: KAFKA-15844
                 URL: https://issues.apache.org/jira/browse/KAFKA-15844
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.1.2
            Reporter: José Armando García Sancio


We experienced a case where a Kafka broker lost its connection to the ZK cluster
and was not able to recreate its registration znode. Only after the broker was
restarted was the registration znode created.

My impression is that the following code is not correct. It marks the ZK client
as connected right after creating the ZooKeeper client; it doesn't wait for the
session state to be marked as connected.
{code:java}
    private def reinitialize(): Unit = {
      // Initialization callbacks are invoked outside of the lock to avoid deadlock potential since their completion
      // may require additional Zookeeper requests, which will block to acquire the initialization lock
      stateChangeHandlers.values.foreach(callBeforeInitializingSession _)

      inWriteLock(initializationLock) {
        if (!connectionState.isAlive) {
          zooKeeper.close()
          info(s"Initializing a new session to $connectString.")
          // retry forever until ZooKeeper can be instantiated
          var connected = false
          while (!connected) {
            try {
              zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher, clientConfig)
              connected = true
            } catch {
              case e: Exception =>
                info("Error when recreating ZooKeeper, retrying after a short sleep", e)
                Thread.sleep(RetryBackoffMs)
            }
          }
        }
      }

      stateChangeHandlers.values.foreach(callAfterInitializingSession _)
    }
{code}
In contrast, during broker startup the constructor of {{ZooKeeperClient}} blocks
until the connection state is marked as connected.
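
To illustrate the difference, here is a minimal sketch (not the actual Kafka
code and not a proposed patch) of a retry loop that treats the client as
connected only after a watcher has reported the connected state, instead of as
soon as the constructor returns. It uses only java.util.concurrent.
{{ConnectedGate}}, {{markConnected}}, {{awaitConnected}} and {{demo}} are
hypothetical names made up for this example; the background thread stands in
for the ZooKeeper watcher callback.
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the connection state tracked by the client's
// watcher. It is not part of Kafka or ZooKeeper.
final class ConnectedGate {
    private final CountDownLatch connected = new CountDownLatch(1);

    // Would be called from the watcher callback once the session reaches
    // the connected state.
    public void markConnected() {
        connected.countDown();
    }

    // Blocks until markConnected() has been called or the timeout elapses.
    // Returns true only if the connected state was actually observed.
    public boolean awaitConnected(long timeoutMs) throws InterruptedException {
        return connected.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    // Small demonstration: a background thread plays the role of the watcher
    // and reports "connected" after watcherDelayMs.
    public static boolean demo(long watcherDelayMs, long timeoutMs) throws InterruptedException {
        ConnectedGate gate = new ConnectedGate();
        Thread watcher = new Thread(() -> {
            try {
                Thread.sleep(watcherDelayMs);
                gate.markConnected();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        watcher.setDaemon(true);
        watcher.start();
        // The caller would set its own "connected" flag from this result and
        // retry (or keep waiting) when it is false.
        return gate.awaitConnected(timeoutMs);
    }
}
{code}
With this shape, a caller such as {{reinitialize()}} could loop until
{{awaitConnected}} returns true before invoking the after-initialization
callbacks. Whether that is the right fix depends on how the handlers interact
with the initialization lock, which I have not verified.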

The controller sees the broker go offline:
{code:java}
INFO [Controller id=32] Newly added brokers: , deleted brokers: 37, bounced brokers: , all live brokers: ...
{code}
ZK session is lost in broker 37:
{code:java}
[Broker=37] WARN Client session timed out, have not heard from server in 3026ms for sessionid 0x504b9c08b5e0025
...
INFO [ZooKeeperClient ACL authorizer] Session expired.
...
INFO [ZooKeeperClient ACL authorizer] Initializing a new session to ...
...
[Broker=37] INFO Session establishment complete on server ..., sessionid = 0x604dd0ad7180045, negotiated timeout = 18000
{code}
Unfortunately, we never see the broker recreate the broker registration znode. 
We never see the following line in the logs:
{code:java}
Creating $path (is it secure? $isSecure){code}
My best guess is that some of the Kafka threads (for example the controller
threads) are blocked on the ZK client. Unfortunately, I don't have a thread dump
of the process at the time of the issue.

Restarting broker 37 resolved the issue.


