[ 
https://issues.apache.org/jira/browse/KAFKA-15844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

José Armando García Sancio updated KAFKA-15844:
-----------------------------------------------
    Description: 
We experienced a case where a Kafka broker lost its connection to the ZK cluster 
and was not able to recreate its registration znode. Only after the broker was 
restarted did the registration znode get created.

The interesting observation is that the "ACL authorizer" ZK client detected 
the session loss and recreated the ZK client, but the "Kafka server" ZK client 
never received a SessionExpiredException.

Here is an example session where this happened. The controller sees the broker 
go offline:
{code:java}
INFO [Controller id=32] Newly added brokers: , deleted brokers: 37, bounced 
brokers: , all live brokers: ...{code}
"ACL authorizer" ZK session is lost and recreated in broker 37:
{code:java}
[Broker=37] WARN Client session timed out, have not heard from server in 3026ms 
for sessionid 0x504b9c08b5e0025
...
INFO [ZooKeeperClient ACL authorizer] Session expired.
...
INFO [ZooKeeperClient ACL authorizer] Initializing a new session to ...
...
[Broker=37] INFO Session establishment complete on server ..., sessionid = 
0x604dd0ad7180045, negotiated timeout = 18000{code}
Unfortunately, we never see similar logs for the "Kafka server":
{code:java}
WARN Client session timed out, have not heard from server in 14227ms for 
sessionid 0x304beeed4930026 (org.apache.zookeeper.ClientCnxn)
...
INFO Client session timed out, have not heard from server in 14227ms for 
sessionid 0x304beeed4930026, closing socket connection and attempting reconnect 
(org.apache.zookeeper.ClientCnxn)
...
WARN Client session timed out, have not heard from server in 4548ms for 
sessionid 0x304beeed4930026 (org.apache.zookeeper.ClientCnxn)
...
INFO Client session timed out, have not heard from server in 4548ms for 
sessionid 0x304beeed4930026, closing socket connection and attempting reconnect 
(org.apache.zookeeper.ClientCnxn){code}
Maybe we are running into this issue from the ZOOKEEPER-1159 discussion:
{quote}As I understand it, the problem here may be that a disconnected client 
cannot discover that its session has expired. Only the server can declare a 
session expired which on the client side leads to the SessionExpiredException, 
but only when the client is connected.
If this assumption is correct, I'm not sure how best to address it.
{quote}
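If that is what happened here, one conceivable client-side mitigation is a watchdog that measures how long the client has been disconnected and treats the session as lost once that exceeds the session timeout. The following is only a minimal sketch under that assumption. It is not what Kafka or the ZooKeeper client does; {{DisconnectWatchdog}} and its methods are hypothetical names, and the clock is injected so the logic can be exercised without a real ZK connection.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Kafka or ZooKeeper): tracks how long a
 * client has been disconnected and reports when that exceeds the session timeout.
 */
class DisconnectWatchdog {
    private final long sessionTimeoutMs;
    private final LongSupplier clockMs;
    // -1 means "currently connected"; otherwise the time of the first disconnect.
    private long disconnectedSinceMs = -1L;

    DisconnectWatchdog(long sessionTimeoutMs, LongSupplier clockMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.clockMs = clockMs;
    }

    /** Call when the client reports a disconnect; repeated calls keep the first timestamp. */
    synchronized void onDisconnected() {
        if (disconnectedSinceMs < 0) {
            disconnectedSinceMs = clockMs.getAsLong();
        }
    }

    /** Call when the client reports that it is connected again. */
    synchronized void onConnected() {
        disconnectedSinceMs = -1L;
    }

    /** True once the client has been continuously disconnected for longer than the timeout. */
    synchronized boolean sessionPresumedExpired() {
        return disconnectedSinceMs >= 0
            && clockMs.getAsLong() - disconnectedSinceMs > sessionTimeoutMs;
    }
}
```

In such a design, the caller would feed the watchdog from its connection-state callback and, once {{sessionPresumedExpired()}} returns true, close and recreate the ZK client instead of waiting for an expiration event that may never arrive. Whether that is safe depends on the server side having actually expired the session, which the client cannot confirm while disconnected.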
 

Restarting broker 37 resolved the issue.

  was:
We experienced a case where a Kafka broker lost its connection to the ZK cluster 
and was not able to recreate its registration znode. Only after the broker was 
restarted did the registration znode get created.

My impression is that the following code is not correct. This code assumes that 
the ZK client is connected right after the ZooKeeper client is created. It doesn't 
wait for the session state to be marked as connected.
{code:java}
    private def reinitialize(): Unit = {
      // Initialization callbacks are invoked outside of the lock to avoid deadlock potential since their completion
      // may require additional Zookeeper requests, which will block to acquire the initialization lock
      stateChangeHandlers.values.foreach(callBeforeInitializingSession _)
      inWriteLock(initializationLock) {
        if (!connectionState.isAlive) {
          zooKeeper.close()
          info(s"Initializing a new session to $connectString.")
          // retry forever until ZooKeeper can be instantiated
          var connected = false
          while (!connected) {
            try {
              zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher, clientConfig)
              connected = true
            } catch {
              case e: Exception =>
                info("Error when recreating ZooKeeper, retrying after a short sleep", e)
                Thread.sleep(RetryBackoffMs)
            }
          }
        }
      }
      stateChangeHandlers.values.foreach(callAfterInitializingSession _)
    }
{code}
During broker startup, or whenever a {{ZooKeeperClient}} is constructed, the 
client blocks until the connection state is marked as connected.

Here is an example session where this happened. The controller sees the broker 
go offline:
{code:java}
INFO [Controller id=32] Newly added brokers: , deleted brokers: 37, bounced 
brokers: , all live brokers: ...{code}
ZK session is lost in broker 37:
{code:java}
[Broker=37] WARN Client session timed out, have not heard from server in 3026ms 
for sessionid 0x504b9c08b5e0025
...
INFO [ZooKeeperClient ACL authorizer] Session expired.
...
INFO [ZooKeeperClient ACL authorizer] Initializing a new session to ...
...
[Broker=37] INFO Session establishment complete on server ..., sessionid = 
0x604dd0ad7180045, negotiated timeout = 18000{code}
Unfortunately, we never see the broker recreate the broker registration znode. 
We never see the following line in the logs:
{code:java}
Creating $path (is it secure? $isSecure){code}
My best guess is that some of the Kafka threads (for example the controller 
threads) are blocked on the ZK client. Unfortunately, I don't have a thread dump 
of the process at the time of the issue.

Restarting broker 37 resolved the issue.


> Broker doesn't re-register after losing ZK session
> --------------------------------------------------
>
>                 Key: KAFKA-15844
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15844
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: José Armando García Sancio
>            Priority: Major
>
> We experienced a case where a Kafka broker lost its connection to the ZK cluster 
> and was not able to recreate its registration znode. Only after the broker 
> was restarted did the registration znode get created.
> The interesting observation is that the "ACL authorizer" ZK client detected 
> the session loss and recreated the ZK client, but the "Kafka server" ZK client 
> never received a SessionExpiredException.
> Here is an example session where this happened. The controller sees the 
> broker go offline:
> {code:java}
> INFO [Controller id=32] Newly added brokers: , deleted brokers: 37, bounced 
> brokers: , all live brokers: ...{code}
> "ACL authorizer" ZK session is lost and recreated in broker 37:
> {code:java}
> [Broker=37] WARN Client session timed out, have not heard from server in 
> 3026ms for sessionid 0x504b9c08b5e0025
> ...
> INFO [ZooKeeperClient ACL authorizer] Session expired.
> ...
> INFO [ZooKeeperClient ACL authorizer] Initializing a new session to ...
> ...
> [Broker=37] INFO Session establishment complete on server ..., sessionid = 
> 0x604dd0ad7180045, negotiated timeout = 18000{code}
> Unfortunately, we never see similar logs for the "Kafka server":
> {code:java}
> WARN Client session timed out, have not heard from server in 14227ms for 
> sessionid 0x304beeed4930026 (org.apache.zookeeper.ClientCnxn)
> ...
> INFO Client session timed out, have not heard from server in 14227ms for 
> sessionid 0x304beeed4930026, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn)
> ...
> WARN Client session timed out, have not heard from server in 4548ms for 
> sessionid 0x304beeed4930026 (org.apache.zookeeper.ClientCnxn)
> ...
> INFO Client session timed out, have not heard from server in 4548ms for 
> sessionid 0x304beeed4930026, closing socket connection and attempting 
> reconnect (org.apache.zookeeper.ClientCnxn){code}
> Maybe we are running into this issue from the ZOOKEEPER-1159 discussion:
> {quote}As I understand it, the problem here may be that a disconnected client 
> cannot discover that its session has expired. Only the server can declare a 
> session expired which on the client side leads to the 
> SessionExpiredException, but only when the client is connected.
> If this assumption is correct, I'm not sure how best to address it.
> {quote}
>  
> Restarting broker 37 resolved the issue.


