[
https://issues.apache.org/jira/browse/KAFKA-15844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
José Armando García Sancio updated KAFKA-15844:
-----------------------------------------------
Labels: zookeeper (was: )
> Broker doesn't re-register after losing ZK session
> --------------------------------------------------
>
> Key: KAFKA-15844
> URL: https://issues.apache.org/jira/browse/KAFKA-15844
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.1.2
> Reporter: José Armando García Sancio
> Priority: Major
> Labels: zookeeper
>
> We experienced a case where a Kafka broker lost its connection to the ZK
> cluster and was not able to recreate its registration znode. Only after the
> broker was restarted did the registration znode get created.
> The interesting observation is that the "ACL authorizer" ZK client detected
> the lost session and recreated the ZK client, but the "Kafka server" ZK
> client never received a SessionExpiredException.
> Here is an example log sequence where this happened. The controller sees
> the broker go offline:
> {code:java}
> INFO [Controller id=32] Newly added brokers: , deleted brokers: 37, bounced
> brokers: , all live brokers: ...{code}
> The "ACL authorizer" ZK session is lost and recreated on broker 37:
> {code:java}
> [Broker=37] WARN Client session timed out, have not heard from server in
> 3026ms for sessionid 0x504b9c08b5e0025
> ...
> INFO [ZooKeeperClient ACL authorizer] Session expired.
> ...
> INFO [ZooKeeperClient ACL authorizer] Initializing a new session to ...
> ...
> [Broker=37] INFO Session establishment complete on server ..., sessionid =
> 0x604dd0ad7180045, negotiated timeout = 18000{code}
> Unfortunately, we never see similar session-expiration logs for the "Kafka
> server" ZK client; it only reports repeated timeouts and reconnect attempts:
> {code:java}
> WARN Client session timed out, have not heard from server in 14227ms for
> sessionid 0x304beeed4930026 (org.apache.zookeeper.ClientCnxn)
> ...
> INFO Client session timed out, have not heard from server in 14227ms for
> sessionid 0x304beeed4930026, closing socket connection and attempting
> reconnect (org.apache.zookeeper.ClientCnxn)
> ...
> WARN Client session timed out, have not heard from server in 4548ms for
> sessionid 0x304beeed4930026 (org.apache.zookeeper.ClientCnxn)
> ...
> INFO Client session timed out, have not heard from server in 4548ms for
> sessionid 0x304beeed4930026, closing socket connection and attempting
> reconnect (org.apache.zookeeper.ClientCnxn){code}
> We may be running into the issue described in the ZOOKEEPER-1159 discussion:
> {quote}As I understand it, the problem here may be that a disconnected client
> cannot discover that its session has expired. Only the server can declare a
> session expired which on the client side leads to the
> SessionExpiredException, but only when the client is connected.
> If this assumption is correct, I'm not sure how best to address it.
> {quote}
>
> Restarting broker 37 resolved the issue.
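
The ZOOKEEPER-1159 concern quoted above is that a disconnected client cannot learn that its session has expired. One possible mitigation is a client-side watchdog that measures how long the client has been continuously disconnected. The following is a minimal sketch under stated assumptions, not Kafka's or ZooKeeper's actual behavior: DisconnectWatchdog and its methods are hypothetical names, and the 18000 ms timeout is borrowed from the negotiated timeout in the ACL authorizer log, since the report does not show the value for the "Kafka server" client. In a real client the two callbacks would presumably be driven by the ZooKeeper Watcher events KeeperState.Disconnected and KeeperState.SyncConnected.

```java
import java.util.OptionalLong;

/**
 * Hypothetical helper, not part of Kafka or ZooKeeper. It records when the
 * current outage began and presumes the session expired once the client has
 * been disconnected for at least the negotiated session timeout.
 */
final class DisconnectWatchdog {
    private final long sessionTimeoutMs;
    // Start of the current outage; empty while the client is connected.
    private OptionalLong disconnectedSinceMs = OptionalLong.empty();

    DisconnectWatchdog(long sessionTimeoutMs) {
        if (sessionTimeoutMs <= 0) {
            throw new IllegalArgumentException("sessionTimeoutMs must be positive");
        }
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /** Records a disconnect; repeated calls during one outage keep the first timestamp. */
    synchronized void onDisconnected(long nowMs) {
        if (!disconnectedSinceMs.isPresent()) {
            disconnectedSinceMs = OptionalLong.of(nowMs);
        }
    }

    /** Clears the outage once the client reconnects to the same session. */
    synchronized void onConnected() {
        disconnectedSinceMs = OptionalLong.empty();
    }

    /** True if the client has been disconnected for at least the session timeout. */
    synchronized boolean isPresumedExpired(long nowMs) {
        return disconnectedSinceMs.isPresent()
                && nowMs - disconnectedSinceMs.getAsLong() >= sessionTimeoutMs;
    }

    public static void main(String[] args) {
        // Assumption: 18000 ms, taken from the ACL authorizer log in the report.
        DisconnectWatchdog watchdog = new DisconnectWatchdog(18_000);
        watchdog.onDisconnected(0);
        System.out.println(watchdog.isPresumedExpired(10_000)); // false
        watchdog.onDisconnected(10_000); // failed reconnect, same outage
        System.out.println(watchdog.isPresumedExpired(20_000)); // true
        watchdog.onConnected();
        System.out.println(watchdog.isPresumedExpired(40_000)); // false
    }
}
```

If the watchdog fires, a caller could close the ZooKeeper handle and open a new session, which is the recovery the "ACL authorizer" client performed after its "Session expired" log. Whether that is appropriate for the broker is a design question this report leaves open.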