Hi all,
We recently hit an issue on one of our production Kafka clusters:
 - It is a 3-node cluster
 - Kafka server version is 1.0

*Issue*:
One of the brokers ran into a problem that resulted in the following:
1. The broker lost leadership of all of its topic-partitions.
2. However, the Kafka server process did *NOT* get stopped/killed.

I have attached the exception traces from the controller and server logs
from that time at the end of this email.

*Questions:*
1. What could be the reason behind this?
2. How can we reproduce this exact scenario in our test environment?

P.S. We did not see anything unusual in the GC logs, or any general
network blips, around that time.
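On question 2, one approach we are considering (a sketch, not something we have validated yet) is to freeze the broker JVM for longer than zookeeper.session.timeout.ms (default 6000 ms in Kafka 1.0), mimicking a long stall: the process stays alive, but its ZooKeeper session expires and it should resign leadership, matching the "Session expired" traces below. `BROKER_PID` here is a placeholder; the snippet uses a throwaway `sleep` process as a stand-in, and on a real test broker you would use the Kafka JVM's pid (e.g. from `jps`) instead.

```shell
# Stand-in for the broker JVM in this sketch; on a test broker,
# set BROKER_PID to the Kafka process id instead.
sleep 60 & BROKER_PID=$!

kill -STOP "$BROKER_PID"   # frozen: stops heartbeating to ZooKeeper
sleep 7                    # exceed the 6 s default session timeout
kill -CONT "$BROKER_PID"   # resumes and observes the expired session
```

Blocking the broker's outbound traffic to the ZooKeeper port for the same duration (e.g. with iptables) should produce the same effect if pausing the JVM is not an option.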

*Exception trace in controller log:*

[2022-05-21 09:07:45,914] INFO [RequestSendThread controllerId=2] Starting
(kafka.controller.RequestSendThread)
[2022-05-21 09:07:45,914] INFO [RequestSendThread controllerId=2] Starting
(kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,354] ERROR [Controller id=2] Error while electing or
becoming controller on broker 2 (kafka.controller.KafkaController)
kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either
before or while waiting for connection
        at
kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:233)
        at
kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:221)
        at
kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:221)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
        at
kafka.zookeeper.ZooKeeperClient.kafka$zookeeper$ZooKeeperClient$$waitUntilConnected(ZooKeeperClient.scala:221)
        at
kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:215)
        at
kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply(ZooKeeperClient.scala:215)
        at
kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply(ZooKeeperClient.scala:215)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
        at
kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:214)
        at
kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1461)
        at
kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1437)
        at
kafka.zk.KafkaZkClient.getPartitionReassignment(KafkaZkClient.scala:671)
        at
kafka.controller.KafkaController.initializePartitionReassignment(KafkaController.scala:698)
        at
kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:655)
        at
kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:232)
        at
kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1203)
        at
kafka.controller.KafkaController$Reelect$.process(KafkaController.scala:1479)
        at
kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:69)
        at
kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
        at
kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
        at
kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:68)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2022-05-21 09:07:48,358] DEBUG [Controller id=2] Resigning
(kafka.controller.KafkaController)
[2022-05-21 09:07:48,358] DEBUG [Controller id=2] Unregister
BrokerModifications handler for Set(2, 1) (kafka.controller.KafkaController)
[2022-05-21 09:07:48,360] INFO [PartitionStateMachine controllerId=2]
Stopped partition state machine (kafka.controller.PartitionStateMachine)
[2022-05-21 09:07:48,361] INFO [ReplicaStateMachine controllerId=2] Stopped
replica state machine (kafka.controller.ReplicaStateMachine)
[2022-05-21 09:07:48,361] INFO [RequestSendThread controllerId=2] Shutting
down (kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Stopped
(kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Shutdown
completed (kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Shutting
down (kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Stopped
(kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Shutdown
completed (kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,365] INFO [Controller id=2] Resigned
(kafka.controller.KafkaController)


*Snippets from server log:*

[2022-05-21 09:06:36,678] INFO [Log partition=efAcquisition-18,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at
offset 252314842 in 2 ms. (kafka.log.Log)
[2022-05-21 09:06:53,106] INFO [GroupCoordinator 2]: Preparing to rebalance
group
spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0
with old generation 1 (__consumer_offsets-43)
(kafka.coordinator.group.GroupCoordinator)
[2022-05-21 09:06:53,107] INFO [GroupCoordinator 2]: Group
spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0
with generation 2 is now empty (__consumer_offsets-43)
(kafka.coordinator.group.GroupCoordinator)
[2022-05-21 09:06:59,793] INFO [GroupCoordinator 2]: Preparing to rebalance
group
spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0
with old generation 1 (__consumer_offsets-37)
(kafka.coordinator.group.GroupCoordinator)
[2022-05-21 09:06:59,793] INFO [GroupCoordinator 2]: Group
spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0
with generation 2 is now empty (__consumer_offsets-37)
(kafka.coordinator.group.GroupCoordinator)
[2022-05-21 09:07:21,154] INFO [ProducerStateManager partition=hh_topic-9]
Writing producer snapshot at offset 7837005056
(kafka.log.ProducerStateManager)
[2022-05-21 09:07:21,155] INFO [Log partition=hh_topic-9,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at
offset 7837005056 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:26,088] INFO [ProducerStateManager
partition=efGamePlay-11] Writing producer snapshot at offset 2116578569
(kafka.log.ProducerStateManager)
[2022-05-21 09:07:26,089] INFO [Log partition=efGamePlay-11,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at
offset 2116578569 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:27,170] INFO [ProducerStateManager
partition=efGamePlay-18] Writing producer snapshot at offset 2116607621
(kafka.log.ProducerStateManager)
[2022-05-21 09:07:27,171] INFO [Log partition=efGamePlay-18,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at
offset 2116607621 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:30,444] INFO [ProducerStateManager
partition=auditTopic-36] Writing producer snapshot at offset 145646657
(kafka.log.ProducerStateManager)
[2022-05-21 09:07:30,445] INFO [Log partition=auditTopic-36,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at
offset 145646657 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:33,189] INFO [ProducerStateManager
partition=heartbeats-0] Writing producer snapshot at offset 43292127
(kafka.log.ProducerStateManager)
[2022-05-21 09:07:33,190] INFO [Log partition=heartbeats-0,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at
offset 43292127 in 3 ms. (kafka.log.Log)
[2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Group
spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0
transitioned to Dead in generation 2
(kafka.coordinator.group.GroupMetadataManager)
[2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Group
spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0
transitioned to Dead in generation 2
(kafka.coordinator.group.GroupMetadataManager)
[2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Removed 0
expired offsets in 0 milliseconds.
(kafka.coordinator.group.GroupMetadataManager)
[2022-05-21 09:07:37,543] INFO [ProducerStateManager
partition=auditTopic-0] Writing producer snapshot at offset 145649350
(kafka.log.ProducerStateManager)
[2022-05-21 09:07:37,544] INFO [Log partition=auditTopic-0,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at
offset 145649350 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:39,207] INFO [ProducerStateManager
partition=taxonomyOthers-48] Writing producer snapshot at offset 162628223
(kafka.log.ProducerStateManager)
[2022-05-21 09:07:39,207] INFO [Log partition=taxonomyOthers-48,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at
offset 162628223 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:40,103] INFO Creating /controller (is it secure? false)
(kafka.zk.KafkaZkClient)
[2022-05-21 09:07:40,115] INFO Result of znode creation at /controller is:
OK (kafka.zk.KafkaZkClient)
[2022-05-21 09:07:45,915] WARN Client session timed out, have not heard
from server in 5138ms for sessionid 0x27b85a3ead1005a
(org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:45,916] INFO Client session timed out, have not heard
from server in 5138ms for sessionid 0x27b85a3ead1005a, closing socket
connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:45,933] WARN Attempting to send response via channel for
which there is no open connection, connection id 10.200.3.174:9092
-10.200.2.18:33360-5632853 (kafka.network.Processor)
[2022-05-21 09:07:45,936] WARN Attempting to send response via channel for
which there is no open connection, connection id 10.200.3.174:9092
-10.200.3.29:18284-5535253 (kafka.network.Processor)
[2022-05-21 09:07:45,995] WARN Attempting to send response via channel for
which there is no open connection, connection id 10.200.3.174:9092
-10.200.4.128:6078-5535526 (kafka.network.Processor)
[2022-05-21 09:07:45,996] WARN Attempting to send response via channel for
which there is no open connection, connection id 10.200.3.174:9092
-10.200.4.129:6664-5535480 (kafka.network.Processor)
[2022-05-21 09:07:46,028] INFO [ZooKeeperClient] Waiting until connected.
(kafka.zookeeper.ZooKeeperClient)
[2022-05-21 09:07:46,028] INFO [ZooKeeperClient] Connected.
(kafka.zookeeper.ZooKeeperClient)
[2022-05-21 09:07:46,902] INFO Opening socket connection to server 10.200/
10.0.0.200:2181. Will not attempt to authenticate using SASL (unknown
error) (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,027] WARN Client session timed out, have not heard
from server in 2001ms for sessionid 0x27b85a3ead1005a
(org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,027] INFO Client session timed out, have not heard
from server in 2001ms for sessionid 0x27b85a3ead1005a, closing socket
connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,129] INFO [ZooKeeperClient] Waiting until connected.
(kafka.zookeeper.ZooKeeperClient)
[2022-05-21 09:07:48,294] INFO Opening socket connection to server
10.200.3.174/10.200.3.174:2181. Will not attempt to authenticate using SASL
(unknown error) (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,294] INFO Socket connection established to
10.200.3.174/10.200.3.174:2181, initiating session
(org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,313] WARN Unable to reconnect to ZooKeeper service,
session 0x27b85a3ead1005a has expired (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,314] INFO Unable to reconnect to ZooKeeper service,
session 0x27b85a3ead1005a has expired, closing socket connection
(org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,314] INFO EventThread shut down for session:
0x27b85a3ead1005a (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,318] INFO [ZooKeeperClient] Session expired.
(kafka.zookeeper.ZooKeeperClient)
[2022-05-21 09:07:48,396] INFO [ProducerStateManager
partition=efGamePlay-41] Writing producer snapshot at offset 2116645612
(kafka.log.ProducerStateManager)
[2022-05-21 09:07:48,397] INFO [Log partition=efGamePlay-41,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at
offset 2116645612 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:49,452] INFO [ProducerStateManager
partition=efAcquisition-47] Writing producer snapshot at offset 252224167
(kafka.log.ProducerStateManager)
[2022-05-21 09:07:49,453] INFO [Log partition=efAcquisition-47,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at
offset 252224167 in 3 ms. (kafka.log.Log)
[2022-05-21 09:07:57,475] INFO [ProducerStateManager
partition=efGamePlay-5] Writing producer snapshot at offset 2116617455
(kafka.log.ProducerStateManager)
[2022-05-21 09:07:57,476] INFO [Log partition=efGamePlay-5,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at
offset 2116617455 in 2 ms. (kafka.log.Log)
[2022-05-21 09:08:00,642] INFO [Partition prism_sqldb_topic-55 broker=2]
Shrinking ISR from 2,1 to 2 (kafka.cluster.Partition)
[2022-05-21 09:08:00,782] WARN Attempting to send response via channel for
which there is no open connection, connection id 10.200.3.174:9092
-10.200.2.192:52088-5632843 (kafka.network.Processor)
[2022-05-21 09:08:05,321] INFO [Log partition=efGamePlay-41,
dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Found deletable segments
with base offsets [2116567157] due to retention time 3600000ms breach
(kafka.log.Log)
