Hi all,

We recently hit an issue with one of our production Kafka clusters:
- It is a 3-node cluster
- Kafka server version is 1.0
*Issue*: One of the brokers had a problem resulting in the following:
1. The broker lost leadership of all of its topic-partitions.
2. However, the Kafka server process did *NOT* get stopped/killed.

I have attached the exception traces from the controller and server logs at that time at the end of this email.

*Questions:*
1. What could be the reason behind this happening?
2. How can I reproduce this exact scenario in our test environment?

P.S. We did not see any long GC pauses or general network blips at around that time.

*Exception trace in controller log:*

[2022-05-21 09:07:45,914] INFO [RequestSendThread controllerId=2] Starting (kafka.controller.RequestSendThread)
[2022-05-21 09:07:45,914] INFO [RequestSendThread controllerId=2] Starting (kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,354] ERROR [Controller id=2] Error while electing or becoming controller on broker 2 (kafka.controller.KafkaController)
kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either before or while waiting for connection
        at kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:233)
        at kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:221)
        at kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:221)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
        at kafka.zookeeper.ZooKeeperClient.kafka$zookeeper$ZooKeeperClient$$waitUntilConnected(ZooKeeperClient.scala:221)
        at kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:215)
        at kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply(ZooKeeperClient.scala:215)
        at kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply(ZooKeeperClient.scala:215)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
        at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:214)
        at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1461)
        at kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1437)
        at kafka.zk.KafkaZkClient.getPartitionReassignment(KafkaZkClient.scala:671)
        at kafka.controller.KafkaController.initializePartitionReassignment(KafkaController.scala:698)
        at kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:655)
        at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:232)
        at kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1203)
        at kafka.controller.KafkaController$Reelect$.process(KafkaController.scala:1479)
        at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:69)
        at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
        at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
        at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:68)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2022-05-21 09:07:48,358] DEBUG [Controller id=2] Resigning (kafka.controller.KafkaController)
[2022-05-21 09:07:48,358] DEBUG [Controller id=2] Unregister BrokerModifications handler for Set(2, 1) (kafka.controller.KafkaController)
[2022-05-21 09:07:48,360] INFO [PartitionStateMachine controllerId=2] Stopped partition state machine (kafka.controller.PartitionStateMachine)
[2022-05-21 09:07:48,361] INFO [ReplicaStateMachine controllerId=2] Stopped replica state machine (kafka.controller.ReplicaStateMachine)
[2022-05-21 09:07:48,361] INFO [RequestSendThread controllerId=2] Shutting down (kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Stopped (kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Shutdown completed (kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Shutting down (kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Stopped (kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Shutdown completed (kafka.controller.RequestSendThread)
[2022-05-21 09:07:48,365] INFO [Controller id=2] Resigned (kafka.controller.KafkaController)

*Snippets from server log:*

[2022-05-21 09:06:36,678] INFO [Log partition=efAcquisition-18, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at offset 252314842 in 2 ms. (kafka.log.Log)
[2022-05-21 09:06:53,106] INFO [GroupCoordinator 2]: Preparing to rebalance group spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0 with old generation 1 (__consumer_offsets-43) (kafka.coordinator.group.GroupCoordinator)
[2022-05-21 09:06:53,107] INFO [GroupCoordinator 2]: Group spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0 with generation 2 is now empty (__consumer_offsets-43) (kafka.coordinator.group.GroupCoordinator)
[2022-05-21 09:06:59,793] INFO [GroupCoordinator 2]: Preparing to rebalance group spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0 with old generation 1 (__consumer_offsets-37) (kafka.coordinator.group.GroupCoordinator)
[2022-05-21 09:06:59,793] INFO [GroupCoordinator 2]: Group spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0 with generation 2 is now empty (__consumer_offsets-37) (kafka.coordinator.group.GroupCoordinator)
[2022-05-21 09:07:21,154] INFO [ProducerStateManager partition=hh_topic-9] Writing producer snapshot at offset 7837005056 (kafka.log.ProducerStateManager)
[2022-05-21 09:07:21,155] INFO [Log partition=hh_topic-9, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at offset 7837005056 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:26,088] INFO [ProducerStateManager partition=efGamePlay-11] Writing producer snapshot at offset 2116578569 (kafka.log.ProducerStateManager)
[2022-05-21 09:07:26,089] INFO [Log partition=efGamePlay-11, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at offset 2116578569 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:27,170] INFO [ProducerStateManager partition=efGamePlay-18] Writing producer snapshot at offset 2116607621 (kafka.log.ProducerStateManager)
[2022-05-21 09:07:27,171] INFO [Log partition=efGamePlay-18, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at offset 2116607621 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:30,444] INFO [ProducerStateManager partition=auditTopic-36] Writing producer snapshot at offset 145646657 (kafka.log.ProducerStateManager)
[2022-05-21 09:07:30,445] INFO [Log partition=auditTopic-36, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at offset 145646657 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:33,189] INFO [ProducerStateManager partition=heartbeats-0] Writing producer snapshot at offset 43292127 (kafka.log.ProducerStateManager)
[2022-05-21 09:07:33,190] INFO [Log partition=heartbeats-0, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at offset 43292127 in 3 ms. (kafka.log.Log)
[2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Group spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0 transitioned to Dead in generation 2 (kafka.coordinator.group.GroupMetadataManager)
[2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Group spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0 transitioned to Dead in generation 2 (kafka.coordinator.group.GroupMetadataManager)
[2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2022-05-21 09:07:37,543] INFO [ProducerStateManager partition=auditTopic-0] Writing producer snapshot at offset 145649350 (kafka.log.ProducerStateManager)
[2022-05-21 09:07:37,544] INFO [Log partition=auditTopic-0, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at offset 145649350 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:39,207] INFO [ProducerStateManager partition=taxonomyOthers-48] Writing producer snapshot at offset 162628223 (kafka.log.ProducerStateManager)
[2022-05-21 09:07:39,207] INFO [Log partition=taxonomyOthers-48, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at offset 162628223 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:40,103] INFO Creating /controller (is it secure? false) (kafka.zk.KafkaZkClient)
[2022-05-21 09:07:40,115] INFO Result of znode creation at /controller is: OK (kafka.zk.KafkaZkClient)
[2022-05-21 09:07:45,915] WARN Client session timed out, have not heard from server in 5138ms for sessionid 0x27b85a3ead1005a (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:45,916] INFO Client session timed out, have not heard from server in 5138ms for sessionid 0x27b85a3ead1005a, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:45,933] WARN Attempting to send response via channel for which there is no open connection, connection id 10.200.3.174:9092-10.200.2.18:33360-5632853 (kafka.network.Processor)
[2022-05-21 09:07:45,936] WARN Attempting to send response via channel for which there is no open connection, connection id 10.200.3.174:9092-10.200.3.29:18284-5535253 (kafka.network.Processor)
[2022-05-21 09:07:45,995] WARN Attempting to send response via channel for which there is no open connection, connection id 10.200.3.174:9092-10.200.4.128:6078-5535526 (kafka.network.Processor)
[2022-05-21 09:07:45,996] WARN Attempting to send response via channel for which there is no open connection, connection id 10.200.3.174:9092-10.200.4.129:6664-5535480 (kafka.network.Processor)
[2022-05-21 09:07:46,028] INFO [ZooKeeperClient] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2022-05-21 09:07:46,028] INFO [ZooKeeperClient] Connected. (kafka.zookeeper.ZooKeeperClient)
[2022-05-21 09:07:46,902] INFO Opening socket connection to server 10.200/10.0.0.200:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,027] WARN Client session timed out, have not heard from server in 2001ms for sessionid 0x27b85a3ead1005a (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,027] INFO Client session timed out, have not heard from server in 2001ms for sessionid 0x27b85a3ead1005a, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,129] INFO [ZooKeeperClient] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)
[2022-05-21 09:07:48,294] INFO Opening socket connection to server 10.200.3.174/10.200.3.174:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,294] INFO Socket connection established to 10.200.3.174/10.200.3.174:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,313] WARN Unable to reconnect to ZooKeeper service, session 0x27b85a3ead1005a has expired (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,314] INFO Unable to reconnect to ZooKeeper service, session 0x27b85a3ead1005a has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,314] INFO EventThread shut down for session: 0x27b85a3ead1005a (org.apache.zookeeper.ClientCnxn)
[2022-05-21 09:07:48,318] INFO [ZooKeeperClient] Session expired. (kafka.zookeeper.ZooKeeperClient)
[2022-05-21 09:07:48,396] INFO [ProducerStateManager partition=efGamePlay-41] Writing producer snapshot at offset 2116645612 (kafka.log.ProducerStateManager)
[2022-05-21 09:07:48,397] INFO [Log partition=efGamePlay-41, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at offset 2116645612 in 2 ms. (kafka.log.Log)
[2022-05-21 09:07:49,452] INFO [ProducerStateManager partition=efAcquisition-47] Writing producer snapshot at offset 252224167 (kafka.log.ProducerStateManager)
[2022-05-21 09:07:49,453] INFO [Log partition=efAcquisition-47, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at offset 252224167 in 3 ms. (kafka.log.Log)
[2022-05-21 09:07:57,475] INFO [ProducerStateManager partition=efGamePlay-5] Writing producer snapshot at offset 2116617455 (kafka.log.ProducerStateManager)
[2022-05-21 09:07:57,476] INFO [Log partition=efGamePlay-5, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at offset 2116617455 in 2 ms. (kafka.log.Log)
[2022-05-21 09:08:00,642] INFO [Partition prism_sqldb_topic-55 broker=2] Shrinking ISR from 2,1 to 2 (kafka.cluster.Partition)
[2022-05-21 09:08:00,782] WARN Attempting to send response via channel for which there is no open connection, connection id 10.200.3.174:9092-10.200.2.192:52088-5632843 (kafka.network.Processor)
[2022-05-21 09:08:05,321] INFO [Log partition=efGamePlay-41, dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Found deletable segments with base offsets [2116567157] due to retention time 3600000ms breach (kafka.log.Log)
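P.P.S. For question 2, here is a sketch of what we are considering to reproduce the session expiry in a test cluster, assuming the Kafka 1.0 default zookeeper.session.timeout.ms of 6000 ms (the broker PID is a placeholder, and the commands are echoed as a dry run rather than executed). The idea is to freeze the broker JVM for longer than the ZooKeeper session timeout, so that the session expires while the process itself stays alive:

```shell
# Sketch (not verified against our cluster): pause the broker JVM with SIGSTOP
# for longer than zookeeper.session.timeout.ms (Kafka 1.0 default: 6000 ms),
# then resume it. ZooKeeper expires the session while the JVM is frozen, so
# the broker loses leadership without its process ever being stopped/killed.
SESSION_TIMEOUT_MS=6000
PAUSE_SECONDS=$(( SESSION_TIMEOUT_MS / 1000 + 4 ))  # comfortably past the timeout

# Dry run -- BROKER_PID is a placeholder; the commands are printed, not run.
BROKER_PID="<kafka-broker-pid>"
echo "kill -STOP ${BROKER_PID}  # heartbeats to ZooKeeper stop"
echo "sleep ${PAUSE_SECONDS}"
echo "kill -CONT ${BROKER_PID}  # broker resumes and finds its session expired"
```

Blocking port 2181 between the broker and the ZooKeeper ensemble (e.g. with a firewall rule) should produce a similar expiry if pausing the whole JVM is too blunt.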