[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570326#comment-17570326 ]

Haifeng Chen commented on KAFKA-2729:

We saw this issue in 1.1 when Kafka reconnected to ZooKeeper. It pushed partitions under minISR, and they recovered in 2 minutes.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0, 2.4.1
> Reporter: Danil Serdyuchenko
> Assignee: Onur Karaman
> Priority: Critical
> Fix For: 1.1.0
>
> After a small network wobble where ZooKeeper nodes couldn't reach each other,
> we started seeing a large number of under-replicated partitions. The ZooKeeper
> cluster recovered, but we continued to see a large number of
> under-replicated partitions. Two brokers in the Kafka cluster were showing this
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only
> recovered after a restart. Our own investigation yielded nothing; I was hoping
> you could shed some light on this issue. It is possibly related to
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using
> 0.8.2.1.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483450#comment-17483450 ]

Yiming Zang commented on KAFKA-2729:

We are still seeing this issue in 2.7.0, so I'm not sure whether it has been resolved. When it happens, partitions fall under minISR and produce requests start to fail. It is often triggered by the restart of a single broker.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472207#comment-17472207 ]

Jun Rao commented on KAFKA-2729:

[~mgabriel]: The BadVersion error on the ZK server just indicates that a conditional update has failed. It's the result and not the cause. To understand the cause, we need to know whether the controller changed the metadata for partition topicXYZ-1 before the time when "Cached zkVersion" was reported. You can grep the state-change log on the controller to find that out.

If the controller didn't make the change, you can parse the ZK commit log to see which client updated the partition metadata. If the controller did make the change, you can then look at the state-change log on node 1 to see whether it has received the latest partition metadata from the controller.
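To make the "conditional update" in the comment above concrete, here is a minimal sketch in plain Python. It does not use a ZooKeeper client; `VersionedNode`, `BadVersionError`, and `try_shrink_isr` are hypothetical names invented for this illustration. It only models the general idea that a write carrying a stale version is refused, which would explain why BadVersion is a symptom of an earlier write by someone else.

```python
class BadVersionError(Exception):
    """Raised when the caller's expected version is not the stored one."""


class VersionedNode:
    """Toy stand-in for a znode: data plus a version counter (hypothetical)."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def set_data(self, data, expected_version):
        # Conditional update: succeed only if the caller's version is current.
        if expected_version != self.version:
            raise BadVersionError(
                f"expected {expected_version}, node is at {self.version}")
        self.data = data
        self.version += 1
        return self.version


def try_shrink_isr(node, cached_version, new_isr):
    """Return (updated, version); with a stale cache the update is skipped."""
    try:
        return True, node.set_data({"isr": new_isr}, cached_version)
    except BadVersionError:
        return False, cached_version


state = VersionedNode({"isr": [5, 3, 1]})
cached = state.version                    # the leader caches version 0
state.set_data({"isr": [5, 3, 1]}, 0)     # another writer bumps the node to 1
result = try_shrink_isr(state, cached, [5, 1])
print(result)                             # (False, 0): update skipped
```

In this toy model the leader can only succeed again after it learns the new version, which is why the questions above focus on who made the intervening write and whether the leader was told about it.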
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472112#comment-17472112 ]

Matthias Gabriel commented on KAFKA-2729:

Hey [~junrao],

We also have the same issue, recurring once a week, in version 1.1.0, which is marked as the "Fix version". We run a cluster with 3 Kafka brokers.

Node-1:
{code:java}
[2021-12-31 19:12:23,540] INFO [Partition topicXYZ-1 broker=1] Shrinking ISR from 5,3,1 to 5,1 (kafka.cluster.Partition)
[2021-12-31 19:12:23,544] INFO [Partition topicXYZ-1 broker=1] Cached zkVersion [326] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}

On Node-2 we do not see any related message for the time period.

On Node-3 we have the following message, though we are not sure whether it is related at all:
{code:java}
2021-12-31 19:12:23,541 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@653] - Got user-level KeeperException when processing sessionid:0x1004521e6ed type:setData cxid:0xbca4 zxid:0x3a4a372 txntype:-1 reqpath:n/a Error Path:/brokers/topics/topicXYZ/partitions/1/state Error:KeeperErrorCode = BadVersion for /brokers/topics/topicXYZ/partitions/1/state
{code}

Do you have any idea what we could do, or which data we could deliver to give you additional insights?

Thanks,
Matthias
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396210#comment-17396210 ]

Jun Rao commented on KAFKA-2729:

[~axrj]: If broker 5's ZK session expires and gets re-established, its broker epoch will change. So it's possible for broker 5 to temporarily receive and reject a LeaderAndIsr request from the controller. The question is whether broker 5 eventually receives the LeaderAndIsr request once the controller has detected the new broker registration. This can be verified from the controller log and the state-change log.
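As a rough illustration of the epoch comparison described in the comment above, the sketch below classifies a controller request by comparing the broker epoch it carries with the epoch the broker currently holds. The function names and the three-way outcome are assumptions modeled on the exception text reported elsewhere in this thread ("Epoch ... larger than current broker epoch ..."); this is not a copy of Kafka's code.

```python
def is_broker_epoch_stale(request_epoch, current_epoch):
    """Hypothetical helper: True if the request targets an older registration."""
    if request_epoch < current_epoch:
        return True
    if request_epoch == current_epoch:
        return False
    # The request is ahead of what this broker believes its own epoch to be.
    raise RuntimeError(
        f"Epoch {request_epoch} larger than current broker epoch {current_epoch}")


def handle_request(request_epoch, current_epoch):
    """Hypothetical handler: report what would happen to the request."""
    try:
        if is_broker_epoch_stale(request_epoch, current_epoch):
            return "ignored: stale epoch"
        return "applied"
    except RuntimeError as err:
        return f"rejected: {err}"


# Epoch values taken from the error reported in this thread.
outcome = handle_request(223338313060, 223338311791)
print(outcome)
```

Under this reading, a rejection is harmless only if the controller resends the request after the broker has caught up with its new epoch, which is the point that needs checking in the logs.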
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17389844#comment-17389844 ]

Raj commented on KAFKA-2729:

Hi [~junrao],

This was just hit in our production as well, although I was able to resolve it by restarting only the broker that reported errors, as opposed to the controller or the whole cluster.

Kafka version: 2.3.1

I can confirm the events are identical to what [~l0co] explained above.
* ZK session disconnected on broker 5
* Replica fetchers stopped on other brokers
* ZK connection re-established on broker 5 after a few seconds
* Broker 5 came back online, started reporting "Cached zkVersion[130] not equal to...", and shrunk ISRs to only itself

As it didn't recover automatically, I restarted the broker after 30 minutes and it then went back to normal.

I did see that the controller tried to send correct metadata to broker 5, but it was rejected due to an epoch inconsistency.
{noformat}
ERROR [KafkaApi-5] Error when handling request: clientId=21, correlationId=2, api=UPDATE_METADATA, body={controller_id=21,controller_epoch=53,broker_epoch=223338313060,topic_states=[{topic-a,partition_states=[{partition=0,controller_epoch=53,leader=25,leader_epoch=70,isr=[25,17],zk_version=131,replicas=[5,25,17],offline_replicas=[]}...
...
java.lang.IllegalStateException: Epoch 223338313060 larger than current broker epoch 223338311791
    at kafka.server.KafkaApis.isBrokerEpochStale(KafkaApis.scala:2612)
    at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:194)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:117)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
    at java.base/java.lang.Thread.run(Thread.java:834)
...
...
...
[2021-07-29 11:07:30,210] INFO [Partition topic-a-0 broker=5] Cached zkVersion [130] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
...
{noformat}

Preferred leader election error as seen on the controller:
{noformat}
[2021-07-29 11:11:57,432] ERROR [Controller id=21] Error completing preferred replica leader election for partition topic-a-0 (kafka.controller.KafkaController)
kafka.common.StateChangeFailedException: Failed to elect leader for partition topic-a-0 under strategy PreferredReplicaPartitionLeaderElectionStrategy
    at kafka.controller.ZkPartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:381)
    at kafka.controller.ZkPartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:378)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:378)
    at kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:305)
    at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:215)
    at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:145)
    at kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:646)
    at kafka.controller.KafkaController$$anonfun$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:995)
    at kafka.controller.KafkaController$$anonfun$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:976)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
    at kafka.controller.KafkaController.checkAndTriggerAutoLeaderRebalance(KafkaController.scala:976)
    at kafka.controller.KafkaController.processAutoPreferredReplicaLeaderElection(KafkaController.scala:1004)
    at kafka.controller.KafkaController.process(KafkaController.scala:1564)
    at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:53)
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:137)
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:137)
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:137)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:136)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:89)
{noformat}

After the restart of broker 5, it was able to take back leadership of the desired partitions.

Kindly let me know if
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372214#comment-17372214 ]

Jun Rao commented on KAFKA-2729:

[~l0co]: The leaderEpoch doesn't always match the zkVersion. For example, when the leader expands or shrinks the ISR, it changes the zkVersion but not the leaderEpoch.
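The difference between the two counters can be shown with a toy model. This is only a sketch of the bookkeeping described in the comment above: `PartitionState` and its methods are hypothetical names, and the model assumes that every write to the partition state advances the zkVersion while only a leader change advances the leaderEpoch.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    """Toy model of per-partition state (hypothetical, for illustration)."""
    leader: int
    isr: list
    leader_epoch: int = 0
    zk_version: int = 0

    def elect_leader(self, new_leader):
        # A leader change is a state write, so both counters advance.
        self.leader = new_leader
        self.leader_epoch += 1
        self.zk_version += 1

    def update_isr(self, new_isr):
        # An ISR expand or shrink is a state write without a leader change,
        # so only the version advances.
        self.isr = list(new_isr)
        self.zk_version += 1


p = PartitionState(leader=1, isr=[1, 2, 0])
p.elect_leader(0)         # leader_epoch 1, zk_version 1
p.update_isr([0, 1])      # leader_epoch 1, zk_version 2
p.update_isr([0, 1, 2])   # leader_epoch 1, zk_version 3
print(p.leader_epoch, p.zk_version)  # 1 3
```

After a few ISR changes the two numbers drift apart, so a zkVersion much larger than the leader epoch is not in itself a sign of inconsistency.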
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371202#comment-17371202 ]

l0co commented on KAFKA-2729:

[~junrao] thanks for the reply. Unfortunately, from the logs preserved from this breakdown, this is the only useful part I have:
{code:java}
[2021-06-22 14:06:50,637] INFO 1/kafka0/server.log.2021-06-22-14: [Partition __consumer_offsets-30 broker=0] __consumer_offsets-30 starts at Leader Epoch 117 from offset 2612283. Previous Leader Epoch was: 116 (kafka.cluster.Partition)
[2021-06-22 14:07:04,184] INFO 1/kafka1/server.log.2021-06-22-14: [Partition __consumer_offsets-30 broker=1] Shrinking ISR from 1,2,0 to 1,2 (kafka.cluster.Partition)
[2021-06-22 14:07:04,186] INFO 1/kafka1/server.log.2021-06-22-14: [Partition __consumer_offsets-30 broker=1] Cached zkVersion [212] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2021-06-22 14:07:09,146] INFO 1/kafka1/server.log.2021-06-22-14: [Partition __consumer_offsets-30 broker=1] Shrinking ISR from 1,2,0 to 1,2 (kafka.cluster.Partition)
[2021-06-22 14:07:09,147] INFO 1/kafka1/server.log.2021-06-22-14: [Partition __consumer_offsets-30 broker=1] Cached zkVersion [212] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}

After the ZooKeeper reconnection in kafka0, kafka0 becomes the leader with epoch 117, and then kafka1 starts to complain about cached zkVersion 212, which is a greater number. What does this mean to you?

We suspect that the ZooKeeper node of kafka0 was disconnected from the kafka1 and kafka2 ZooKeeper nodes and established its own separate cluster, and that after all ZooKeeper nodes got back into one cluster, it became inconsistent. Does that make sense to you?
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370724#comment-17370724 ]

Jun Rao commented on KAFKA-2729:

[~l0co], thanks for reporting this. The "Cached zkVersion [212]" error indicates that the leader epoch was changed by the controller but somehow wasn't propagated to the broker. Could you grep for "Partition __consumer_offsets-30" in the controller log and the state-change log, and see which controller changed the leader epoch corresponding to zkVersion 212 and whether the controller tried to propagate that info to the brokers?
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368661#comment-17368661 ]

l0co commented on KAFKA-2729:

This problem is certainly not fixed in `1.1.0`, as we still experience it with this Kafka version. This ticket should be reopened, unless the problem is being resolved elsewhere (KAFKA-3042, KAFKA-7888?).

Our scenario is the following: we have `kafka0`, `kafka1` and `kafka2` nodes.

1. `kafka0` loses its ZooKeeper connection:
{code:java}
WARN Unable to reconnect to ZooKeeper service, session 0x27a31276f6d has expired (org.apache.zookeeper.ClientCnxn)
INFO Unable to reconnect to ZooKeeper service, session 0x27a31276f6d has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
INFO EventThread shut down for session: 0x27a31276f6d (org.apache.zookeeper.ClientCnxn)
{code}

2. However, a second later the connection is established properly:
{code:java}
[ZooKeeperClient] Initializing a new session to [...] (kafka.zookeeper.ZooKeeperClient)
[2021-06-22 14:06:47,838] INFO Opening socket connection to server [...]. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2021-06-22 14:06:47,873] INFO Socket connection established to [...], initiating session (org.apache.zookeeper.ClientCnxn)
[2021-06-22 14:06:47,933] INFO Creating /brokers/ids/0 (is it secure? false) (kafka.zk.KafkaZkClient)
[2021-06-22 14:06:47,959] INFO Session establishment complete on server [...], sessionid = 0x27a31276f6d0003, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
{code}

3. But a few seconds later `ReplicaFetcherThread` is shut down in `kafka0`:
{code:java}
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
{code}
We suppose this shutdown is the source of the problem.

4. Now, because `kafka0` sends no replication requests to `kafka1` and `kafka2`, `kafka1` and `kafka2` shrink the ISR list and start to complain about zkVersion:
{code:java}
INFO [Partition __consumer_offsets-30 broker=1] Shrinking ISR from 1,2,0 to 1,2 (kafka.cluster.Partition)
INFO [Partition __consumer_offsets-30 broker=1] Cached zkVersion [212] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
This continues until the whole cluster is restarted. Note that the cluster state is now inconsistent, because `kafka0` has stopped being a replica for `kafka1` and `kafka2`, while `kafka1` and `kafka2` still work as replicas for `kafka0`. This is because `ReplicaFetcherThread` has only been stopped in `kafka0`.

5. Finally, the whole Kafka cluster stops processing events, at least for partitions led by `kafka0`, because of:
{code:java}
ERROR [ReplicaManager broker=0] Error processing append operation on partition __consumer_offsets-18 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.NotEnoughReplicasException: Number of insync replicas for partition __consumer_offsets-18 is [1], below required minimum [2]
{code}
We also suspect that in this scenario `kafka0` becomes the leader for all partitions, but this is not confirmed yet.
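The failure in step 5 of the scenario above follows from a simple size check on the in-sync replica set. The sketch below reproduces that check in isolation; `check_min_isr` and `NotEnoughReplicasError` are hypothetical names, the message format is copied from the log line quoted in this thread, and the sketch deliberately ignores everything else a broker does when appending (including any dependence on the producer's acknowledgement setting).

```python
class NotEnoughReplicasError(Exception):
    """Hypothetical stand-in for the broker-side NotEnoughReplicasException."""


def check_min_isr(partition, isr, min_insync_replicas):
    """Refuse the append when the ISR is smaller than the required minimum."""
    if len(isr) < min_insync_replicas:
        raise NotEnoughReplicasError(
            f"Number of insync replicas for partition {partition} is "
            f"[{len(isr)}], below required minimum [{min_insync_replicas}]")
    return True


# kafka0 leads the partition but its ISR has shrunk to itself only.
try:
    check_min_isr("__consumer_offsets-18", [0], 2)
    message = "append allowed"
except NotEnoughReplicasError as err:
    message = str(err)
print(message)
```

With the ISR stuck at a single member and the leader unable to update it, every append that needs the minimum is refused, which matches the observation that the affected partitions stop making progress until a restart.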
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259314#comment-17259314 ]

Jun Rao commented on KAFKA-2729:

If you still see this issue, it would be useful to confirm the following.
# Is "Cached zkVersion" for the same partition long-lasting? A transient "Cached zkVersion" can happen and is OK.
# Does the zkVersion for the partition state in ZK match the latest zkVersion recorded by the controller (logged in the state-change log)? If not, this indicates a potential problem in ZK.
# Otherwise, did the broker receive the latest zkVersion in the LeaderAndIsr request from the controller (logged in the state-change log)?
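The first check in the list above (is the message long-lasting for the same partition?) can be approximated by scanning a broker's server log. The sketch below is one possible way to do that; the regular expression targets the newer log format quoted in this thread (`[Partition <topic-partition> broker=<id>]`), the 5-minute threshold is an arbitrary assumption, and the third sample line is made up for the demonstration rather than taken from any real log.

```python
import re
from datetime import datetime, timedelta

LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\].*?"
    r"\[Partition (?P<partition>\S+) broker=(?P<broker>\d+)\] "
    r"Cached zkVersion \[(?P<version>\d+)\] not equal to that in zookeeper"
)


def long_lasting(lines, threshold=timedelta(minutes=5)):
    """Map (partition, broker, zkVersion) to how long the message persisted,
    keeping only entries that lasted at least `threshold`."""
    spans = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        key = (m["partition"], int(m["broker"]), int(m["version"]))
        first, last = spans.get(key, (ts, ts))
        spans[key] = (min(first, ts), max(last, ts))
    return {k: last - first
            for k, (first, last) in spans.items()
            if last - first >= threshold}


MSG = ("INFO [Partition __consumer_offsets-30 broker=1] Cached zkVersion [212] "
       "not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)")
sample = [
    "[2021-06-22 14:07:04,186] " + MSG,
    "[2021-06-22 14:07:09,147] " + MSG,
    "[2021-06-22 14:37:09,147] " + MSG,  # hypothetical later occurrence
]
stuck = long_lasting(sample)
print(stuck)
```

A partition that shows up in the result with the same cached version over a long window is a candidate for the second and third checks; a partition that appears only briefly is likely the transient case.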
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257448#comment-17257448 ]

Victor Garcia commented on KAFKA-2729:

Yes, it seems this issue is not yet fixed, and it should be reopened. We just had this problem with version 1.1.0.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216323#comment-17216323 ]
M. Manna commented on KAFKA-2729:
This resurfaced for us in our production environment yesterday and caused an outage. Has anyone else seen this issue recently? We are using Confluent 2.4.1, without any customisation. It would be good to know if there are any steps that reproduce this reliably. The above-mentioned test (network stretch or switching) is quite difficult for us to run at the moment.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826946#comment-16826946 ]
Shawn YUAN commented on KAFKA-2729:
Thank you [~DEvil]. I'm checking whether there is any git commit or diff/patch for the fix.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823858#comment-16823858 ]
Alexander Binzberger commented on KAFKA-2729:
[~evildracula] In an older version of Kafka it was reproducible under very high network load, possibly when the switches/network are at their limits, maybe in combination with dropped packets. Try adding delay or load to the network and machines, and maybe drop some amount of packets.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818083#comment-16818083 ]
evildracula commented on KAFKA-2729:
Hello [~junrao], I'm now using 0.11.0.3, which is in the affected versions. I would like to reproduce this issue in my DEV environment. Could you please provide steps to reproduce it? Many thanks. I tried to reproduce it with **systemctl start/stop iptables**, but that failed.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720819#comment-16720819 ]
Jun Rao commented on KAFKA-2729:
[~murri71], my earlier comment referred to the following issue, which we fixed in KAFKA-7165. It's possible that this issue can lead to "Cached zkVersion" in some scenarios. I am not sure if there are other issues that can still lead to "Cached zkVersion". So, my recommendation is to upgrade to 2.2 when it's released and file a separate Jira if the issue still exists.
[2018-10-05 21:08:28,025] ERROR Error while creating ephemeral at /controller, node already exists and owner '3703712903740981258' does not match current session '3775770497779040270' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2018-10-05 21:08:28,025] ERROR Error while creating ephemeral at /controller, node already exists and owner '3703712903740981258' does not match current session '3775770497779040270' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2018-10-05 21:08:28,025] INFO Result of znode creation at /controller is: NODEEXISTS (kafka.zk.KafkaZkClient)
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719147#comment-16719147 ]
Jœl Salmerón Viver commented on KAFKA-2729:
[~junrao], it seems this issue is not yet fixed. We, like others here, are experiencing the same replication loop on partitions when trying to delete topics via the bin/kafka-topics command using 1.1 brokers. If it is fixed as you say, could someone update the issue with the exact broker version where it is fixed? I am torn about upgrading the brokers to 2.2, as this bug report does not reflect what you imply by your "We fixed another issue" comment above.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709539#comment-16709539 ]
Jun Rao commented on KAFKA-2729:
In KAFKA-7165 (2.2.0), we fixed another issue that can cause the re-creation of the broker registration in ZK to fail.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708393#comment-16708393 ]
Bo Wang commented on KAFKA-2729:
I also hit the same problem with 1.1.0. Is there a patch or version that solves this problem?
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16648157#comment-16648157 ]
adam keyser commented on KAFKA-2729:
Still seeing it here as well.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640382#comment-16640382 ]
Luigi Tagliamonte commented on KAFKA-2729:
This issue does not seem to be fixed in 1.1.
Cluster details:
* 3-node Kafka cluster running 1.1
* 3-node ZooKeeper cluster running 3.4.10
Today, while I was replacing a ZooKeeper server, the leader among the brokers experienced this issue:
{code:java}
[2018-10-05 21:03:02,799] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2018-10-05 21:08:20,060] INFO Unable to read additional data from server sessionid 0x34663b434985000e, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2018-10-05 21:08:21,001] INFO Opening socket connection to server 10.48.208.70/10.48.208.70:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2018-10-05 21:08:21,003] WARN Session 0x34663b434985000e for server null, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[2018-10-05 21:08:21,797] INFO Opening socket connection to server 10.48.210.44/10.48.210.44:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2018-10-05 21:08:21,799] INFO Socket connection established to 10.48.210.44/10.48.210.44:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2018-10-05 21:08:21,802] INFO Session establishment complete on server 10.48.210.44/10.48.210.44:2181, sessionid = 0x34663b434985000e, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2018-10-05 21:08:28,015] INFO Creating /controller (is it secure? false) (kafka.zk.KafkaZkClient)
[2018-10-05 21:08:28,015] INFO Creating /controller (is it secure? false) (kafka.zk.KafkaZkClient)
[2018-10-05 21:08:28,025] ERROR Error while creating ephemeral at /controller, node already exists and owner '3703712903740981258' does not match current session '3775770497779040270' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2018-10-05 21:08:28,025] ERROR Error while creating ephemeral at /controller, node already exists and owner '3703712903740981258' does not match current session '3775770497779040270' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2018-10-05 21:08:28,025] INFO Result of znode creation at /controller is: NODEEXISTS (kafka.zk.KafkaZkClient)
[2018-10-05 21:08:28,025] INFO Result of znode creation at /controller is: NODEEXISTS (kafka.zk.KafkaZkClient)
[2018-10-05 21:08:42,561] INFO [Partition -store-changelog-7 broker=1] Shrinking ISR from 2,1,3 to 1 (kafka.cluster.Partition)
[2018-10-05 21:08:42,561] INFO [Partition -store-changelog-7 broker=1] Shrinking ISR from 2,1,3 to 1 (kafka.cluster.Partition)
[2018-10-05 21:08:42,569] INFO [Partition -store-changelog-7 broker=1] Cached zkVersion [11] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-10-05 21:08:42,569] INFO [Partition -store-changelog-7 broker=1] Cached zkVersion [11] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-10-05 21:08:42,569] INFO [Partition bycontact_0-19 broker=1] Shrinking ISR from 2,1,3 to 1 (kafka.cluster.Partition)
[2018-10-05 21:08:42,569] INFO [Partition bycontact_0-19 broker=1] Shrinking ISR from 2,1,3 to 1 (kafka.cluster.Partition)
[2018-10-05 21:08:42,574] INFO [Partition bycontact_0-19 broker=1] Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-10-05 21:08:42,574] INFO [Partition bycontact_0-19 broker=1] Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
The only way to recover was to restart the broker.
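When a broker log fills up with lines like the ones in the comment above, it can help to summarise which partitions are stuck and at which cached version. The following is a minimal sketch, assuming the 1.1-style message format shown in that log; the function name `count_stale_partitions` is hypothetical and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the 1.1-style message, e.g.
# "[Partition bycontact_0-19 broker=1] Cached zkVersion [44] not equal to that in zookeeper"
STALE = re.compile(
    r"\[Partition (?P<partition>\S+) broker=(?P<broker>\d+)\] "
    r"Cached zkVersion \[(?P<version>\d+)\] not equal to that in zookeeper"
)


def count_stale_partitions(lines):
    """Count 'Cached zkVersion' messages per (partition, cached version)."""
    counts = Counter()
    for line in lines:
        match = STALE.search(line)
        if match:
            key = (match.group("partition"), int(match.group("version")))
            counts[key] += 1
    return counts


# Sample lines copied from the broker log quoted above.
sample = [
    "[2018-10-05 21:08:42,569] INFO [Partition -store-changelog-7 broker=1] Cached zkVersion [11] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)",
    "[2018-10-05 21:08:42,574] INFO [Partition bycontact_0-19 broker=1] Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)",
    "[2018-10-05 21:08:42,574] INFO [Partition bycontact_0-19 broker=1] Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)",
]
stale = count_stale_partitions(sample)
for (partition, version), hits in sorted(stale.items()):
    print(partition, version, hits)
```

A partition whose count keeps growing at the same cached version is a candidate for the stuck state reported in this thread. Older brokers (such as 0.8.2.1) log a different prefix, so the pattern would need adjusting there.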
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341892#comment-16341892 ]
Oleksiy Stashok commented on KAFKA-2729:
Thank you, that really helped!
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341863#comment-16341863 ]
Jun Rao commented on KAFKA-2729:
The problem that we fixed related to this jira is KAFKA-5642. Previously, when the controller's ZK session expired and it lost its controllership, it was possible for this zombie controller to continue updating ZK and/or sending LeaderAndIsrRequests to the brokers for a short period of time. When this happens, the broker may not have the most up-to-date information about leader and isr, which can lead to subsequent ZK failures when the isr needs to be updated. KAFKA-5642 fixes the issue by handling the ZK session expiration event properly.
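The failure mode described in the comment above can be illustrated with a small, self-contained sketch. This is not Kafka code: `VersionedStore` and `Leader` are hypothetical names that only mimic a version-checked write, where an update succeeds only if the caller's expected version matches the stored one. It assumes that some other writer (for example a zombie controller) bumps the version without the leader learning about it, and that the leader never refreshes its cached version on failure.

```python
class VersionedStore:
    """Hypothetical stand-in for a znode that carries a version counter."""

    def __init__(self, data):
        self.data = data
        self.version = 0

    def conditional_set(self, data, expected_version):
        # Version-checked write: succeed only if the caller's version is current.
        if expected_version != self.version:
            return False, self.version
        self.data = data
        self.version += 1
        return True, self.version


class Leader:
    """Hypothetical partition leader that caches the last version it saw."""

    def __init__(self, store):
        self.store = store
        self.cached_version = store.version

    def shrink_isr(self, new_isr):
        ok, version = self.store.conditional_set(new_isr, self.cached_version)
        if ok:
            self.cached_version = version
        # On failure the cached version is left untouched (an assumption made
        # here to mirror the "skip updating ISR" behaviour in the logs).
        return ok


store = VersionedStore([6, 5])
leader = Leader(store)

# A write the leader never hears about, e.g. from a zombie controller.
store.conditional_set([6, 5], store.version)

first = leader.shrink_isr([5])
second = leader.shrink_isr([5])
print(first, second, store.data)
```

Both attempts are rejected and the stored ISR never shrinks, because nothing in the sketch ever brings the leader's cached version back in sync. In the reports in this thread, a broker restart played that role.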
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341725#comment-16341725 ]
Oleksiy Stashok commented on KAFKA-2729:
[~ijuma] can you please provide more information on the issue you guys fixed? People here report issues which may or may not be related, so it would be good to understand what exactly you were able to reproduce and fix.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334281#comment-16334281 ]
Martin Nowak commented on KAFKA-2729:
Confirmed that for me all occurrences of this issue were preceded by ZK session timeouts/expirations. Looking forward to having this finally fixed.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220284#comment-16220284 ]
Francesco vigotti commented on KAFKA-2729:
I may have found the cause of my issue, which is possibly not related to this topic, because in my case a simple broker restart didn't work. I've created a dedicated issue for it: https://issues.apache.org/jira/browse/KAFKA-6129
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204367#comment-16204367 ]

Jun Rao commented on KAFKA-2729:

[~fravigotti], sorry to hear that. A couple of quick suggestions.

(1) Do you see any ZK session expiration in the log (e.g., INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient))? There are known bugs in Kafka in handling ZK session expiration, so it would be useful to avoid it in the first place. Typical causes of ZK session expiration are long GC pauses in the broker or network glitches, so you can either tune the broker or increase zookeeper.session.timeout.ms.

(2) Do you have lots of partitions (say, a few thousand) per broker? If so, you want to check whether the controlled shutdown succeeds when shutting down a broker. If it does not, restarting the broker too soon could also lead the cluster to a weird state. To address this issue, you can increase request.timeout.ms on the broker.

We are fixing the known issue in (1) and improving the performance with lots of partitions in (2) in KAFKA-5642, and we expect the fix to be included in the 1.1.0 release in Feb.
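The check in suggestion (1) can be done mechanically rather than by eye. The following is only a minimal sketch, not a Kafka tool: the function name `scan_broker_log` and the idea of counting mismatches that follow an expiration are assumptions made for illustration, and the two patterns are copied from log lines quoted in this thread, so other Kafka versions may word them differently.

```python
import re

# Message fragments copied from log lines quoted in this thread.
# Other Kafka versions may word these messages differently.
EXPIRED = re.compile(r"zookeeper state changed \(Expired\)|ZK expired")
MISMATCH = re.compile(r"Cached zkVersion \[\d+\] not equal to that in zookeeper")


def scan_broker_log(lines):
    """Count ZK session expirations and zkVersion mismatches in broker log lines.

    Also counts how many mismatch messages appear after the first expiration,
    which is the ordering several commenters on this issue describe.
    """
    expirations = 0
    mismatches = 0
    mismatches_after_expiration = 0
    for line in lines:
        if EXPIRED.search(line):
            expirations += 1
        elif MISMATCH.search(line):
            mismatches += 1
            if expirations > 0:
                mismatches_after_expiration += 1
    return {
        "expirations": expirations,
        "mismatches": mismatches,
        "mismatches_after_expiration": mismatches_after_expiration,
    }


# Sample lines taken from a broker log quoted elsewhere in this thread.
sample = [
    "[2017-06-22 02:07:36,363] INFO [SessionExpirationListener on 158980], ZK expired; "
    "shut down all controller components and try to re-elect "
    "(kafka.controller.KafkaController$SessionExpirationListener)",
    "[2017-06-22 17:41:06,928] INFO Partition [A,5] on broker 158980: Shrinking ISR for "
    "partition [A,5] from 158980,133641,155394 to 158980 (kafka.cluster.Partition)",
    "[2017-06-22 17:41:06,935] INFO Partition [A,5] on broker 158980: Cached zkVersion [73] "
    "not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)",
]
print(scan_broker_log(sample))
```

To run it against a real broker log, pass an open file object instead of `sample`; the function only iterates over lines, so it does not load the whole file into memory.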
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203772#comment-16203772 ]

Ismael Juma commented on KAFKA-2729:

If you're seeing the issue this often, then there's most likely a configuration issue. If you file a separate issue with all the logs (including GC logs) and configs (broker and ZK), maybe someone can help.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203749#comment-16203749 ]

Francesco vigotti commented on KAFKA-2729:

After losing 2 days on this, I reset the whole cluster: stopped all Kafka brokers, stopped the ZooKeeper cluster, deleted all directories, stopped all consumers and producers, then restarted everything and recreated the topics. And now guess what? :) One node reports:

{code:java}
[2017-10-13 15:54:52,893] INFO Partition [__consumer_offsets,5] on broker 2: Expanding ISR for partition __consumer_offsets-5 from 10,13,2 to 10,13,2,5 (kafka.cluster.Partition)
[2017-10-13 15:54:52,906] INFO Partition [__consumer_offsets,5] on broker 2: Cached zkVersion [13] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-10-13 15:54:52,908] INFO Partition [__consumer_offsets,25] on broker 2: Expanding ISR for partition __consumer_offsets-25 from 10,2,13 to 10,2,13,5 (kafka.cluster.Partition)
[2017-10-13 15:54:52,915] INFO Partition [__consumer_offsets,25] on broker 2: Cached zkVersion [10] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-10-13 15:54:52,916] INFO Partition [__consumer_offsets,45] on broker 2: Expanding ISR for partition __consumer_offsets-45 from 10,13,2 to 10,13,2,5 (kafka.cluster.Partition)
[2017-10-13 15:54:52,925] INFO Partition [__consumer_offsets,45] on broker 2: Cached zkVersion [15] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-10-13 15:54:52,926] INFO Partition [__consumer_offsets,5] on broker 2: Expanding ISR for partition __consumer_offsets-5 from 10,13,2 to 10,13,2,5 (kafka.cluster.Partition)
[2017-10-13 15:54:52,936] INFO Partition [__consumer_offsets,5] on broker 2: Cached zkVersion [13] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-10-13 15:54:52,939] INFO Partition [__consumer_offsets,25] on broker 2: Expanding ISR for partition __consumer_offsets-25 from 10,2,13 to 10,2,13,5 (kafka.cluster.Partition)
{code}

while the others report:

{code:java}
[2017-10-13 15:57:08,128] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,40] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 15:57:09,129] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,40] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 15:57:10,260] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,40] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 15:57:11,262] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,40] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 15:57:12,265] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,40] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 15:57:13,289] ERROR [ReplicaFetcherThread-0-2], Error fo
{code}

The cluster is still inconsistent. I've also added 2 more nodes hoping for more stability, but nothing changed. I don't know whether something is misconfigured, because if Kafka does some kind of pre-flight check during startup, it logs nothing. The only logs are the ones above, which make no sense because the leader should be re-elected when there are ISRs available, and there are.

I've started looking for alternative software to use; trying to use Kafka is so frustrating :(
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203331#comment-16203331 ]

Francesco vigotti commented on KAFKA-2729:

At the beginning of my cluster screw-up I got tons of zkVersion errors, which is why I posted here. But since the problem seems to go away for you after restarting your brokers, maybe my problem is different.

Kafka version: 0.10.2.1
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203244#comment-16203244 ]

Ismael Juma commented on KAFKA-2729:

[~fravigotti], none of your log messages seem to be about the zkVersion issue. Is it really the same issue as this one? If not, you should file a separate JIRA including the Kafka version.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203198#comment-16203198 ]

Francesco vigotti commented on KAFKA-2729:

I'm having the same issue and definitely losing trust in Kafka. Every 2 months there is something that forces me to reset the whole cluster. I've been searching for a good alternative for a distributed, persistent, fast queue for a while, and have yet to find something that gives me a good vibe.

Anyway, I'm facing this same issue with some small differences: restarting all brokers (both together and as a rolling restart) didn't fix it. All brokers in the cluster log errors like these:

--- broker 5
{code:java}
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,17] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,23] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,47] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition [__consumer_offsets,29] to broker 2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
{code}

--- broker 3
{code:java}
[2017-10-13 08:13:58,547] INFO Partition [__consumer_offsets,20] on broker 3: Expanding ISR for partition __consumer_offsets-20 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,551] INFO Partition [__consumer_offsets,44] on broker 3: Expanding ISR for partition __consumer_offsets-44 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,554] INFO Partition [__consumer_offsets,5] on broker 3: Expanding ISR for partition __consumer_offsets-5 from 2,3 to 2,3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,557] INFO Partition [__consumer_offsets,26] on broker 3: Expanding ISR for partition __consumer_offsets-26 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,563] INFO Partition [__consumer_offsets,29] on broker 3: Expanding ISR for partition __consumer_offsets-29 from 2,3 to 2,3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,566] INFO Partition [__consumer_offsets,32] on broker 3: Expanding ISR for partition __consumer_offsets-32 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,570] INFO Partition [legacyJavaVarT,2] on broker 3: Expanding ISR for partition legacyJavaVarT-2 from 3 to 3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,573] INFO Partition [test4,3] on broker 3: Expanding ISR for partition test4-3 from 2,3 to 2,3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,577] INFO Partition [test4,0] on broker 3: Expanding ISR for partition test4-0 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,582] INFO Partition [test3,5] on broker 3: Expanding ISR for partition test3-5 from 3 to 3,5 (kafka.cluster.Partition)
{code}

--- broker 2
{code:java}
[2017-10-13 08:13:36,289] INFO Partition [__consumer_offsets,11] on broker 2: Expanding ISR for partition __consumer_offsets-11 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,293] INFO Partition [__consumer_offsets,41] on broker 2: Expanding ISR for partition __consumer_offsets-41 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,296] INFO Partition [test3,2] on broker 2: Expanding ISR for partition test3-2 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,300] INFO Partition [__consumer_offsets,23] on broker 2: Expanding ISR for partition __consumer_offsets-23 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,304] INFO Partition [__consumer_offsets,5] on broker 2: Expanding ISR for partition __consumer_offsets-5 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,337] INFO Partition [__consumer_offsets,35] on broker 2: Expanding ISR for partition __consumer_offsets-35 from 2,5 to 2,5,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,372] INFO Partition [test_mainlog,24] on broker 2: Expanding ISR for partition test_mainlog-24 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,375] INFO Partition [test_mainlog,6] on broker 2: Expanding ISR for partition test_mainlog-6 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,379] INFO Partition [test_mainlog,18] on broker 2: Expanding ISR for partition test_mainlog-18 from 2 to 2,3
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196853#comment-16196853 ]

sumit jain commented on KAFKA-2729:

Facing the same issue. Here's the question I asked on Stack Overflow: https://stackoverflow.com/questions/46644764/kafka-cached-zkversion-not-equal-to-that-in-zookeeper-broker-not-recovering
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16145648#comment-16145648 ]

Peter Davis commented on KAFKA-2729:

[~junrao] Per your previous [comment](https://issues.apache.org/jira/browse/KAFKA-2729?focusedCommentId=16107042=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16107042), is this issue definitely covered under [KAFKA-5027](https://issues.apache.org/jira/browse/KAFKA-5027) then? It is not linked there.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131341#comment-16131341 ]

Joseph Aliase commented on KAFKA-2729:

This has happened to us twice in prod. A restart seems to be the only solution.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106965#comment-16106965 ]

Dan commented on KAFKA-2729:

Happened in 0.11.0.0 as well. Had to restart the broker to bring it back to an operational state.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060110#comment-16060110 ]

Jun Rao commented on KAFKA-2729:

[~timoha], we are trying to address the ZK session expiration issue in the controller improvement work under https://issues.apache.org/jira/browse/KAFKA-5027.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060049#comment-16060049 ]

Andrey Elenskiy commented on KAFKA-2729:

Seeing the same issue on 0.10.2. A node running ZooKeeper lost networking for a split second and initiated an election, which caused some sessions to expire with:

```
2017-06-22 02:07:36,092 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@373] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
```

which caused the controller to resign:

```
[2017-06-22 02:07:36,363] INFO [SessionExpirationListener on 158980], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: Controller resigning, broker id 158980 (kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: De-registering IsrChangeNotificationListener (kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] INFO [Partition state machine on Controller 158980]: Stopped partition state machine (kafka.controller.PartitionStateMachine)
[2017-06-22 02:07:37,028] INFO [Replica state machine on controller 158980]: Stopped replica state machine (kafka.controller.ReplicaStateMachine)
[2017-06-22 02:07:37,028] INFO [Controller 158980]: Broker 158980 resigned as the controller (kafka.controller.KafkaController)
```

After that, the broker's server log kept showing this for the next 8 hours, until we restarted the broker manually:

```
[2017-06-22 17:41:06,928] INFO Partition [A,5] on broker 158980: Shrinking ISR for partition [A,5] from 158980,133641,155394 to 158980 (kafka.cluster.Partition)
[2017-06-22 17:41:06,935] INFO Partition [A,5] on broker 158980: Cached zkVersion [73] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
```
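To see how many partitions are stuck and for how long, the repeating "Cached zkVersion" messages can be grouped per partition. This is a rough sketch under stated assumptions: `summarize_mismatches` is a hypothetical helper, not part of Kafka, and the regular expression assumes the exact log format quoted in the comment above, which may differ in other versions.

```python
import re
from datetime import datetime

# Assumes the log format quoted in this thread, e.g.
# "[ts] INFO Partition [topic,partition] on broker N: Cached zkVersion [V] not equal ..."
MISMATCH = re.compile(
    r"^\[(?P<ts>[^\]]+)\] INFO Partition \[(?P<topic>[^,\]]+),(?P<partition>\d+)\] "
    r"on broker (?P<broker>\d+): Cached zkVersion \[(?P<version>\d+)\] "
    r"not equal to that in zookeeper"
)


def summarize_mismatches(lines):
    """Group zkVersion mismatch messages by (topic, partition).

    For each partition, records the message count, the first and last
    timestamps seen, and the set of cached versions that were logged.
    """
    summary = {}
    for line in lines:
        match = MISMATCH.match(line)
        if match is None:
            continue  # not a mismatch message (or a format this sketch does not know)
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        key = (match.group("topic"), int(match.group("partition")))
        entry = summary.setdefault(
            key, {"count": 0, "first": ts, "last": ts, "versions": set()}
        )
        entry["count"] += 1
        entry["first"] = min(entry["first"], ts)
        entry["last"] = max(entry["last"], ts)
        entry["versions"].add(int(match.group("version")))
    return summary


# The two broker log lines quoted in the comment above.
sample = [
    "[2017-06-22 17:41:06,928] INFO Partition [A,5] on broker 158980: Shrinking ISR for "
    "partition [A,5] from 158980,133641,155394 to 158980 (kafka.cluster.Partition)",
    "[2017-06-22 17:41:06,935] INFO Partition [A,5] on broker 158980: Cached zkVersion [73] "
    "not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)",
]
for (topic, partition), entry in summarize_mismatches(sample).items():
    print(topic, partition, entry["count"], sorted(entry["versions"]))
```

A partition whose count keeps growing while its set of cached versions stays the same would be consistent with the behaviour reported here, where the broker keeps retrying with a stale version until it is restarted. That interpretation is an inference from the logs in this thread, not something the sketch can prove.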