[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042543#comment-16042543 ]

Pablo commented on KAFKA-2729:
------------------------------

Guys, this issue is not only affecting 0.8.2.1, as many people here are saying. We had this same problem on 0.10.2 during an upgrade from 0.8.2. We worked around it by increasing the ZK session and connection timeouts, and that worked fine, but we don't feel very safe. I suggest adding all the affected versions people are reporting.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, we started seeing a large number of under-replicated partitions. The zookeeper cluster recovered, however we continued to see a large number of under-replicated partitions. Two brokers in the kafka cluster were showing this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the affected brokers. Both brokers only recovered after a restart. Our own investigation yielded nothing; I was hoping you could shed some light on this issue. Possibly it's related to https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 0.8.2.1.
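For reference, the ZK session and connection timeouts Pablo mentions are broker settings in server.properties. A minimal sketch with illustrative values only (defaults in this era were 6000 ms; the right numbers depend on your network and GC behavior):

{code}
# server.properties -- illustrative values, not a tuned recommendation
# How long ZooKeeper waits without heartbeats before expiring the broker's session
zookeeper.session.timeout.ms=30000
# How long the broker waits when establishing its initial connection to ZooKeeper
zookeeper.connection.timeout.ms=30000
{code}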
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967735#comment-15967735 ]

Edoardo Comar commented on KAFKA-2729:
--------------------------------------

FWIW, we saw the same message {{Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)}} when redeploying Kafka 0.10.0.1 in a cluster after we had run 0.10.2.0. We had wiped Kafka's storage, but kept ZooKeeper (the version bundled with Kafka 0.10.2) and its storage. For us the cluster eventually recovered. HTH.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967705#comment-15967705 ]

Jun Rao commented on KAFKA-2729:
--------------------------------

Thanks for the additional info. In both [~Ronghua Lin]'s and [~allenzhuyi]'s cases, it seems ZK session expiration had happened. As I mentioned earlier in the jira, there is a known issue, reported in KAFKA-3083: when the controller's ZK session expires and it loses its controllership, it's possible for this zombie controller to continue updating ZK and/or sending LeaderAndIsrRequests to the brokers for a short period of time. When this happens, a broker may not have the most up-to-date information about leader and ISR, which can lead to subsequent ZK failures when the ISR needs to be updated. It may take some time to have this issue fixed. In the interim, the workaround is to make sure ZK session expiration never happens. The first thing is to figure out what's causing the ZK session to expire. Two common causes are (1) long broker GC pauses and (2) network glitches. For (1), one needs to tune the GC in the broker properly. For (2), one can look at the reported time that the ZK client couldn't hear from the ZK server and increase the ZK session expiration time accordingly.
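To make (1) concrete: broker GC pauses longer than the ZK session timeout are a classic trigger. A sketch of commonly used options, passed via the environment variables that kafka-server-start.sh reads (values are illustrative, not tuned recommendations):

{code}
# Illustrative JVM settings for the broker; size the heap for your workload
export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
bin/kafka-server-start.sh config/server.properties
{code}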
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15962085#comment-15962085 ]

allenzhuyi commented on KAFKA-2729:
-----------------------------------

We see the bug in kafka_2.12-0.10.2.0 when there is a long-latency ping timeout. We have 3 brokers; 2 brokers show the errors below:
{noformat}
Partition [__consumer_offsets,9] on broker 3: Shrinking ISR for partition [__consumer_offsets,9] from 3,1,2 to 3,2 (kafka.cluster.Partition)
Partition [__consumer_offsets,9] on broker 3: Cached zkVersion [89] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{noformat}
but the remaining broker doesn't notice the problem, and the producer continually reports timeout exceptions and fails to send messages. Only when the broker's ZK session finally expired and the broker re-registered its broker info in ZK did the system recover. It is a serious bug; please help us solve it quickly. Thank you.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947500#comment-15947500 ]

Sam Nguyen commented on KAFKA-2729:
-----------------------------------

We ran into this today on kafka_2.11-0.10.0.1. There is unexpected behavior with regard to partition availability. One out of 3 total brokers in our cluster entered this state (emitting "Cached zkVersion [140] not equal to that in zookeeper, skip updating ISR" errors). We have our producer's "required acks" config set to wait for all (-1), and min.insync.replicas set to 2. I would have expected to still be able to produce to the topic, but our producer (sarama) was getting timeouts. After restarting the broken broker, we were able to continue producing. I confirmed that even after performing a graceful shutdown of 1 out of 3 brokers, we are still able to produce, since we have 2 out of 3 brokers still alive to serve and acknowledge produce requests.
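For anyone checking the same expectation with the Java client instead of sarama, a minimal sketch (broker addresses and topic name are placeholders): with acks=all and min.insync.replicas=2 on a replication-factor-3 topic, produce requests should keep succeeding with one broker cleanly down, and should fail fast with NotEnoughReplicasException, rather than time out, once the ISR drops below 2:

{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092") // placeholder hosts
props.put("acks", "all") // equivalent to required acks = -1
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// With min.insync.replicas=2 on the topic, this .get() surfaces
// NotEnoughReplicasException when fewer than 2 replicas are in sync;
// a healthy 2-of-3 cluster still acknowledges the write.
producer.send(new ProducerRecord[String, String]("some-topic", "key", "value")).get()
producer.close()
{code}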
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937468#comment-15937468 ]

Stephane Maarek commented on KAFKA-2729:
----------------------------------------

If I may add, this is a pretty bad issue, but for us it got worse. You not only have to recover Kafka, but also recover your Kafka Connect clusters. They got stuck for me in the following state:
{noformat}
[2017-03-23 00:06:05,478] INFO Marking the coordinator kafka-1:9092 (id: 2147483626 rack: null) dead for group connect-MyConnector (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2017-03-23 00:06:05,478] INFO Marking the coordinator kafka-1:9092 (id: 2147483626 rack: null) dead for group connect-MyConnector (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
{noformat}
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933991#comment-15933991 ]

Ronghua Lin commented on KAFKA-2729:
------------------------------------

[~junrao], we also have this problem in a small cluster which has 3 brokers, running Kafka 0.10.1.1. When it happened, the logs of each broker looked like this:
{code:title=broker 2|borderStyle=solid}
[2017-03-20 01:03:48,903] INFO [Group Metadata Manager on Broker 2]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2017-03-20 01:13:27,283] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:27,293] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:27,294] INFO 2 successfully elected as leader (kafka.server.ZookeeperLeaderElector)
[2017-03-20 01:13:28,203] INFO re-registering broker info in ZK for broker 2 (kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:28,205] INFO Creating /brokers/ids/2 (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:28,218] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:28,219] INFO Registered broker 2 at path /brokers/ids/2 with addresses: PLAINTEXT -> EndPoint(x, ,PLAINTEXT) (kafka.utils.ZkUtils)
[2017-03-20 01:13:28,219] INFO done re-registering broker (kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:28,220] INFO Subscribing to /brokers/topics path to watch for new topics (kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:28,224] INFO New leader is 2 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2017-03-20 01:13:28,227] INFO New leader is 2 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2017-03-20 01:13:38,812] INFO Partition [topic1,1] on broker 2: Shrinking ISR for partition [topic1,1] from 0,2,1 to 2,1 (kafka.cluster.Partition)
[2017-03-20 01:13:38,825] INFO Partition [topic1,1] on broker 2: Cached zkVersion [6] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-03-20 01:13:38,825] INFO Partition [topic2,1] on broker 2: Shrinking ISR for partition [topic2,1] from 0,2,1 to 2,1 (kafka.cluster.Partition)
[2017-03-20 01:13:38,835] INFO Partition [topic2,1] on broker 2: Cached zkVersion [6] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-03-20 01:13:38,835] INFO Partition [topic3,0] on broker 2: Shrinking ISR for partition [topic3,0] from 0,2,1 to 2,1 (kafka.cluster.Partition)
[2017-03-20 01:13:38,847] INFO Partition [topic3,0] on broker 2: Cached zkVersion [6] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
{code:title=broker 1|borderStyle=solid}
[2017-03-20 01:03:38,255] INFO [Group Metadata Manager on Broker 1]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2017-03-20 01:13:27,451] INFO New leader is 2 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2017-03-20 01:13:27,490] INFO re-registering broker info in ZK for broker 1 (kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:27,491] INFO Creating /brokers/ids/1 (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:27,503] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:27,503] INFO Registered broker 1 at path /brokers/ids/1 with addresses: PLAINTEXT -> EndPoint(,,PLAINTEXT) (kafka.utils.ZkUtils)
[2017-03-20 01:13:27,504] INFO done re-registering broker (kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:27,504] INFO Subscribing to /brokers/topics path to watch for new topics (kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:27,508] INFO New leader is 2 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2017-03-20 01:13:38,134] INFO Partition [__consumer_offsets,40] on broker 1: Shrinking ISR for partition [__consumer_offsets,40] from 1,0 to 1 (kafka.cluster.Partition)
[2017-03-20 01:13:38,155] INFO Partition [__consumer_offsets,40] on broker 1: Cached zkVersion [2] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-03-20 01:13:38,156] INFO Partition [__consumer_offsets,0] on broker 1: Shrinking ISR for partition [__consumer_offsets,0] from 1,0 to 1 (kafka.cluster.Partition)
[2017-03-20 01:13:38,161] INFO Partition [__consumer_offsets,0] on broker 1: Cached zkVersion [2] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-03-20 01:13:38,162] INFO Partition [__consumer_offsets,12] on broker 1: Shrinking ISR for partition [__consumer_offsets,12] from 1,0 to 1 (kafka.cluster.Partition)
[2017-03-20 01:13:38,170] INFO Partition [__consumer_offsets,12] on broker 1: Cached zkVersion [2] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
Re: [jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
Jun, I see that line elsewhere in the cluster. I don't see it happening on that particular broker that ran into the problem.

On Mon, Mar 6, 2017 at 5:02 PM, Jun Rao (JIRA) wrote:
> [~mjuarez], did you see ZK session expiration in the server.log in the controller around that time? The log will look like the following.
>
> INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898415#comment-15898415 ]

Jun Rao commented on KAFKA-2729:
--------------------------------

[~mjuarez], did you see ZK session expiration in the server.log in the controller around that time? The log will look like the following.
{noformat}
INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)
{noformat}
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891355#comment-15891355 ]

mjuarez commented on KAFKA-2729:
--------------------------------

We are also running into this problem in our staging cluster, running Kafka 0.10.0.1. Basically, it looks like this happened yesterday:
{noformat}
[2017-02-28 18:41:33,513] INFO Client session timed out, have not heard from server in 7799ms for sessionid 0x159d7893eab0088, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
{noformat}
I'm attributing that to a transient network issue, since we haven't seen any other issues. And less than a minute later, we started seeing these errors:
{noformat}
[2017-02-28 18:42:45,739] INFO Partition [analyticsInfrastructure_KafkaAvroUserMessage,16] on broker 101: Shrinking ISR for partition [analyticsInfrastructure_KafkaAvroUserMessage,16] from 102,101,105 to 101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,751] INFO Partition [analyticsInfrastructure_KafkaAvroUserMessage,16] on broker 101: Cached zkVersion [94] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-02-28 18:42:45,751] INFO Partition [qa_exporter11_slingshot_salesforce_invoice,6] on broker 101: Shrinking ISR for partition [qa_exporter11_slingshot_salesforce_invoice,6] from 101,105,104 to 101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,756] INFO Partition [qa_exporter11_slingshot_salesforce_invoice,6] on broker 101: Cached zkVersion [237] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-02-28 18:42:45,756] INFO Partition [GNRDEV_counters_singleCount,2] on broker 101: Shrinking ISR for partition [GNRDEV_counters_singleCount,2] from 101,105,104 to 101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,761] INFO Partition [GNRDEV_counters_singleCount,2] on broker 101: Cached zkVersion [334] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-02-28 18:42:45,761] INFO Partition [sod-spins-spark-local,1] on broker 101: Shrinking ISR for partition [sod-spins-spark-local,1] from 101,103,104 to 101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,764] INFO Partition [sod-spins-spark-local,1] on broker 101: Cached zkVersion [379] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2017-02-28 18:42:45,764] INFO Partition [sod-spins-spark-local,11] on broker 101: Shrinking ISR for partition [sod-spins-spark-local,11] from 102,101,105 to 101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,767] INFO Partition [sod-spins-spark-local,11] on broker 101: Cached zkVersion [237] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{noformat}
The "current" server is 101. So it thinks it's the leader for basically every partition on that node, but it's refusing to update the ISRs, because the cached zkVersion doesn't match the one in zookeeper. This is causing permanently under-replicated partitions, because the server never catches up, since it doesn't think there's a problem. Also, the metadata reported by the 101 server to consumers indicates it thinks it's part of the ISR, but every other broker disagrees. Let me know if more logs/details would be helpful. I'll try to fix this by restarting the node; hopefully that fixes the issue.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880817#comment-15880817 ]

Jun Rao commented on KAFKA-2729:
--------------------------------

[~prasincs], if the controller is partitioned off from the other brokers and ZK, the expected flow is the following: (1) the ZK server detects that the old controller's session has expired; (2) the controller path is removed by ZK; (3) a new controller is elected and changes leaders/ISRs; (4) the network comes back on the old controller; (5) the old controller receives the ZK session expiration event; (6) the old controller stops doing controller work and resigns. Note that the old controller doesn't really know that it's no longer the controller until step (5). The gap we have now is that step (6) is not done in a timely fashion. Are you deploying Kafka in the same data center? What kind of network partitions are you seeing? Typically, we expect network partitions to be rare within the same data center. If there are short network glitches, one temporary fix is to increase the ZK session timeout to accommodate them until the network issue is fixed.
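The "Cached zkVersion" log line falls out of step (6) arriving late: ISR updates are written to ZK as conditional updates keyed on the znode version, so a broker whose cached leader/ISR state came from the zombie controller keeps failing the compare and skips the update. A rough sketch of that mechanism using the raw ZooKeeper client (names and paths are illustrative; this is not Kafka's actual ReplicationUtils code):

{code}
import org.apache.zookeeper.{KeeperException, ZooKeeper}

// zk is an established session; returns (succeeded, versionAfterWrite)
def conditionalUpdate(zk: ZooKeeper, path: String, data: Array[Byte],
                      expectedVersion: Int): (Boolean, Int) = {
  try {
    // setData succeeds only if the znode's current version == expectedVersion
    val stat = zk.setData(path, data, expectedVersion)
    (true, stat.getVersion)
  } catch {
    case _: KeeperException.BadVersionException =>
      // Someone else (e.g. the new controller) wrote first; our cached
      // version is stale, so the update is skipped -- the loop seen in this bug
      (false, expectedVersion)
  }
}
{code}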
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879885#comment-15879885 ]

Prasanna Gautam commented on KAFKA-2729:
-----------------------------------------

[~junrao] Thanks for looking into this. Do you mind elaborating on what you need to change in the ZK API, and whether https://issues.apache.org/jira/browse/KAFKA-3083 is going to solve it? The issue here is that network partitions, which are unrelated to the 3 points and can happen at any time, can trigger this and leave the brokers in a messed-up state until a restart. Can this be fixed by handling the ZK connection errors? If restarting the broker is the only fix, maybe the proper thing to do is to crash and let a supervisor, etc. restart the service?
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879700#comment-15879700 ]

Jun Rao commented on KAFKA-2729:
--------------------------------

Sorry to hear about the impact to production. Grant mentioned ZK session expiration, which is indeed a potential cause of this issue. A related issue has been reported in KAFKA-3083. The issue is that when the controller's ZK session expires and it loses its controllership, it's possible for this zombie controller to continue updating ZK and/or sending LeaderAndIsrRequests to the brokers for a short period of time. When this happens, a broker may not have the most up-to-date information about leader and ISR, which can lead to subsequent ZK failures when the ISR needs to be updated. Fixing this issue requires us to change the way we use the ZK API and may take some time. In the interim, one suggestion is to make sure ZK session expiration never happens. This can be achieved by making sure that (1) the ZK servers are performing well, (2) the brokers don't have long GCs, and (3) the ZK session expiration time is large enough.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879554#comment-15879554 ]

Kane Kim commented on KAFKA-2729:
---------------------------------

In my opinion it doesn't matter what's causing it (in our case it was indeed lost packets to zookeeper); the culprit is that brokers will not recover by themselves without a rolling restart. This is a real problem and has to be fixed.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879358#comment-15879358 ]

Prateek Jaipuria commented on KAFKA-2729:
------------------------------------------

[~granthenke] We don't see any zookeeper disconnections. Just:
{code}
INFO Partition [topic,n] on broker m: Cached zkVersion [xxx] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879213#comment-15879213 ]

Dave Thomas commented on KAFKA-2729:
-------------------------------------

[~granthenke] We don't see brokers recovering. The message we see is:
{noformat}
Cached zkVersion [xxx] not equal to that in zookeeper, skip updating ISR
{noformat}
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879124#comment-15879124 ]

Grant Henke commented on KAFKA-2729:
-------------------------------------

I am curious if everyone on this Jira is actually seeing the reported issue. I have had multiple cases where someone presented me with an environment they thought was experiencing this issue. After researching the environment and logs, to date it has always been something else. The main culprits so far have been:
* Long GC pauses causing zookeeper sessions to time out
* Slow or poorly configured zookeeper
* Bad network configuration

All of the above resulted in a soft, recurring failure of brokers. That churn often caused additional load, perpetuating the issue. If you are seeing this issue, do you see the following pattern repeating in the logs?
{noformat}
INFO org.I0Itec.zkclient.ZkClient: zookeeper state changed (Disconnected)
...
INFO org.I0Itec.zkclient.ZkClient: zookeeper state changed (Expired)
INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x153ab38abdbd360 has expired, closing socket connection
...
INFO org.I0Itec.zkclient.ZkClient: zookeeper state changed (SyncConnected)
INFO kafka.server.KafkaHealthcheck: re-registering broker info in ZK for broker 32
INFO kafka.utils.ZKCheckedEphemeral: Creating /brokers/ids/32 (is it secure? false)
INFO kafka.utils.ZKCheckedEphemeral: Result of znode creation is: OK
{noformat}
If so, something is causing communication with zookeeper to take too long, and the broker is unregistering itself. This will cause ISRs to shrink and expand over and over again. I don't think this will solve everyone's issue here, but hopefully it will help solve some.
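A quick way to check for the pattern Grant describes is to scan the broker log for ZK client state changes and session re-registrations (the log file name and location vary by install):
{noformat}
grep -E "zookeeper state changed \((Disconnected|Expired|SyncConnected)\)|re-registering broker info" /var/log/kafka/server.log
{noformat}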
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879076#comment-15879076 ]

Prateek Jaipuria commented on KAFKA-2729:
------------------------------------------

Having the same issue with 0.10.1.0 on an 8-node cluster. Restarting the node also does not help; the problem just moves on to another node. This is becoming a deal breaker. Definitely losing trust in Kafka. Definitely a BLOCKER!
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869703#comment-15869703 ]

Vladimír Kleštinec commented on KAFKA-2729:
---------------------------------------------

[~elevy] Agree, we are experiencing the same issue. This is a real blocker and we are losing trust in Kafka...
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868971#comment-15868971 ]

Elias Levy commented on KAFKA-2729:
------------------------------------

Hit this again during testing with 0.10.0.1 on a 10-node broker cluster with a 3-node ZK ensemble. This should have priority Blocker instead of Major.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865486#comment-15865486 ]

Prasanna Gautam commented on KAFKA-2729:
-----------------------------------------

This is still replicable in Kafka 0.10.1.1 when Kafka brokers are partitioned from each other and zookeeper gets disconnected from the brokers briefly and comes back. This situation leads to brokers getting stuck comparing the cached zkVersion and unable to expand the ISR. The code in Partition.scala does not seem to handle error conditions other than the stale zkVersion. In addition to skipping in the current loop, I think it should reconnect to zookeeper to update the current state and version. Here's a suggestion to do this; doing it asynchronously doesn't break the flow, and you can update the state. ZkVersion may not be the only thing to update here.
{code}
val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch, newIsr.map(r => r.brokerId).toList, zkVersion)
val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkUtils, topic, partitionId,
  newLeaderAndIsr, controllerEpoch, zkVersion)

if (updateSucceeded) {
  replicaManager.recordIsrChange(new TopicAndPartition(topic, partitionId))
  inSyncReplicas = newIsr
  zkVersion = newVersion
  trace("ISR updated to [%s] and zkVersion updated to [%d]".format(newIsr.mkString(","), zkVersion))
} else {
  info("Cached zkVersion [%d] not equal to that in zookeeper, skip updating ISR".format(zkVersion))
  zkVersion = asyncUpdateTopicPartitionVersion(topic, partitionId)
}
{code}
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860948#comment-15860948 ]

Sinóros-Szabó Péter commented on KAFKA-2729:
----------------------------------------------

Do you have any plan to resolve this? Or is there a workaround for this issue?
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840494#comment-15840494 ]

Dave Thomas commented on KAFKA-2729:
-------------------------------------

Same with us, on 0.10.1.1 (following an upgrade from 0.10.1.0, where we saw the same issue).
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826905#comment-15826905 ]

Sinóros-Szabó Péter commented on KAFKA-2729:
----------------------------------------------

Same issue on 0.10.1.1. Do you need logs? I can collect them next time I see this.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690420#comment-15690420 ] derek commented on KAFKA-2729: -- I'm on 0.10.1.0 and seeing the same thing. Maybe related to what [~cmolter] is saying above: what we see in the logs just prior to a broker becoming under-replicated is a flurry of
{noformat}
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
{noformat}
messages. After that we see a bunch of activity around adding and removing fetchers, then it goes into the infinite ISR shrink loop. The only way we can recover is to restart.
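If it helps with triage: once a broker is stuck in this loop, the skip message repeats continuously, so it is easy to alert on with a plain log grep (the log path below is a placeholder for your installation):
{code}
# Count occurrences of the stuck-ISR message in the broker log.
# A count that keeps growing between checks means the broker is in
# the loop and, per the reports in this thread, needs a bounce.
grep -c "Cached zkVersion" /var/log/kafka/server.log
{code}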
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15467691#comment-15467691 ] Charly Molter commented on KAFKA-2729: -- Hi, we had this issue on a test cluster, so I took time to investigate some more.

We had a bunch of disconnections to Zookeeper, and we had 2 changes of controller in a short time: broker 103 was controller with epoch 44, then broker 104 was controller with epoch 45.

I looked at one specific partition and found the following pattern: 101 was the broker which thought it was leader, but it kept failing to shrink the ISR with:
{noformat}
Partition [verifiable-test-topic,0] on broker 101: Shrinking ISR for partition [verifiable-test-topic,0] from 101,301,201 to 101,201
Partition [verifiable-test-topic,0] on broker 101: Cached zkVersion [185] not equal to that in zookeeper, skip updating ISR
{noformat}
Looking at ZK we have:
{noformat}
get /brokers/topics/verifiable-test-topic/partitions/0/state
{"controller_epoch":44,"leader":301,"version":1,"leader_epoch":96,"isr":[301]}
{noformat}
And metadata (from a random broker) is saying:
{noformat}
Topic: verifiable-test-topic	Partition: 0	Leader: 301	Replicas: 101,201,301	Isr: 301
{noformat}
Digging in the logs, here's what we think happened:
1. 103 sends becomeFollower to 301 with epoch 44 and leaderEpoch 95
2. 104 sends becomeLeader to 101 with epoch 45 and leaderEpoch 95 (after updating zk!)
3. 103 sends becomeLeader to 301 with epoch 44 and leaderEpoch 96 (after updating zk!)
4. 104 sends becomeFollower to 301 with epoch 45 and leaderEpoch 95

Request 4 is ignored by 301, as its leaderEpoch is older than the current one. And one request is missing entirely: 103 sends becomeFollower to 101 with epoch 44 and leaderEpoch 95. I believe this happened because when the controller steps down it empties its request queue, so this request never left the controller: https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/ControllerChannelManager.scala#L53-L57

So we ended up in a case where 301 and 101 both think they are leaders. Naturally 101 wants to update the state in ZK to remove 301, as 301 isn't even fetching from 101. Does this seem correct to you?

It seems impossible to guarantee that controllers never overlap, which could make it quite hard to avoid having 2 leaders for a short time. Still, there should be a way for this situation to get back to a good state. I believe the impact of this would be:
- with acks = -1: unavailability
- with acks != -1: possible log divergence, depending on min in-sync replicas (I'm unsure about this).

Hope this helps. I had to fix the cluster by bouncing a node, but I kept most of the logs, so let me know if you need more info.
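A minimal sketch of the drop [~cmolter] describes, assuming his reading of ControllerChannelManager is right; the class and method names below are invented for illustration and are not Kafka's actual code. The point is simply that clearing the controller's per-broker send queue on resignation silently discards any queued becomeFollower request, leaving two brokers that each believe they lead the partition:
{code}
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical stand-in for a controller-to-broker state-change request.
case class StateChangeRequest(targetBroker: Int, kind: String, leaderEpoch: Int)

// Hypothetical stand-in for the controller's per-broker send queue.
class SendQueueSketch {
  private val queue = new LinkedBlockingQueue[StateChangeRequest]()

  def enqueue(r: StateChangeRequest): Unit = queue.put(r)

  // Mirrors the linked behaviour: on shutdown the queue is cleared
  // without sending, so queued requests never reach their brokers.
  def shutdown(): Unit = queue.clear()

  def pending: Int = queue.size
}

object ControllerFailoverSketch extends App {
  val toBroker101 = new SendQueueSketch

  // Controller 103 (epoch 44) queues a becomeFollower for broker 101...
  toBroker101.enqueue(StateChangeRequest(101, "becomeFollower", leaderEpoch = 95))

  // ...but resigns before the send thread drains the queue.
  toBroker101.shutdown()

  // The request is gone: 101 never learns it is no longer leader,
  // so it keeps trying (and failing) to shrink the ISR.
  println(s"requests still pending for broker 101: ${toBroker101.pending}") // prints 0
}
{code}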
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412359#comment-15412359 ] Kane Kim commented on KAFKA-2729: - For us the reason was a high percentage of lost packets to one of the ZK nodes (from broker to ZK). After we fixed that, the situation got a lot better.
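For anyone wanting to rule this cause in or out, a rough packet-loss check from a broker host to each ZooKeeper node (the zk1..zk3 hostnames are placeholders):
{code}
# Print the packet-loss summary line for each ZooKeeper node.
for zk in zk1 zk2 zk3; do
  echo -n "$zk: "; ping -c 100 -q "$zk" | grep 'packet loss'
done
{code}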
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412126#comment-15412126 ] Michael Sandrof commented on KAFKA-2729: We seem to be having a similar problem running 0.10.0.0. However, no amount of broker restarting corrects the problem. Once it happens, I see periodic "Cached zkVersion" messages along with complete instability in the ISRs. The continuous shrinking and expanding of the ISRs makes the cluster unusable, as we need 2 ISRs for our durability requirements. The only thing that fixes the problem is to delete all topics, recreate and reload. This isn't a practical approach for our production system, in which we are using Kafka as a transactionally consistent replica of a relational database. Anyone have any clues about how to prevent this from happening?
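None of this fixes the bug, but for a durability requirement of 2 in-sync replicas the standard guardrails (both exist in 0.9+) make writes fail fast instead of landing on a single-replica ISR while the loop is happening. A sketch, assuming defaults everywhere else:
{code}
# Broker-wide in server.properties (or as a per-topic override):
# reject acked writes unless at least 2 replicas are in sync.
min.insync.replicas=2

# Producer configuration: require acknowledgement from all
# in-sync replicas, so the setting above is actually enforced.
acks=all
{code}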
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402900#comment-15402900 ] Joshua Dickerson commented on KAFKA-2729: - This has bitten us twice in our live environment, on 0.9.0.1. Restarting the affected broker(s) is the only thing that seems to fix it.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399928#comment-15399928 ] Konstantin Zadorozhny commented on KAFKA-2729: -- Seeing the same issue in our staging and production environments on 0.9.0.1. Bouncing brokers helps, but it's still not ideal. The staging cluster was left to "recover" for a day. It didn't happen.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395915#comment-15395915 ] James Carnegie commented on KAFKA-2729: --- That's our experience, though the only other thing we've tried is leaving it for a while.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395821#comment-15395821 ] William Yu commented on KAFKA-2729: --- We are also seeing this in our production cluster, running Kafka 0.9.0.1. Is restarting the only solution?
{code}
[2016-07-27 14:36:15,807] INFO Partition [tasks,265] on broker 4: Cached zkVersion [182] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-07-27 14:36:15,807] INFO Partition [tasks,150] on broker 4: Shrinking ISR for partition [tasks,150] from 6,4,7 to 4 (kafka.cluster.Partition)
{code}
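From the reports elsewhere in this thread, restarting the stuck broker does seem to be the only reliable recovery. With the stock scripts that would be something like the following, assuming the standard distribution layout (run on the affected broker only):
{code}
# Stop the stuck broker, then bring it back with the same config.
bin/kafka-server-stop.sh
bin/kafka-server-start.sh -daemon config/server.properties
{code}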
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375878#comment-15375878 ] Tyler Bischel commented on KAFKA-2729: -- We are also seeing this issue in 0.10.0.0 pretty much daily right now.
{code}
[2016-07-13 21:30:50,170] 1292384 [kafka-scheduler-0] INFO kafka.cluster.Partition - Partition [events,580] on broker 10432234: Cached zkVersion [1267] not equal to that in zookeeper, skip updating ISR
{code}
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366237#comment-15366237 ] Chris Rodier commented on KAFKA-2729: - We also observed this identical issue on 0.9.0.1 today. A restart of the failed broker resolved the issue without difficulty, as a workaround. This seems like a high-priority issue, since you could lose nodes and/or lose a cluster fairly easily due to zookeeper instability / elections.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306654#comment-15306654 ] Joel Pfaff commented on KAFKA-2729: --- We have hit that as well on 0.9.0.1 today, same logs, and only a reboot of the faulty broker resolved the problem.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263678#comment-15263678 ] Stig Rohde Døssing commented on KAFKA-2729: --- We hit this on 0.9.0.1 today:
{code}
[2016-04-28 19:18:22,834] INFO Partition [dce-data,13] on broker 3: Shrinking ISR for partition [dce-data,13] from 3,2 to 3 (kafka.cluster.Partition)
[2016-04-28 19:18:22,845] INFO Partition [dce-data,13] on broker 3: Cached zkVersion [304] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-04-28 19:18:32,785] INFO Partition [dce-data,16] on broker 3: Shrinking ISR for partition [dce-data,16] from 3,2 to 3 (kafka.cluster.Partition)
[2016-04-28 19:18:32,803] INFO Partition [dce-data,16] on broker 3: Cached zkVersion [312] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
This continued until we rebooted broker 3. At that time the ISR in Zookeeper contained only broker 2, and there was no leader for the affected partitions. I believe the preferred leader for these partitions was 3.
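When this happens, two read-only checks make it easy to capture the divergence between the broker's cached view and ZooKeeper before bouncing anything (tool paths assume a standard distribution; the topic and partition below are this report's examples):
{code}
# Leader/ISR per partition as reported through Kafka metadata.
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic dce-data

# ZooKeeper's authoritative partition state (leader, leader_epoch, isr);
# the version of this znode is what the broker's cached zkVersion
# is compared against.
bin/zookeeper-shell.sh zk1:2181 get /brokers/topics/dce-data/partitions/13/state
{code}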
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261189#comment-15261189 ] Kane Kim commented on KAFKA-2729: - The same problem with the same symptoms occurred on kafka 0.8.2.1. After a network glitch, brokers fall out of the ISR set with
{noformat}
Cached zkVersion [5] not equal to that in zookeeper, skip updating ISR
{noformat}
The broker never recovers from this state until a restart.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171252#comment-15171252 ] Michal Harish commented on KAFKA-2729: -- Hit this on Kafka 0.8.2.2 as well.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158771#comment-15158771 ] Petri Lehtinen commented on KAFKA-2729: --- This happened to me (again) a few days ago on 0.9.0.0 on a cluster of 2 kafka nodes.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151423#comment-15151423 ] James Cheng commented on KAFKA-2729: We ran into the same issue today while running 0.9.0.0.
{code}
[2016-02-17 22:49:52,638] INFO Partition [the.topic.name,22] on broker 2: Cached zkVersion [5] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133716#comment-15133716 ] Elias Levy commented on KAFKA-2729: --- Had the same issue happen here while testing a 5 node Kafka cluster with a 3 node ZK ensemble on Kubernetes on AWS. After running for a while, broker 2 started showing the "Cached zkVersion [29] not equal to that in zookeeper, skip updating ISR" error message for all the partitions it leads. For those partitions it is the only in-sync replica. That has caused the Samza jobs I was running to stop.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095901#comment-15095901 ] Andres Gomez Ferrer commented on KAFKA-2729: Our kafka cluster hit the same issue too:
{code}
[2016-01-12 01:16:15,907] INFO Partition [__consumer_offsets,10] on broker 0: Shrinking ISR for partition [__consumer_offsets,10] from 0,1 to 0 (kafka.cluster.Partition)
[2016-01-12 01:16:15,909] INFO Partition [__consumer_offsets,10] on broker 0: Cached zkVersion [3240] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-01-12 01:16:15,909] INFO Partition [__consumer_offsets,45] on broker 0: Shrinking ISR for partition [__consumer_offsets,45] from 0,1 to 0 (kafka.cluster.Partition)
[2016-01-12 01:16:15,911] INFO Partition [__consumer_offsets,45] on broker 0: Cached zkVersion [3192] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-01-12 01:16:15,911] INFO Partition [__consumer_offsets,24] on broker 0: Shrinking ISR for partition [__consumer_offsets,24] from 0,1 to 0 (kafka.cluster.Partition)
[2016-01-12 01:16:15,912] INFO Partition [__consumer_offsets,24] on broker 0: Cached zkVersion [3233] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
The Kafka version is 0.8.2.2.
[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.
[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025008#comment-15025008 ] Iskandarov Eduard commented on KAFKA-2729: -- Our kafka cluster hit the same issue:
{noformat}
kafka2 1448388319:093 [2015-11-24 21:05:19,387] INFO Partition [dstat_wc_cpl_log,13] on broker 2: Shrinking ISR for partition [dstat_wc_cpl_log,13] from 2,1 to 2 (kafka.cluster.Partition)
kafka2 1448388319:094 [2015-11-24 21:05:19,404] INFO Partition [dstat_wc_cpl_log,13] on broker 2: Cached zkVersion [332] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{noformat}
We use confluent.io's kafka distribution; the Kafka version is 0.8.2.2.