[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2022-07-23 Thread Haifeng Chen (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570326#comment-17570326 ]

Haifeng Chen commented on KAFKA-2729:
-

We saw this issue on 1.1 while Kafka was reconnecting to ZooKeeper. It caused 
partitions to drop below min ISR, and the cluster recovered within 2 minutes.

 

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0, 2.4.1
>Reporter: Danil Serdyuchenko
>Assignee: Onur Karaman
>Priority: Critical
> Fix For: 1.1.0
>
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the affected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing; I was hoping you 
> could shed some light on this issue, and possibly on whether it's related to 
> https://issues.apache.org/jira/browse/KAFKA-1382 , though we're using 
> 0.8.2.1.





[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2022-01-27 Thread Yiming Zang (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483450#comment-17483450 ]

Yiming Zang commented on KAFKA-2729:


We are still seeing this issue on 2.7.0; I'm not sure whether it has been 
resolved or not. When it happens, partitions fall below min ISR and produce 
requests start to fail. It is often triggered when a single broker is 
restarted.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2022-01-10 Thread Jun Rao (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472207#comment-17472207 ]

Jun Rao commented on KAFKA-2729:


[~mgabriel]: The BadVersion on the ZK server just indicates that a conditional 
update has failed. It's the result, not the cause. To understand the cause, 
we need to know whether the controller changed the metadata for partition topicXYZ-1 
before the "Cached zkVersion" message was reported. You can grep the 
state-change log in the controller to find that out. If the controller didn't 
make the change, you can parse the ZK commit log to see which client updated 
the partition metadata. If the controller did make the change, you can then 
look at the state-change log on node 1 to see if it received the latest 
partition metadata from the controller.
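
For reference, the "conditional update" here is ZooKeeper's compare-and-set on a znode's data version. A minimal Java sketch of what such an update amounts to (illustrative only, not Kafka's actual code; the path, the ZooKeeper handle, and the serialized new state are assumptions):
{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class ConditionalUpdateSketch {
    // Attempt an ISR-style conditional write against the cached znode version.
    static void conditionalUpdate(ZooKeeper zk, byte[] newState) throws Exception {
        String path = "/brokers/topics/topicXYZ/partitions/1/state"; // assumed path
        Stat stat = new Stat();
        zk.getData(path, false, stat);            // read state, fill in Stat
        int cachedZkVersion = stat.getVersion();  // this is the cached "zkVersion"
        try {
            // Succeeds only if nobody has changed the znode since the read above.
            zk.setData(path, newState, cachedZkVersion);
        } catch (KeeperException.BadVersionException e) {
            // Another client (e.g. the controller) bumped the version in between:
            // the ZK server logs BadVersion, and the broker keeps its stale cache,
            // logging "Cached zkVersion [...] not equal to that in zookeeper".
        }
    }
}
{code}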



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2022-01-10 Thread Matthias Gabriel (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472112#comment-17472112 ]

Matthias Gabriel commented on KAFKA-2729:
-

Hey [~junrao],

We also have the same issue recurring once a week in version 1.1.0, which is 
marked as the "Fix version".

We run a cluster with 3 Kafka Brokers:

Node-1

 
{code:java}
[2021-12-31 19:12:23,540] INFO [Partition topicXYZ-1 broker=1] Shrinking ISR 
from 5,3,1 to 5,1 (kafka.cluster.Partition)
[2021-12-31 19:12:23,544] INFO [Partition topicXYZ-1 broker=1] Cached zkVersion 
[326] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition){code}
 

On Node-2 we do not see any related message for the time period.

On Node-3 we have the following message, which we are not sure is related 
at all:
{code:java}
2021-12-31 19:12:23,541 [myid:3] - INFO  [ProcessThread(sid:3 
cport:-1)::PrepRequestProcessor@653] - Got user-level KeeperException when 
processing sessionid:0x1004521e6ed type:setData cxid:0xbca4 
zxid:0x3a4a372 txntype:-1 reqpath:n/a Error 
Path:/brokers/topics/topicXYZ/partitions/1/state Error:KeeperErrorCode = 
BadVersion for /brokers/topics/topicXYZ/partitions/1/state{code}
Do you have any idea what we could do or which data we could deliver to give 
you additional insights?

Thanks
Matthias



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2021-08-09 Thread Jun Rao (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396210#comment-17396210 ]

Jun Rao commented on KAFKA-2729:


[~axrj]: If broker 5's ZK session expires and gets re-established, its broker 
epoch will change. So, it's possible for broker 5 to receive and reject a 
LeaderAndIsr request from the controller temporarily. The question is whether 
broker 5 eventually receives the LeaderAndIsr request when the controller has 
detected the new broker registration. This can be verified from the controller 
and the state-change log.
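
In outline, the broker compares the broker epoch that the controller stamped into the request against its own current epoch (derived from the broker's ZK registration). A simplified sketch of that check (names are illustrative, not Kafka's exact code):
{code:java}
// Simplified, illustrative broker-epoch staleness check.
static boolean isBrokerEpochStale(long epochInRequest, long currentBrokerEpoch) {
    if (epochInRequest < currentBrokerEpoch) {
        // The controller built the request before this broker re-registered:
        // the request is stale and is rejected.
        return true;
    }
    if (epochInRequest > currentBrokerEpoch) {
        // The broker has not yet observed its own new registration; compare the
        // "Epoch ... larger than current broker epoch" error elsewhere in this thread.
        throw new IllegalStateException("Epoch " + epochInRequest
                + " larger than current broker epoch " + currentBrokerEpoch);
    }
    return false; // epochs match: process the request
}
{code}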



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2021-07-29 Thread Raj (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389844#comment-17389844 ]

Raj commented on KAFKA-2729:


Hi [~junrao] ,

We just hit this in our production as well, although I was able to resolve it 
by restarting only the broker that reported the errors, as opposed to the 
controller or the whole cluster.

Kafka version: 2.3.1

I can confirm the events are identical to what [~l0co] explained above:
 * ZK session disconnected on broker 5
 * Replica fetchers stopped on the other brokers
 * ZK connection re-established on broker 5 after a few seconds
 * Broker 5 came back online, started reporting "Cached zkVersion [130] 
not equal to...", and shrank ISRs to only itself

As it didn't recover automatically, I restarted the broker after 30 minutes, 
and it then went back to normal.

I did see that the controller tried to send the correct metadata to broker 5, 
but it was rejected due to an epoch inconsistency.
{noformat}
ERROR [KafkaApi-5] Error when handling request: clientId=21, correlationId=2, 
api=UPDATE_METADATA, 
body={controller_id=21,controller_epoch=53,broker_epoch=223338313060,topic_states=[{topic-a,partition_states=[{partition=0,controller_epoch=53,leader=25,leader_epoch=70,isr=[25,17],zk_version=131,replicas=[5,25,17],offline_replicas=[]}...
...
java.lang.IllegalStateException: Epoch 223338313060 larger than current broker 
epoch 223338311791
at kafka.server.KafkaApis.isBrokerEpochStale(KafkaApis.scala:2612)
at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:194)
at kafka.server.KafkaApis.handle(KafkaApis.scala:117)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
at java.base/java.lang.Thread.run(Thread.java:834)
...
...
...
[2021-07-29 11:07:30,210] INFO [Partition topic-a-0 broker=5] Cached zkVersion 
[130] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
...

{noformat}
 

Preferred leader election error as seen on the controller:
{noformat}
[2021-07-29 11:11:57,432] ERROR [Controller id=21] Error completing preferred 
replica leader election for partition topic-a-0 
(kafka.controller.KafkaController)
kafka.common.StateChangeFailedException: Failed to elect leader for partition 
topic-a-0 under strategy PreferredReplicaPartitionLeaderElectionStrategy
at 
kafka.controller.ZkPartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:381)
at 
kafka.controller.ZkPartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:378)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:378)
at 
kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:305)
at 
kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:215)
at 
kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:145)
at 
kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:646)
at 
kafka.controller.KafkaController$$anonfun$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:995)
at 
kafka.controller.KafkaController$$anonfun$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:976)
at 
scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
at 
scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
at 
kafka.controller.KafkaController.checkAndTriggerAutoLeaderRebalance(KafkaController.scala:976)
at 
kafka.controller.KafkaController.processAutoPreferredReplicaLeaderElection(KafkaController.scala:1004)
at kafka.controller.KafkaController.process(KafkaController.scala:1564)
at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:53)
at 
kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:137)
at 
kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:137)
at 
kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:137)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
at 
kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:136)
at 
kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:89){noformat}
 

After the restart of broker-5, it was able to take back leadership of the 
desired partitions.

 

Kindly let me know if 

[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2021-06-30 Thread Jun Rao (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17372214#comment-17372214 ]

Jun Rao commented on KAFKA-2729:


[~l0co] the leaderEpoch doesn't always match the zkVersion. For example, when 
the leader expands/shrinks the ISR, it changes the zkVersion but not the 
leaderEpoch.
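
Concretely, the leaderEpoch lives inside the znode's JSON payload, while the zkVersion is the znode's own data version kept in its Stat. A representative partition state (values are illustrative):
{noformat}
# /brokers/topics/topicXYZ/partitions/1/state -- payload carries leader_epoch
{"controller_epoch":53,"leader":1,"version":1,"leader_epoch":117,"isr":[1,2]}

# znode Stat -- dataVersion is the "zkVersion" the broker caches
dataVersion = 212
{noformat}
An ISR shrink rewrites the "isr" field and bumps dataVersion to 213, but leader_epoch stays at 117.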



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2021-06-29 Thread l0co (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371202#comment-17371202 ]

l0co commented on KAFKA-2729:
-

[~junrao] thanks for the reply. Unfortunately, from the logs preserved from 
this breakdown, the only useful entries I have are these:
{code:java}
[2021-06-22 14:06:50,637] INFO 1/kafka0/server.log.2021-06-22-14: [Partition 
__consumer_offsets-30 broker=0] __consumer_offsets-30 starts at Leader Epoch 
117 from offset 2612283. Previous Leader Epoch was: 116 
(kafka.cluster.Partition)
[2021-06-22 14:07:04,184] INFO 1/kafka1/server.log.2021-06-22-14: [Partition 
__consumer_offsets-30 broker=1] Shrinking ISR from 1,2,0 to 1,2 
(kafka.cluster.Partition)
[2021-06-22 14:07:04,186] INFO 1/kafka1/server.log.2021-06-22-14: [Partition 
__consumer_offsets-30 broker=1] Cached zkVersion [212] not equal to that in 
zookeeper, skip updating ISR (kafka.cluster.Partition)
[2021-06-22 14:07:09,146] INFO 1/kafka1/server.log.2021-06-22-14: [Partition 
__consumer_offsets-30 broker=1] Shrinking ISR from 1,2,0 to 1,2 
(kafka.cluster.Partition)
[2021-06-22 14:07:09,147] INFO 1/kafka1/server.log.2021-06-22-14: [Partition 
__consumer_offsets-30 broker=1] Cached zkVersion [212] not equal to that in 
zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
After the ZooKeeper reconnection on kafka0, kafka0 becomes the leader with 
epoch 117, and then kafka1 starts to complain that its cached zkVersion [212] 
does not match the version in ZooKeeper, which is a greater number. What does 
that mean to you? We suspect that kafka0's ZooKeeper node was disconnected from 
the kafka1 and kafka2 ZooKeeper nodes and established its own separate cluster, 
and that after all the ZooKeeper nodes rejoined into one cluster, the state 
became inconsistent. Does that make sense to you?



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2021-06-28 Thread Jun Rao (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370724#comment-17370724 ]

Jun Rao commented on KAFKA-2729:


[~l0co], thanks for reporting this. The "Cached zkVersion [212]" error 
indicates the leader epoch was changed by the controller but somehow wasn't 
propagated to the broker. Could you grep for "Partition __consumer_offsets-30" 
in the controller and state-change log and see which controller changed the 
leader epoch corresponding to zk version 212 and whether the controller tried 
to propagate that info to the brokers?
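
For example, something along these lines (log file names per a default Kafka distribution; the log directory is an assumption):
{noformat}
# on each broker: who changed the partition state, and to which zkVersion?
grep "__consumer_offsets-30" /var/log/kafka/state-change.log*

# on the controller: did it try to propagate the new leader epoch to the brokers?
grep "__consumer_offsets-30" /var/log/kafka/controller.log*
{noformat}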



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2021-06-24 Thread l0co (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368661#comment-17368661 ]

l0co commented on KAFKA-2729:
-

This problem is certainly not fixed in `1.1.0` as we still experience it with 
this Kafka version. This ticket should be reopened, unless the problem is being 
resolved elsewhere (KAFKA-3042, KAFKA-7888?).

Our scenario is the following: we have `kafka0`, `kafka1` and `kafka2` nodes.

1. `kafka0` loses its zookeeper connection
{code:java}
WARN Unable to reconnect to ZooKeeper service, session 0x27a31276f6d has 
expired (org.apache.zookeeper.ClientCnxn)
INFO Unable to reconnect to ZooKeeper service, session 0x27a31276f6d has 
expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
INFO EventThread shut down for session: 0x27a31276f6d 
(org.apache.zookeeper.ClientCnxn)
{code}
2. However, a second later the connection is established properly:
{code:java}
[ZooKeeperClient] Initializing a new session to [...] 
(kafka.zookeeper.ZooKeeperClient)
[2021-06-22 14:06:47,838] INFO Opening socket connection to server [...]. Will 
not attempt to authenticate using SASL (unknown error) 
(org.apache.zookeeper.ClientCnxn)
[2021-06-22 14:06:47,873] INFO Socket connection established to [...], 
initiating session (org.apache.zookeeper.ClientCnxn)
[2021-06-22 14:06:47,933] INFO Creating /brokers/ids/0 (is it secure? false) 
(kafka.zk.KafkaZkClient)
[2021-06-22 14:06:47,959] INFO Session establishment complete on server [...], 
sessionid = 0x27a31276f6d0003, negotiated timeout = 6000 
(org.apache.zookeeper.ClientCnxn)
{code}
3. But a few seconds later `ReplicaFetcherThread` is shut down in `kafka0`:
{code:java}
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Shutting down 
(kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Stopped 
(kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Shutdown completed 
(kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Shutting down 
(kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Stopped 
(kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Shutdown completed 
(kafka.server.ReplicaFetcherThread)
{code}
We suppose this shutdown is the source of the problem.

4. Now, because there are no replication requests from `kafka0` to `kafka1` and 
`kafka2`, `kafka1` and `kafka2` shrink the ISR list and start to complain about 
the zkVersion.
{code:java}
INFO [Partition __consumer_offsets-30 broker=1] Shrinking ISR from 1,2,0 to 1,2 
(kafka.cluster.Partition)
INFO [Partition __consumer_offsets-30 broker=1] Cached zkVersion [212] not 
equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
This goes on forever, until the whole cluster is restarted. Note that the 
cluster state is now inconsistent: `kafka0` stops being a replica for `kafka1` 
and `kafka2`, but `kafka1` and `kafka2` are still working as replicas for 
`kafka0`. This is because the `ReplicaFetcherThread` has only been stopped on 
`kafka0`.

5. Finally, the whole kafka cluster doesn't work and stops processing events, 
at least for partitions led by `kafka0`, because of:
{code:java}
ERROR [ReplicaManager broker=0] Error processing append operation on partition 
__consumer_offsets-18 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.NotEnoughReplicasException: Number of insync 
replicas for partition __consumer_offsets-18 is [1], below required minimum [2]
{code}
We also suspect that in this scenario `kafka0` becomes the leader for all 
partitions, but this is not confirmed yet.
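
For context, the NotEnoughReplicasException in step 5 is the min.insync.replicas check rejecting acks=all produce requests once the ISR has shrunk below the configured minimum, e.g. with a setting like (value illustrative):
{noformat}
# broker default or per-topic override
min.insync.replicas=2
{noformat}
With this setting, a produce with acks=all to a partition whose ISR has shrunk to a single replica fails exactly as shown above.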

 


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2021-01-05 Thread Jun Rao (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259314#comment-17259314 ]

Jun Rao commented on KAFKA-2729:


If you still see this issue, it would be useful to confirm the following.
 # Is the "Cached zkVersion" message for the same partition long-lasting? 
Transient "Cached zkVersion" messages can happen and are ok.
 # Does the zkVersion for the partition state in ZK match the latest zkVersion 
recorded by the controller (logged in the state-change log)? If not, this 
indicates a potential problem in ZK; see the zkCli sketch below.
 # Otherwise, did the broker receive the latest zkVersion in the LeaderAndIsr 
request from the controller (logged in the state-change log)?
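
For item 2, the znode's current version can be read directly with the ZooKeeper CLI (topic/partition and server address are illustrative):
{noformat}
$ bin/zkCli.sh -server zk1:2181 stat /brokers/topics/topicXYZ/partitions/1/state
...
dataVersion = 212   <- compare with the zkVersion in the controller's state-change log
{noformat}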



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2021-01-02 Thread Victor Garcia (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17257448#comment-17257448 ]

Victor Garcia commented on KAFKA-2729:
--

Yes, it seems this issue is not yet fixed. This should be reopened.

We just had this problem with version 1.1.0.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2020-10-18 Thread M. Manna (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216323#comment-17216323 ]

M. Manna commented on KAFKA-2729:
-

This resurfaced for us in our production environment yesterday and caused an 
outage. Has anyone else seen this issue recently? We are using Confluent 2.4.1, 
but without any customisation.

It'd be good to know if there are any steps to reproduce this reliably. The 
above-mentioned test (network stretch or switching) is quite difficult for us 
to run at the moment.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2019-04-26 Thread Shawn YUAN (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826946#comment-16826946 ]

Shawn YUAN commented on KAFKA-2729:
---

Thank you [~DEvil]. I'm searching for any git commit/patch that fixes this.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2019-04-23 Thread Alexander Binzberger (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823858#comment-16823858 ]

Alexander Binzberger commented on KAFKA-2729:
-

[~evildracula] In an older version of Kafka it was reproducible under very 
high network load.

Possibly when the switches/network are at their limits, maybe in combination 
with dropped packets.

Try adding delay or load to the network and machines, and maybe dropping some 
share of the packets.
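
A rough way to approximate that in a test environment (assuming Linux with tc/netem, and eth0 as the broker's interface):
{noformat}
# add 400ms latency and 5% packet loss on the broker <-> zookeeper path
tc qdisc add dev eth0 root netem delay 400ms loss 5%

# run load until the ZK session expires, then remove the impairment
tc qdisc del dev eth0 root netem
{noformat}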



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2019-04-15 Thread evildracula (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818083#comment-16818083 ]

evildracula commented on KAFKA-2729:


Hello [~junrao], I'm now using 0.11.0.3, which is among the affected versions. 
I would like to reproduce this issue in my DEV environment. Could you please 
provide reproduction steps? Many thanks.

I tried reproducing it with **systemctl start/stop iptables** but failed.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2018-12-13 Thread Jun Rao (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720819#comment-16720819 ]

Jun Rao commented on KAFKA-2729:


[~murri71], my earlier comment was referring to the fact that we fixed the 
following issue in KAFKA-7165. It's possible that that issue may surface as 
"Cached zkVersion" in some scenarios. I am not sure if there are other issues 
that can still lead to "Cached zkVersion". So, my recommendation is to upgrade 
to 2.2 when it's released and file a separate Jira if the issue still exists.

 
{noformat}
[2018-10-05 21:08:28,025] ERROR Error while creating ephemeral at /controller, 
node already exists and owner '3703712903740981258' does not match current 
session '3775770497779040270' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2018-10-05 21:08:28,025] ERROR Error while creating ephemeral at /controller, 
node already exists and owner '3703712903740981258' does not match current 
session '3775770497779040270' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2018-10-05 21:08:28,025] INFO Result of znode creation at /controller is: 
NODEEXISTS (kafka.zk.KafkaZkClient)
{noformat}



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2018-12-12 Thread JIRA


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719147#comment-16719147 ]

Jœl Salmerón Viver commented on KAFKA-2729:
---

[~junrao], this issue is not yet fixed, it seems. We, like others here, are 
experiencing the same replication loop for partitions when trying to delete 
topics via the bin/kafka-topics command using 1.1 brokers.

If it is fixed as you say, could someone note the exact broker version where it 
is fixed?

I am torn about upgrading the brokers to 2.2, as this bug report does not 
reflect what you imply by your "We fixed another issue" comment above.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2018-12-04 Thread Jun Rao (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709539#comment-16709539 ]

Jun Rao commented on KAFKA-2729:


We fixed another issue, which can cause the re-creation of the broker 
registration in ZK to fail, in KAFKA-7165 in 2.2.0.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2018-12-04 Thread Bo Wang (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708393#comment-16708393 ]

Bo Wang commented on KAFKA-2729:


I also hit the same problem with 1.1.0. Is there a patch or version that 
solves this problem?



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2018-10-12 Thread adam keyser (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648157#comment-16648157 ]

adam keyser commented on KAFKA-2729:


Still seeing it here as well. 



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2018-10-05 Thread Luigi Tagliamonte (JIRA)


[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640382#comment-16640382 ]

Luigi Tagliamonte commented on KAFKA-2729:
--

This issue seems not to be fixed in 1.1.

Cluster details:
 * 3-node Kafka cluster running 1.1
 * 3-node ZooKeeper cluster running 3.4.10

Today, while I was replacing a ZooKeeper server, the leader among the brokers 
experienced this issue:
{code:java}
[2018-10-05 21:03:02,799] INFO [GroupMetadataManager brokerId=1] Removed 0 
expired offsets in 0 milliseconds. 
(kafka.coordinator.group.GroupMetadataManager)
[2018-10-05 21:08:20,060] INFO Unable to read additional data from server 
sessionid 0x34663b434985000e, likely server has closed socket, closing socket 
connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2018-10-05 21:08:21,001] INFO Opening socket connection to server 
10.48.208.70/10.48.208.70:2181. Will not attempt to authenticate using SASL 
(unknown error) (org.apache.zookeeper.ClientCnxn)
[2018-10-05 21:08:21,003] WARN Session 0x34663b434985000e for server null, 
unexpected error, closing socket connection and attempting reconnect 
(org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[2018-10-05 21:08:21,797] INFO Opening socket connection to server 
10.48.210.44/10.48.210.44:2181. Will not attempt to authenticate using SASL 
(unknown error) (org.apache.zookeeper.ClientCnxn)
[2018-10-05 21:08:21,799] INFO Socket connection established to 
10.48.210.44/10.48.210.44:2181, initiating session 
(org.apache.zookeeper.ClientCnxn)
[2018-10-05 21:08:21,802] INFO Session establishment complete on server 
10.48.210.44/10.48.210.44:2181, sessionid = 0x34663b434985000e, negotiated 
timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2018-10-05 21:08:28,015] INFO Creating /controller (is it secure? false) 
(kafka.zk.KafkaZkClient)
[2018-10-05 21:08:28,015] INFO Creating /controller (is it secure? false) 
(kafka.zk.KafkaZkClient)
[2018-10-05 21:08:28,025] ERROR Error while creating ephemeral at /controller, 
node already exists and owner '3703712903740981258' does not match current 
session '3775770497779040270' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2018-10-05 21:08:28,025] ERROR Error while creating ephemeral at /controller, 
node already exists and owner '3703712903740981258' does not match current 
session '3775770497779040270' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2018-10-05 21:08:28,025] INFO Result of znode creation at /controller is: 
NODEEXISTS (kafka.zk.KafkaZkClient)
[2018-10-05 21:08:28,025] INFO Result of znode creation at /controller is: 
NODEEXISTS (kafka.zk.KafkaZkClient)
[2018-10-05 21:08:42,561] INFO [Partition -store-changelog-7 broker=1] 
Shrinking ISR from 2,1,3 to 1 (kafka.cluster.Partition)
[2018-10-05 21:08:42,561] INFO [Partition -store-changelog-7 broker=1] 
Shrinking ISR from 2,1,3 to 1 (kafka.cluster.Partition)
[2018-10-05 21:08:42,569] INFO [Partition -store-changelog-7 broker=1] 
Cached zkVersion [11] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2018-10-05 21:08:42,569] INFO [Partition -store-changelog-7 broker=1] 
Cached zkVersion [11] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2018-10-05 21:08:42,569] INFO [Partition bycontact_0-19 broker=1] 
Shrinking ISR from 2,1,3 to 1 (kafka.cluster.Partition)
[2018-10-05 21:08:42,569] INFO [Partition bycontact_0-19 broker=1] 
Shrinking ISR from 2,1,3 to 1 (kafka.cluster.Partition)
[2018-10-05 21:08:42,574] INFO [Partition bycontact_0-19 broker=1] 
Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2018-10-05 21:08:42,574] INFO [Partition bycontact_0-19 broker=1] 
Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition){code}
The only way to recover was to restart the broker.


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2018-01-26 Thread Oleksiy Stashok (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341892#comment-16341892
 ] 

Oleksiy Stashok commented on KAFKA-2729:


Thank you, that really helped!

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>Assignee: Onur Karaman
>Priority: Major
> Fix For: 1.1.0
>
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2018-01-26 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341863#comment-16341863
 ] 

Jun Rao commented on KAFKA-2729:


The problem we fixed related to this jira is KAFKA-5642. Previously, when the 
controller's ZK session expired and it lost its controllership, it was 
possible for this zombie controller to continue updating ZK and/or sending 
LeaderAndIsrRequests to the brokers for a short period of time. When this 
happens, a broker may not have the most up-to-date information about the 
leader and ISR, which can lead to subsequent ZK failures when the ISR needs to 
be updated. KAFKA-5642 fixes the issue by handling the ZK session expiration 
event properly.
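
To make the failure mode concrete: the "Cached zkVersion not equal" message comes from a conditional write. The broker updates the partition-state znode with setData against its cached zkVersion; if the (zombie or new) controller has rewritten the znode since, the expected version no longer matches and ZooKeeper rejects the write. A minimal sketch of that pattern with the raw ZooKeeper client follows; the path and payload are placeholders, not Kafka's actual code.

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalIsrUpdateSketch {

    // Conditionally update a znode using a cached version number. ZooKeeper
    // applies setData only if the znode's current version matches; otherwise
    // it throws BadVersionException and the caller must re-read the znode.
    static boolean tryConditionalUpdate(ZooKeeper zk, String path,
                                        byte[] newState, int cachedZkVersion)
            throws KeeperException, InterruptedException {
        try {
            Stat stat = zk.setData(path, newState, cachedZkVersion);
            System.out.println("updated, new zkVersion = " + stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException e) {
            // Someone else (e.g. the controller) wrote the znode in between,
            // so the cached zkVersion is stale: "skip updating ISR".
            return false;
        }
    }
}
{code}

Under the bug described above, the broker's cached state never gets refreshed, so this check keeps failing in a loop until the broker is restarted.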

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>Assignee: Onur Karaman
>Priority: Major
> Fix For: 1.1.0
>
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2018-01-26 Thread Oleksiy Stashok (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341725#comment-16341725
 ] 

Oleksiy Stashok commented on KAFKA-2729:


[~ijuma] can you please provide more information on the issue you fixed? 
People here report issues that may or may not be related, so it would be good 
to understand what exactly you were able to reproduce and fix.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>Assignee: Onur Karaman
>Priority: Major
> Fix For: 1.1.0
>
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2018-01-22 Thread Martin Nowak (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334281#comment-16334281
 ] 

Martin Nowak commented on KAFKA-2729:
-

Confirmed that for me all occurrences of this issue were preceded by ZK session 
timeouts/expirations. Looking forward to having this finally fixed.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>Assignee: Onur Karaman
>Priority: Major
> Fix For: 1.1.0
>
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-10-26 Thread Francesco vigotti (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220284#comment-16220284
 ] 

Francesco vigotti commented on KAFKA-2729:
--

I may have found the cause of my issue, which may not be related to this 
topic, because in my case a simple broker restart didn't work. I've created a 
dedicated issue: https://issues.apache.org/jira/browse/KAFKA-6129


> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-10-13 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204367#comment-16204367
 ] 

Jun Rao commented on KAFKA-2729:


[~fravigotti], sorry to hear that. A couple of quick suggestions.

(1) Do you see any ZK session expiration in the log (e.g., INFO zookeeper state 
changed (Expired) (org.I0Itec.zkclient.ZkClient))? There are known bugs in 
Kafka's handling of ZK session expiration, so it's best to avoid expirations in 
the first place. Typical causes are long GC pauses in the broker or network 
glitches, so you can either tune the broker or increase 
zookeeper.session.timeout.ms.

(2) Do you have lots of partitions (say a few thousand) per broker? If so, 
check whether controlled shutdown succeeds when shutting down a broker. If it 
doesn't, restarting the broker too soon can leave the cluster in a weird 
state. To address this, you can increase request.timeout.ms on the broker.

We are fixing the known issue in (1) and improving the performance with lots of 
partitions in (2) in KAFKA-5642, and we expect the fix to be included in the 
1.1.0 release in Feb.
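
For reference, both knobs are broker-side settings in server.properties. The values below are only illustrative placeholders, not recommendations; tune them to your environment.

{code}
# (1) More headroom for the ZK session against long GC pauses or network
#     glitches (the default was 6000 ms in this era).
zookeeper.session.timeout.ms=30000

# (2) More time for controlled shutdown and other inter-broker requests to
#     complete on brokers hosting many partitions (default 30000 ms).
request.timeout.ms=60000
{code}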

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-10-13 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203772#comment-16203772
 ] 

Ismael Juma commented on KAFKA-2729:


If you're seeing this so often, then there's most likely a configuration 
problem. If you file a separate issue with all the logs (including GC logs) and 
configs (broker and ZK), maybe someone can help.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-10-13 Thread Francesco vigotti (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203749#comment-16203749
 ] 

Francesco vigotti commented on KAFKA-2729:
--

After losing 2 days on this, I reset the whole cluster: stopped all Kafka 
brokers, stopped the ZooKeeper cluster, deleted all directories, stopped all 
consumers and producers, then restarted everything and recreated the topics. 
And now guess what? :)

One node reports:
{code:java}

[2017-10-13 15:54:52,893] INFO Partition [__consumer_offsets,5] on broker 2: 
Expanding ISR for partition __consumer_offsets-5 from 10,13,2 to 10,13,2,5 
(kafka.cluster.Partition)
[2017-10-13 15:54:52,906] INFO Partition [__consumer_offsets,5] on broker 2: 
Cached zkVersion [13] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-10-13 15:54:52,908] INFO Partition [__consumer_offsets,25] on broker 2: 
Expanding ISR for partition __consumer_offsets-25 from 10,2,13 to 10,2,13,5 
(kafka.cluster.Partition)
[2017-10-13 15:54:52,915] INFO Partition [__consumer_offsets,25] on broker 2: 
Cached zkVersion [10] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-10-13 15:54:52,916] INFO Partition [__consumer_offsets,45] on broker 2: 
Expanding ISR for partition __consumer_offsets-45 from 10,13,2 to 10,13,2,5 
(kafka.cluster.Partition)
[2017-10-13 15:54:52,925] INFO Partition [__consumer_offsets,45] on broker 2: 
Cached zkVersion [15] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-10-13 15:54:52,926] INFO Partition [__consumer_offsets,5] on broker 2: 
Expanding ISR for partition __consumer_offsets-5 from 10,13,2 to 10,13,2,5 
(kafka.cluster.Partition)
[2017-10-13 15:54:52,936] INFO Partition [__consumer_offsets,5] on broker 2: 
Cached zkVersion [13] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-10-13 15:54:52,939] INFO Partition [__consumer_offsets,25] on broker 2: 
Expanding ISR for partition __consumer_offsets-25 from 10,2,13 to 10,2,13,5 
(kafka.cluster.Partition)
{code}

while the others report:


{code:java}
[2017-10-13 15:57:08,128] ERROR [ReplicaFetcherThread-0-2], Error for partition 
[__consumer_offsets,40] to broker 
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 15:57:09,129] ERROR [ReplicaFetcherThread-0-2], Error for partition 
[__consumer_offsets,40] to broker 
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 15:57:10,260] ERROR [ReplicaFetcherThread-0-2], Error for partition 
[__consumer_offsets,40] to broker 
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 15:57:11,262] ERROR [ReplicaFetcherThread-0-2], Error for partition 
[__consumer_offsets,40] to broker 
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 15:57:12,265] ERROR [ReplicaFetcherThread-0-2], Error for partition 
[__consumer_offsets,40] to broker 
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 15:57:13,289] ERROR [ReplicaFetcherThread-0-2], Error fo
{code}


The cluster is still inconsistent. I've also added 2 more nodes hoping to 
improve stability, but nothing changed. I don't know what's wrong; if Kafka 
does any kind of pre-flight checks during startup, it logs nothing. The only 
logs are the ones above, which make no sense, because the leader should be 
re-elected when there are ISRs available, and there are. 
I've started looking for alternative software to use; trying to use Kafka is 
so frustrating :(


> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition 

[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-10-13 Thread Francesco vigotti (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203331#comment-16203331
 ] 

Francesco vigotti commented on KAFKA-2729:
--

At the beginning of my cluster screw-up I got tons of zkVersion issues, which 
is why I posted here. But since the problem went away for you when you 
restarted your brokers, maybe my problem is different.
Kafka version: 0.10.2.1

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-10-13 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203244#comment-16203244
 ] 

Ismael Juma commented on KAFKA-2729:


[~fravigotti], none of your log messages seem to be about the zkVersion issue; 
is it really the same issue as this one? If not, you should file a separate 
JIRA including the Kafka version.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-10-13 Thread Francesco vigotti (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203198#comment-16203198
 ] 

Francesco vigotti commented on KAFKA-2729:
--

I'm having the same issue and am definitely losing trust in Kafka; every 2 
months there is something that forces me to reset the whole cluster. I've been 
searching for a good alternative for a distributed, persistent, fast queue for 
a while, but I have yet to find something that gives me a good vibe.

Anyway, I'm facing this same issue with some small differences:
- restarting all brokers (together and via rolling restart) didn't fix it.

All brokers in the cluster log errors like these:
--- broker 5 

{code:java}

[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition 
[__consumer_offsets,17] to broker 
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition 
[__consumer_offsets,23] to broker 
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition 
[__consumer_offsets,47] to broker 
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2017-10-13 08:13:57,429] ERROR [ReplicaFetcherThread-0-2], Error for partition 
[__consumer_offsets,29] to broker 
2:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is 
not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)

{code}

--- broker 3

{code:java}

[2017-10-13 08:13:58,547] INFO Partition [__consumer_offsets,20] on broker 3: 
Expanding ISR for partition __consumer_offsets-20 from 3,2 to 3,2,5 
(kafka.cluster.Partition)
[2017-10-13 08:13:58,551] INFO Partition [__consumer_offsets,44] on broker 3: 
Expanding ISR for partition __consumer_offsets-44 from 3,2 to 3,2,5 
(kafka.cluster.Partition)
[2017-10-13 08:13:58,554] INFO Partition [__consumer_offsets,5] on broker 3: 
Expanding ISR for partition __consumer_offsets-5 from 2,3 to 2,3,5 
(kafka.cluster.Partition)
[2017-10-13 08:13:58,557] INFO Partition [__consumer_offsets,26] on broker 3: 
Expanding ISR for partition __consumer_offsets-26 from 3,2 to 3,2,5 
(kafka.cluster.Partition)
[2017-10-13 08:13:58,563] INFO Partition [__consumer_offsets,29] on broker 3: 
Expanding ISR for partition __consumer_offsets-29 from 2,3 to 2,3,5 
(kafka.cluster.Partition)
[2017-10-13 08:13:58,566] INFO Partition [__consumer_offsets,32] on broker 3: 
Expanding ISR for partition __consumer_offsets-32 from 3,2 to 3,2,5 
(kafka.cluster.Partition)
[2017-10-13 08:13:58,570] INFO Partition [legacyJavaVarT,2] on broker 3: 
Expanding ISR for partition legacyJavaVarT-2 from 3 to 3,5 
(kafka.cluster.Partition)
[2017-10-13 08:13:58,573] INFO Partition [test4,3] on broker 3: Expanding ISR 
for partition test4-3 from 2,3 to 2,3,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,577] INFO Partition [test4,0] on broker 3: Expanding ISR 
for partition test4-0 from 3,2 to 3,2,5 (kafka.cluster.Partition)
[2017-10-13 08:13:58,582] INFO Partition [test3,5] on broker 3: Expanding ISR 
for partition test3-5 from 3 to 3,5 (kafka.cluster.Partition)

{code}


--- broker 2 

{code:java}

[2017-10-13 08:13:36,289] INFO Partition [__consumer_offsets,11] on broker 2: 
Expanding ISR for partition __consumer_offsets-11 from 2,5 to 2,5,3 
(kafka.cluster.Partition)
[2017-10-13 08:13:36,293] INFO Partition [__consumer_offsets,41] on broker 2: 
Expanding ISR for partition __consumer_offsets-41 from 2,5 to 2,5,3 
(kafka.cluster.Partition)
[2017-10-13 08:13:36,296] INFO Partition [test3,2] on broker 2: Expanding ISR 
for partition test3-2 from 2 to 2,3 (kafka.cluster.Partition)
[2017-10-13 08:13:36,300] INFO Partition [__consumer_offsets,23] on broker 2: 
Expanding ISR for partition __consumer_offsets-23 from 2,5 to 2,5,3 
(kafka.cluster.Partition)
[2017-10-13 08:13:36,304] INFO Partition [__consumer_offsets,5] on broker 2: 
Expanding ISR for partition __consumer_offsets-5 from 2,5 to 2,5,3 
(kafka.cluster.Partition)
[2017-10-13 08:13:36,337] INFO Partition [__consumer_offsets,35] on broker 2: 
Expanding ISR for partition __consumer_offsets-35 from 2,5 to 2,5,3 
(kafka.cluster.Partition)
[2017-10-13 08:13:36,372] INFO Partition [test_mainlog,24] on broker 2: 
Expanding ISR for partition test_mainlog-24 from 2 to 2,3 
(kafka.cluster.Partition)
[2017-10-13 08:13:36,375] INFO Partition [test_mainlog,6] on broker 2: 
Expanding ISR for partition test_mainlog-6 from 2 to 2,3 
(kafka.cluster.Partition)
[2017-10-13 08:13:36,379] INFO Partition [test_mainlog,18] on broker 2: 
Expanding ISR for partition test_mainlog-18 from 2 to 2,3 

[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-10-09 Thread sumit jain (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196853#comment-16196853
 ] 

sumit jain commented on KAFKA-2729:
---

Facing the same issue. Here's the question I asked on Stack Overflow: 
https://stackoverflow.com/questions/46644764/kafka-cached-zkversion-not-equal-to-that-in-zookeeper-broker-not-recovering

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1, 0.9.0.0, 0.10.0.0, 0.10.1.0, 0.11.0.0
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-08-29 Thread Peter Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16145648#comment-16145648
 ] 

Peter Davis commented on KAFKA-2729:


[~junrao] Per your previous 
[comment](https://issues.apache.org/jira/browse/KAFKA-2729?focusedCommentId=16107042=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16107042),
 is this issue definitely covered under 
[KAFKA-5027](https://issues.apache.org/jira/browse/KAFKA-5027) then?  It is not 
linked there.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-08-17 Thread Joseph Aliase (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131341#comment-16131341
 ] 

Joseph Aliase commented on KAFKA-2729:
--

This has happened to us twice in prod. A restart seems to be the only solution. 

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-07-31 Thread Dan (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106965#comment-16106965
 ] 

Dan commented on KAFKA-2729:


Happened in 0.11.0.0 as well. We had to restart the broker to bring it back to 
an operational state.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-06-22 Thread Jun Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060110#comment-16060110
 ] 

Jun Rao commented on KAFKA-2729:


[~timoha], we are trying to address the ZK session expiration issue in the 
controller improvement work under 
https://issues.apache.org/jira/browse/KAFKA-5027.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-06-22 Thread Andrey Elenskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060049#comment-16060049
 ] 

Andrey Elenskiy commented on KAFKA-2729:


Seeing the same issue on 0.10.2. 

A node running ZooKeeper lost networking for a split second and initiated an 
election, which caused some sessions to expire with:
{code}
2017-06-22 02:07:36,092 [myid:3] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@373] - Exception 
causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
running
{code}
which caused controller resignation:
{code}
[2017-06-22 02:07:36,363] INFO [SessionExpirationListener on 158980], ZK 
expired; shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: Controller resigning, 
broker id 158980 (kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] DEBUG [Controller 158980]: De-registering 
IsrChangeNotificationListener (kafka.controller.KafkaController)
[2017-06-22 02:07:37,028] INFO [Partition state machine on Controller 158980]: 
Stopped partition state machine (kafka.controller.PartitionStateMachine)
[2017-06-22 02:07:37,028] INFO [Replica state machine on controller 158980]: 
Stopped replica state machine (kafka.controller.ReplicaStateMachine)
[2017-06-22 02:07:37,028] INFO [Controller 158980]: Broker 158980 resigned as 
the controller (kafka.controller.KafkaController)
{code}
and after that we just kept getting this in the broker's server logs for the 
next 8 hours, until we manually restarted it:
{code}
[2017-06-22 17:41:06,928] INFO Partition [A,5] on broker 158980: Shrinking ISR 
for partition [A,5] from 158980,133641,155394 to 158980 
(kafka.cluster.Partition)
[2017-06-22 17:41:06,935] INFO Partition [A,5] on broker 158980: Cached 
zkVersion [73] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
{code}
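
The SessionExpirationListener in the logs above reacts to ZooKeeper's Expired event. As background, here is a generic sketch of handling that event with the plain ZooKeeper client (not Kafka's code): an expired session can never be revived, so the client has to build a new ZooKeeper handle and re-create its ephemeral znodes under the new session, which is the step the old controller logic handled incorrectly.

{code:java}
import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionExpirationSketch implements Watcher {

    private final String connectString;
    private volatile ZooKeeper zk;

    public SessionExpirationSketch(String connectString) throws IOException {
        this.connectString = connectString;
        this.zk = new ZooKeeper(connectString, 6000, this);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
            // An expired session cannot be reused: close the dead handle,
            // create a fresh one, and re-create ephemeral znodes (e.g. a
            // /controller candidacy) under the new session.
            try {
                zk.close();
                zk = new ZooKeeper(connectString, 6000, this);
                // re-register ephemerals / trigger re-election here
            } catch (IOException e) {
                // real code would retry with backoff
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
{code}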

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> For all of the topics on the effected brokers. Both brokers only recovered 
> after a restart. Our own investigation yielded nothing, I was hoping you 
> could shed some light on this issue. Possibly if it's related to: 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)