[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-06-08 Thread Pablo (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042543#comment-16042543 ]

Pablo commented on KAFKA-2729:
--

Guys, this issue is not only affecting 0.8.2.1, as many people here are 
saying. We had this same problem on 0.10.2 during an upgrade from 0.8.2. We 
worked around it by increasing the ZK session and connection timeouts, which 
worked fine, but we don't feel very safe.
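For reference, a minimal sketch of the server.properties settings involved 
(the values are illustrative, not the exact ones we used):

{code}
# server.properties -- illustrative values, tune for your environment
# how long ZK waits without heartbeats before expiring the broker's session
zookeeper.session.timeout.ms=30000
# how long the broker waits to establish the initial ZK connection
zookeeper.connection.timeout.ms=30000
{code}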

I suggest adding all the versions people here are reporting to the affected versions.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> ---
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.2.1
>Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other, 
> we started seeing a large number of undereplicated partitions. The zookeeper 
> cluster recovered, however we continued to see a large number of 
> undereplicated partitions. Two brokers in the kafka cluster were showing this 
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only 
> recovered after a restart. Our own investigation yielded nothing; I was hoping 
> you could shed some light on this issue. Possibly it's related to 
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using 
> 0.8.2.1.





[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-04-13 Thread Edoardo Comar (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967735#comment-15967735 ]

Edoardo Comar commented on KAFKA-2729:
--

FWIW - we saw the same message 
{{ Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition) }} 
when redeploying Kafka 0.10.0.1 in a cluster where we had previously run 
0.10.2.0, after wiping Kafka's storage but keeping ZooKeeper (the version 
bundled with Kafka 0.10.2) and its storage.

For us, the cluster eventually recovered.
HTH.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-04-13 Thread Jun Rao (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967705#comment-15967705 ]

Jun Rao commented on KAFKA-2729:


Thanks for the additional info. In both [~Ronghua Lin]'s and [~allenzhuyi]'s 
cases, it seems ZK session expiration had happened. As I mentioned earlier in 
the jira, there is a known issue, reported in KAFKA-3083, where when the 
controller's ZK session expires and it loses its controllership, it's possible 
for this zombie controller to continue updating ZK and/or sending 
LeaderAndIsrRequests to the brokers for a short period of time. When this 
happens, the broker may not have the most up-to-date information about leader 
and ISR, which can lead to subsequent ZK failures when the ISR needs to be 
updated.

It may take some time to get this issue fixed. In the interim, the workaround 
for this issue is to make sure ZK session expiration never happens. The first 
thing is to figure out what's causing the ZK session to expire. Two common 
causes are (1) long broker GCs and (2) network glitches. For (1), one needs to 
tune the GC in the broker properly. For (2), one can look at the reported time 
that the ZK client couldn't hear from the ZK server and increase the ZK session 
expiration time accordingly.
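As an illustration of (1), the broker start script picks up JVM options from 
the environment; a sketch with G1 settings (heap size and pause target here 
are illustrative, not a recommendation):

{code}
# Read by kafka-server-start.sh / kafka-run-class.sh
export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"
export KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20"
{code}

For (2), the broker log reports how long the ZK client went without hearing 
from the server ("Client session timed out, have not heard from server in 
...ms"), which gives a floor for the session timeout to configure.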



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-04-09 Thread allenzhuyi (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962085#comment-15962085 ]

allenzhuyi commented on KAFKA-2729:
---

We see the bug in kafka_2.12-0.10.2.0 when there is a long-latency ping 
timeout. We have 3 brokers; 2 brokers log the error below:
{code}
Partition [__consumer_offsets,9] on broker 3: Shrinking ISR for partition 
[__consumer_offsets,9] from 3,1,2 to 3,2 (kafka.cluster.Partition)
Partition [__consumer_offsets,9] on broker 3: Cached zkVersion [89] not equal 
to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
The third broker doesn't notice the problem; the producer continually reports 
timeout exceptions and fails to send messages, until finally the broker's ZK 
session expires and it re-registers its broker info in ZK. The system then 
recovers.
It is a serious bug. Please help us solve it quickly.
Thank you.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-03-29 Thread Sam Nguyen (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947500#comment-15947500 ]

Sam Nguyen commented on KAFKA-2729:
---

We ran into this today on kafka_2.11-0.10.0.1.

There is unexpected behavior with regard to partition availability. One out 
of 3 total brokers in our cluster entered this state (emitting "Cached 
zkVersion [140] not equal to that in zookeeper, skip updating ISR" errors).

We have our producer "required acks" config set to wait for all (-1), and 
min.insync.replicas set to 2. I would have expected to still be able to 
produce to the topic, but our producer (sarama) was getting timeouts. 
After restarting the broken broker, we were able to continue producing.

I confirmed that even after performing a graceful shutdown of 1 out of 3 
brokers, we are still able to produce, since we have 2 out of 3 brokers still 
alive to serve and acknowledge produce requests.
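For context, a minimal sketch of the same setup with the Java producer client 
(our actual producer is sarama; the broker addresses and topic name below are 
placeholders):

{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092") // placeholders
props.put("acks", "all") // same as required acks = -1: leader waits for the full ISR
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

// With the topic created with min.insync.replicas=2, a produce should still be
// acknowledged while 2 of the 3 replicas are in sync.
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("my-topic", "key", "value")).get()
producer.close()
{code}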



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-03-22 Thread Stephane Maarek (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937468#comment-15937468 ]

Stephane Maarek commented on KAFKA-2729:


If I may add: this is a pretty bad issue, but it gets worse. You not only have 
to recover Kafka, but also your Kafka Connect clusters. They got stuck for me 
in the following state:

{noformat}
[2017-03-23 00:06:05,478] INFO Marking the coordinator kafka-1:9092 (id: 
2147483626 rack: null) dead for group connect-MyConnector 
(org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2017-03-23 00:06:05,478] INFO Marking the coordinator kafka-1:9092 (id: 
2147483626 rack: null) dead for group connect-MyConnector 
(org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
{noformat}



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-03-20 Thread Ronghua Lin (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933991#comment-15933991 ]

Ronghua Lin commented on KAFKA-2729:


[~junrao], we also have this problem in a small cluster of 3 brokers, 
running Kafka 0.10.1.1. When it happened, the logs of each broker looked like 
this:
{code:title=broker 2 | borderStyle=solid}
[2017-03-20 01:03:48,903] INFO [Group Metadata Manager on Broker 2]: Removed 0 
expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2017-03-20 01:13:27,283] INFO Creating /controller (is it secure? false) 
(kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:27,293] INFO Result of znode creation is: OK 
(kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:27,294] INFO 2 successfully elected as leader 
(kafka.server.ZookeeperLeaderElector)
[2017-03-20 01:13:28,203] INFO re-registering broker info in ZK for broker 2 
(kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:28,205] INFO Creating /brokers/ids/2 (is it secure? false) 
(kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:28,218] INFO Result of znode creation is: OK 
(kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:28,219] INFO Registered broker 2 at path /brokers/ids/2 with 
addresses: PLAINTEXT -> EndPoint(x, ,PLAINTEXT) (kafka.utils.ZkUtils)
[2017-03-20 01:13:28,219] INFO done re-registering broker 
(kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:28,220] INFO Subscribing to /brokers/topics path to watch for 
new topics (kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:28,224] INFO New leader is 2 
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2017-03-20 01:13:28,227] INFO New leader is 2 
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2017-03-20 01:13:38,812] INFO Partition [topic1,1] on broker 2: Shrinking ISR 
for partition [topic1,1] from 0,2,1 to 2,1 (kafka.cluster.Partition)
[2017-03-20 01:13:38,825] INFO Partition [topic1,1] on broker 2: Cached 
zkVersion [6] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-03-20 01:13:38,825] INFO Partition [topic2,1] on broker 2: Shrinking ISR 
for partition [topic2,1] from 0,2,1 to 2,1 (kafka.cluster.Partition)
[2017-03-20 01:13:38,835] INFO Partition [topic2,1] on broker 2: Cached 
zkVersion [6] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-03-20 01:13:38,835] INFO Partition [topic3,0] on broker 2: Shrinking ISR 
for partition [topic3,0] from 0,2,1 to 2,1 (kafka.cluster.Partition)
[2017-03-20 01:13:38,847] INFO Partition [topic3,0] on broker 2: Cached 
zkVersion [6] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)

{code}

{code:title=broker 1 | borderStyle=solid}
[2017-03-20 01:03:38,255] INFO [Group Metadata Manager on Broker 1]: Removed 0 
expired offsets in 0 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2017-03-20 01:13:27,451] INFO New leader is 2 
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2017-03-20 01:13:27,490] INFO re-registering broker info in ZK for broker 1 
(kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:27,491] INFO Creating /brokers/ids/1 (is it secure? false) 
(kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:27,503] INFO Result of znode creation is: OK 
(kafka.utils.ZKCheckedEphemeral)
[2017-03-20 01:13:27,503] INFO Registered broker 1 at path /brokers/ids/1 with 
addresses: PLAINTEXT -> EndPoint(,,PLAINTEXT) (kafka.utils.ZkUtils)
[2017-03-20 01:13:27,504] INFO done re-registering broker 
(kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:27,504] INFO Subscribing to /brokers/topics path to watch for 
new topics (kafka.server.KafkaHealthcheck$SessionExpireListener)
[2017-03-20 01:13:27,508] INFO New leader is 2 
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2017-03-20 01:13:38,134] INFO Partition [__consumer_offsets,40] on broker 1: 
Shrinking ISR for partition [__consumer_offsets,40] from 1,0 to 1 
(kafka.cluster.Partition)
[2017-03-20 01:13:38,155] INFO Partition [__consumer_offsets,40] on broker 1: 
Cached zkVersion [2] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-03-20 01:13:38,156] INFO Partition [__consumer_offsets,0] on broker 1: 
Shrinking ISR for partition [__consumer_offsets,0] from 1,0 to 1 
(kafka.cluster.Partition)
[2017-03-20 01:13:38,161] INFO Partition [__consumer_offsets,0] on broker 1: 
Cached zkVersion [2] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-03-20 01:13:38,162] INFO Partition [__consumer_offsets,12] on broker 1: 
Shrinking ISR for partition [__consumer_offsets,12] from 1,0 to 1 
(kafka.cluster.Partition)
[2017-03-20 01:13:38,170] INFO Partition [__consumer_offsets,12] on broker 1: 
Cached zkVersion [2] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
{code}
Re: [jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-03-08 Thread Marcos Juarez
Jun,

I see that line elsewhere in the cluster.  I don't see it happening on that
particular broker that ran into the problem.

On Mon, Mar 6, 2017 at 5:02 PM, Jun Rao (JIRA)  wrote:

>
> [ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898415#comment-15898415 ]
>
> Jun Rao commented on KAFKA-2729:
> 
>
> [~mjuarez], did you see ZK session expiration in the server.log in the
> controller around that time? The log will look like the following.
>
> INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)


[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-03-06 Thread Jun Rao (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898415#comment-15898415 ]

Jun Rao commented on KAFKA-2729:


[~mjuarez], did you see ZK session expiration in the server.log in the 
controller around that time? The log will look like the following.

INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-03-01 Thread mjuarez (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891355#comment-15891355 ]

mjuarez commented on KAFKA-2729:


We are also running into this problem in our staging cluster, running Kafka 
0.10.0.1.  Basically it looks like this happened yesterday: 

{noformat}
[2017-02-28 18:41:33,513] INFO Client session timed out, have not heard from 
server in 7799ms for sessionid 0x159d7893eab0088, closing socket connection and 
attempting reconnect (org.apache.zookeeper.ClientCnxn)
{noformat}

I'm attributing that to a transient network issue, since we haven't seen any 
other issues.  And less than a minute later, we started seeing these errors:

{noformat}
[2017-02-28 18:42:45,739] INFO Partition 
[analyticsInfrastructure_KafkaAvroUserMessage,16] on broker 101: Shrinking ISR 
for partition [analyticsInfrastructure_KafkaAvroUserMessage,16] from 
102,101,105 to 101 (kaf
[2017-02-28 18:42:45,751] INFO Partition 
[analyticsInfrastructure_KafkaAvroUserMessage,16] on broker 101: Cached 
zkVersion [94] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-02-28 18:42:45,751] INFO Partition 
[qa_exporter11_slingshot_salesforce_invoice,6] on broker 101: Shrinking ISR for 
partition [qa_exporter11_slingshot_salesforce_invoice,6] from 101,105,104 to 
101 (kafka.clu
[2017-02-28 18:42:45,756] INFO Partition 
[qa_exporter11_slingshot_salesforce_invoice,6] on broker 101: Cached zkVersion 
[237] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-02-28 18:42:45,756] INFO Partition [GNRDEV_counters_singleCount,2] on 
broker 101: Shrinking ISR for partition [GNRDEV_counters_singleCount,2] from 
101,105,104 to 101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,761] INFO Partition [GNRDEV_counters_singleCount,2] on 
broker 101: Cached zkVersion [334] not equal to that in zookeeper, skip 
updating ISR (kafka.cluster.Partition)
[2017-02-28 18:42:45,761] INFO Partition [sod-spins-spark-local,1] on broker 
101: Shrinking ISR for partition [sod-spins-spark-local,1] from 101,103,104 to 
101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,764] INFO Partition [sod-spins-spark-local,1] on broker 
101: Cached zkVersion [379] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
[2017-02-28 18:42:45,764] INFO Partition [sod-spins-spark-local,11] on broker 
101: Shrinking ISR for partition [sod-spins-spark-local,11] from 102,101,105 to 
101 (kafka.cluster.Partition)
[2017-02-28 18:42:45,767] INFO Partition [sod-spins-spark-local,11] on broker 
101: Cached zkVersion [237] not equal to that in zookeeper, skip updating ISR 
(kafka.cluster.Partition)
{noformat}

The "current" server is 101.  So it thinks it's the leader for basically every 
partition on that node, but it's refusing to update the ISRs, because the 
cached zkversion doesn't match the one in zookeeper.  This is causing 
permanently under-replicated partitions, because server doesn't ever catch up, 
since it doesn't think there's a problem.  Also, the metadata reported by the 
101 server to consumers indicates it thinks it's part of the ISR, but every 
other broker doesn't think so.

Let me know if more logs/details would be helpful.  I'll try to fix this by 
restarting the node, and hopefully it fixes the issue.
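(Assuming a reachable ZooKeeper quorum, the topics tool can enumerate the 
affected partitions; the host below is a placeholder:)

{code}
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions
{code}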



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-23 Thread Jun Rao (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880817#comment-15880817 ]

Jun Rao commented on KAFKA-2729:


[~prasincs], if the controller is partitioned off from the other brokers and ZK, 
the expected flow is the following: (1) the ZK server detects that the old 
controller's session has expired; (2) the controller path is removed by ZK; 
(3) a new controller is elected and changes leaders/ISRs; (4) the network comes 
back on the old controller; (5) the old controller receives the ZK session 
expiration event; (6) the old controller stops doing controller work and 
resigns. Note that the old controller doesn't really know that it's no longer 
the controller until step (5). The gap we have now is that step (6) is not 
done in a timely fashion.
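To make step (5) concrete, here is a rough sketch of how a component observes 
its own session expiring with the zkclient library Kafka uses 
(org.I0Itec.zkclient); this assumes the zkclient 0.8+ listener interface and 
is not Kafka's actual controller code:

{code}
import org.I0Itec.zkclient.{IZkStateListener, ZkClient}
import org.apache.zookeeper.Watcher.Event.KeeperState

// ZkClient(zkServers, sessionTimeoutMs, connectionTimeoutMs); address is a placeholder.
val zkClient = new ZkClient("zk1:2181", 6000, 6000)
zkClient.subscribeStateChanges(new IZkStateListener {
  override def handleStateChanged(state: KeeperState): Unit = {
    if (state == KeeperState.Expired) {
      // Step (5): only now does the old controller learn its session is gone.
      // Step (6), resigning controller duties, should follow immediately.
    }
  }
  override def handleNewSession(): Unit = {
    // Fresh session: ephemeral nodes (e.g. /controller, /brokers/ids/N)
    // have been dropped by ZK and must be re-created.
  }
  override def handleSessionEstablishmentError(error: Throwable): Unit = ()
})
{code}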

Are you deploying Kafka in the same data center? What kind of network 
partitions are you seeing? Typically, we expect network partitions to be rare 
within the same data center. If there are short network glitches, one temporary 
fix is to increase the ZK session timeout to accommodate them until the 
network issue is fixed.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-22 Thread Prasanna Gautam (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879885#comment-15879885 ]

Prasanna Gautam commented on KAFKA-2729:


[~junrao] Thanks for looking into this. Do you mind elaborating on what needs 
to change in the ZK API and whether 
https://issues.apache.org/jira/browse/KAFKA-3083 is going to solve it? The 
issue here is that network partitions, which are unrelated to the 3 points and 
can happen at any time, can trigger this and leave the brokers in a messed-up 
state until a restart. Can this be fixed by handling the ZK connection errors?

If restarting the broker is the only fix, maybe the proper thing to do is to 
crash and let a supervisor, etc. restart the service?



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-22 Thread Jun Rao (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879700#comment-15879700 ]

Jun Rao commented on KAFKA-2729:


Sorry to hear about the impact to production. Grant mentioned ZK session 
expiration, which is indeed a potential cause of this issue. A related issue 
has been reported in KAFKA-3083. The issue is that when the controller's ZK 
session expires and it loses its controllership, it's possible for this zombie 
controller to continue updating ZK and/or sending LeaderAndIsrRequests to the 
brokers for a short period of time. When this happens, the broker may not have 
the most up-to-date information about leader and ISR, which can lead to 
subsequent ZK failures when the ISR needs to be updated.

Fixing this issue requires us to change the way we use the ZK API and may take 
some time. In the interim, one suggestion is to make sure ZK session expiration 
never happens. This can be achieved by making sure that (1) the ZK servers are 
performing well, (2) the brokers don't have long GCs, and (3) the ZK session 
expiration time is large enough.
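To illustrate (3), note that the ZK server bounds the session timeout a client 
can negotiate; a sketch of the relevant zoo.cfg settings (values are 
illustrative):

{code}
# zoo.cfg -- illustrative values
tickTime=2000
# By default ZK clamps negotiated session timeouts to [2*tickTime, 20*tickTime];
# raise maxSessionTimeout if brokers should be allowed a longer session.
maxSessionTimeout=60000
{code}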



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-22 Thread Kane Kim (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879554#comment-15879554 ]

Kane Kim commented on KAFKA-2729:
-

In my opinion it doesn't matter what's causing it (in our case it was indeed 
lost packets to ZooKeeper); the culprit is that brokers will not recover by 
themselves until a rolling restart. This is a real problem and has to be fixed.




[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-22 Thread Prateek Jaipuria (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879358#comment-15879358 ]

Prateek Jaipuria commented on KAFKA-2729:
-

[~granthenke] We don't see any zookeeper disconnections.

Just
{code}
INFO Partition [topic,n] on broker m: Cached zkVersion [xxx] not equal to that 
in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-22 Thread Dave Thomas (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879213#comment-15879213 ]

Dave Thomas commented on KAFKA-2729:


[~granthenke] We don't see brokers recovering.  The message we see is:
{noformat}
Cached zkVersion [xxx] not equal to that in zookeeper, skip updating ISR
{noformat}





[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-22 Thread Grant Henke (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879124#comment-15879124 ]

Grant Henke commented on KAFKA-2729:


I am curious whether everyone on this Jira is actually seeing the reported 
issue. I have had multiple cases where someone presented me with an environment 
they thought was experiencing this issue. After researching the environment and 
logs, to date it has always been something else.

The main culprits so far have been:
* Long GC pauses causing ZooKeeper sessions to time out
* A slow or poorly configured ZooKeeper
* Bad network configuration

All of the above resulted in a soft, recurring failure of brokers. That churn 
often caused additional load, perpetuating the issue.

If you are seeing this issue, do you see the following pattern repeating in the 
logs?
{noformat}
INFO org.I0Itec.zkclient.ZkClient: zookeeper state changed (Disconnected)
...
INFO org.I0Itec.zkclient.ZkClient: zookeeper state changed (Expired)
INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, 
session 0x153ab38abdbd360 has expired, closing socket connection
...
INFO org.I0Itec.zkclient.ZkClient: zookeeper state changed (SyncConnected)
INFO kafka.server.KafkaHealthcheck: re-registering broker info in ZK for broker 
32
INFO kafka.utils.ZKCheckedEphemeral: Creating /brokers/ids/32 (is it secure? 
false)
INFO kafka.utils.ZKCheckedEphemeral: Result of znode creation is: OK
{noformat}

If so, something is causing communication with ZooKeeper to take too long, and 
the broker is unregistering itself. This will cause ISRs to shrink and expand 
over and over again.

I don't think this will solve everyone's issue here, but hopefully it will help 
solve some.





[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-22 Thread Prateek Jaipuria (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879076#comment-15879076 ]

Prateek Jaipuria commented on KAFKA-2729:
-

Having the same issue with 0.10.1.0 on an 8-node cluster. Restarting the node 
also does not help; the problem just moves on to another node. This is becoming 
a deal breaker. Definitely losing trust in Kafka. Definitely a BLOCKER!



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-16 Thread JIRA

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869703#comment-15869703 ]

Vladimír Kleštinec commented on KAFKA-2729:
---

[~elevy] Agree, we are experiencing the same issue; this is a real blocker and 
we are losing trust in Kafka...



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-15 Thread Elias Levy (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868971#comment-15868971 ]

Elias Levy commented on KAFKA-2729:
---

Hit this again during testing with 0.10.0.1 on a 10 node broker cluster with a 
3 node ZK ensemble.  This should have priority Blocker instead of Major.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-14 Thread Prasanna Gautam (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865486#comment-15865486 ]

Prasanna Gautam commented on KAFKA-2729:


This is still reproducible in Kafka 0.10.1.1 when the Kafka brokers are 
partitioned from each other and ZooKeeper gets disconnected from the brokers 
briefly and comes back. This situation leads to brokers getting stuck comparing 
the cached zkVersion, unable to expand the ISR.

The code in Partition.scala does not seem to handle error conditions other 
than the stale zkVersion. In addition to skipping in the current loop, I think 
it should reconnect to ZooKeeper to update the current state and version.

Here's a suggestion for doing this; doing it asynchronously doesn't break the 
flow, and you can update the state. zkVersion may not be the only thing to 
update here.

{code}
// Existing logic: conditionally update the leader-and-ISR znode using the
// cached zkVersion.
val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch,
  newIsr.map(r => r.brokerId).toList, zkVersion)
val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkUtils,
  topic, partitionId, newLeaderAndIsr, controllerEpoch, zkVersion)

if (updateSucceeded) {
  replicaManager.recordIsrChange(new TopicAndPartition(topic, partitionId))
  inSyncReplicas = newIsr
  zkVersion = newVersion
  trace("ISR updated to [%s] and zkVersion updated to [%d]"
    .format(newIsr.mkString(","), zkVersion))
} else {
  info("Cached zkVersion [%d] not equal to that in zookeeper, skip updating ISR"
    .format(zkVersion))
  // Proposed addition: instead of leaving the cached version stale, refresh it
  // from ZooKeeper asynchronously. asyncUpdateTopicPartitionVersion is a new
  // helper that would have to be written.
  zkVersion = asyncUpdateTopicPartitionVersion(topic, partitionId)
}
{code}



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-02-10 Thread JIRA

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860948#comment-15860948 ]

Sinóros-Szabó Péter commented on KAFKA-2729:


Do you have any plan to resolve this? Or is there a workaround for this issue?



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-01-26 Thread Dave Thomas (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840494#comment-15840494 ]

Dave Thomas commented on KAFKA-2729:


Same with us, on 0.10.1.1 (following an upgrade from 0.10.1.0, where we saw 
the same issue).



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2017-01-17 Thread JIRA

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826905#comment-15826905
 ] 

Sinóros-Szabó Péter commented on KAFKA-2729:


Same issue on 0.10.1.1. Do you need logs? I can collect them next time I see 
this.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-11-23 Thread derek (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15690420#comment-15690420
 ] 

derek commented on KAFKA-2729:
--

I'm on 0.10.1.0 and seeing the same thing. Maybe related to what [~cmolter] is saying above: what we see in the logs just prior to a broker becoming under-replicated is a flurry of

{noformat}
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
{noformat}

messages. After that we see a bunch of activity around adding and removing 
fetchers, then it goes into the infinite ISR shrink loop. The only way we can 
recover is to restart.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-09-06 Thread Charly Molter (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15467691#comment-15467691
 ] 

Charly Molter commented on KAFKA-2729:
--

Hi,

We had this issue on a test cluster, so I took some time to investigate further.

We had a bunch of disconnections to Zookeeper and we had 2 changes of 
controller in a short time.

Broker 103 was controller with epoch 44
Broker 104 was controller with epoch 45

I looked at one specific partition and found the following pattern:

101 was the broker which thought it was the leader but kept failing to shrink the ISR with:
{noformat}
Partition [verifiable-test-topic,0] on broker 101: Shrinking ISR for partition [verifiable-test-topic,0] from 101,301,201 to 101,201
Partition [verifiable-test-topic,0] on broker 101: Cached zkVersion [185] not equal to that in zookeeper, skip updating ISR
{noformat}

Looking at ZK we have:
{noformat}
get /brokers/topics/verifiable-test-topic/partitions/0/state
{"controller_epoch":44,"leader":301,"version":1,"leader_epoch":96,"isr":[301]}
{noformat}

And the metadata (queried from a random broker) says:
{noformat}
Topic: verifiable-test-topic  Partition: 0  Leader: 301  Replicas: 101,201,301  Isr: 301
{noformat}
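
If anyone wants to make the same cross-check programmatically, here is a minimal Scala sketch (hostname, timeout, topic, and object name are all illustrative) that reads the partition-state znode and prints the zkVersion the broker's cached value is compared against:
{code}
import org.apache.zookeeper.ZooKeeper
import org.apache.zookeeper.data.Stat

object PartitionStateCheck extends App {
  // "zk1:2181" is a placeholder; 30s session timeout, no default watcher.
  val zk = new ZooKeeper("zk1:2181", 30000, null)
  val path = "/brokers/topics/verifiable-test-topic/partitions/0/state"
  val stat = new Stat
  val data = zk.getData(path, false, stat)
  // stat.getVersion is the znode version; the broker's ISR update is a
  // conditional setData against this version, and it fails with the
  // "Cached zkVersion [...] not equal" message once the cached value is stale.
  println(s"${new String(data, "UTF-8")} (zkVersion=${stat.getVersion})")
  zk.close()
}
{code}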

Digging in the logs here’s what we think happened:

1. 103 sends becomeFollower to 301 with epoch 44 and leaderEpoch 95
2. 104 sends becomeLeader to 101 with epoch 45 and leaderEpoch 95 (after updating zk!)
3. 103 sends becomeLeader to 301 with epoch 44 and leaderEpoch 96 (after updating zk!)
4. 104 sends becomeFollower to 301 with epoch 45 and leaderEpoch 95

Request 4 is ignored by 301, as its leaderEpoch is older than the current one.

We are missing a request: 103 should have sent becomeFollower to 101 with epoch 44 and leaderEpoch 95, but that request was never delivered.

I believe this happened because, when the controller steps down, it empties its request queue, so this request never left the controller:
https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/ControllerChannelManager.scala#L53-L57
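
To make that failure mode concrete, here is a minimal sketch (illustrative Scala, not Kafka's actual code) of the pattern those lines implement: each broker has a queue of pending state-change requests, and shutting the send thread down discards whatever is still queued:
{code}
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical request type, for illustration only.
case class StateChangeRequest(targetBroker: Int, description: String)

class BrokerSendThread {
  private val queue = new LinkedBlockingQueue[StateChangeRequest]()

  def enqueue(req: StateChangeRequest): Unit = queue.put(req)

  // When the controller resigns, the send thread is stopped and the queue is
  // cleared; anything still queued never reaches its broker -- e.g. the
  // missing "becomeFollower to 101 (epoch 44, leaderEpoch 95)" above.
  def shutdown(): Unit = queue.clear()
}
{code}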

So we ended up in a case where 301 and 101 both think they are the leader. Obviously 101 wants to update the state in ZK to remove 301, as 301 is not even fetching from 101.

Does this seem correct to you?

It seems impossible to guarantee that there is no controller overlap, which could make it quite hard to avoid having 2 leaders for a short time. There should, though, be a way for this situation to get back to a good state.

I believe the impact of this would be:
- writes with acks = -1: unavailability
- writes with acks != -1: possible log divergence, depending on min in-sync replicas (I'm unsure about this).

Hope this helps. While I had to fix the cluster by bouncing a node, I kept most of the logs, so let me know if you need more info.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-08-08 Thread Kane Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412359#comment-15412359
 ] 

Kane Kim commented on KAFKA-2729:
-

For us the reason was a high percentage of lost packets to one of the ZK nodes (from broker to ZK). After we fixed that, the situation got a lot better.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-08-08 Thread Michael Sandrof (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15412126#comment-15412126
 ] 

Michael Sandrof commented on KAFKA-2729:


We seem to be having a similar problem running 0.10.0.0. However, no amount of broker restarting corrects the problem. Once it happens, I see periodic "Cached zkVersion" messages along with complete instability in the ISRs: continuous shrinking and expanding of the ISRs that makes the cluster unusable, as we need 2 in-sync replicas for our durability requirements.

The only thing that fixes the problem is to delete all topics, recreate and 
reload. This isn't a practical approach for our production system in which we 
are using Kafka as a transactionally consistent replica of a relational 
database.

Anyone have any clues about how to prevent this from happening?
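
Not a fix, but one mitigation reported elsewhere in this thread is raising the broker's ZooKeeper timeouts, so that short network blips are less likely to expire the session and trigger the controller churn in the first place. An illustrative server.properties sketch (values are examples only; the default in these versions is 6000 ms):
{code}
# Illustrative only -- raising these trades slower failure detection
# for fewer spurious ZooKeeper session expirations.
zookeeper.session.timeout.ms=30000
zookeeper.connection.timeout.ms=30000
{code}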



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-08-01 Thread Joshua Dickerson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402900#comment-15402900
 ] 

Joshua Dickerson commented on KAFKA-2729:
-

This has bitten us twice in our live environment, on 0.9.0.1.
Restarting the affected broker(s) is the only thing that seems to fix it.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-07-29 Thread Konstantin Zadorozhny (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399928#comment-15399928
 ] 

Konstantin Zadorozhny commented on KAFKA-2729:
--

Seeing the same issue in our staging and production environments on 0.9.0.1. Bouncing brokers helps, but it's still not ideal.

The staging cluster was left to "recover" for a day. It didn't.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-07-27 Thread James Carnegie (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395915#comment-15395915
 ] 

James Carnegie commented on KAFKA-2729:
---

That's our experience, though the only other thing we've tried is leaving it 
for a while.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-07-27 Thread William Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395821#comment-15395821
 ] 

William Yu commented on KAFKA-2729:
---

We are also seeing this in our production cluster, running Kafka 0.9.0.1.

Is restarting the only solution?

{code}
[2016-07-27 14:36:15,807] INFO Partition [tasks,265] on broker 4: Cached zkVersion [182] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-07-27 14:36:15,807] INFO Partition [tasks,150] on broker 4: Shrinking ISR for partition [tasks,150] from 6,4,7 to 4 (kafka.cluster.Partition)
{code}



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-07-13 Thread Tyler Bischel (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375878#comment-15375878
 ] 

Tyler Bischel commented on KAFKA-2729:
--

We are also seeing this issue in 0.10.0.0 pretty much daily right now.
{code}
[2016-07-13 21:30:50,170]  1292384 [kafka-scheduler-0] INFO  kafka.cluster.Partition  - Partition [events,580] on broker 10432234: Cached zkVersion [1267] not equal to that in zookeeper, skip updating ISR
{code}



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-07-07 Thread Chris Rodier (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366237#comment-15366237
 ] 

Chris Rodier commented on KAFKA-2729:
-

We also observed this identical issue on 0.9.0.1 today. As a workaround, restarting the failed broker resolved the issue without difficulty. This seems like a high-priority issue, since you could lose nodes and/or lose a cluster fairly easily due to zookeeper instability / elections.





[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-05-30 Thread Joel Pfaff (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306654#comment-15306654
 ] 

Joel Pfaff commented on KAFKA-2729:
---

We have hit that as well on 0.9.0.1 today, with the same logs, and only a reboot of the faulty broker resolved the problem.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-04-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263678#comment-15263678
 ] 

Stig Rohde Døssing commented on KAFKA-2729:
---

We hit this on 0.9.0.1 today
{code}
[2016-04-28 19:18:22,834] INFO Partition [dce-data,13] on broker 3: Shrinking ISR for partition [dce-data,13] from 3,2 to 3 (kafka.cluster.Partition)
[2016-04-28 19:18:22,845] INFO Partition [dce-data,13] on broker 3: Cached zkVersion [304] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-04-28 19:18:32,785] INFO Partition [dce-data,16] on broker 3: Shrinking ISR for partition [dce-data,16] from 3,2 to 3 (kafka.cluster.Partition)
[2016-04-28 19:18:32,803] INFO Partition [dce-data,16] on broker 3: Cached zkVersion [312] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}
which continued until we rebooted broker 3. The ISR at this time in Zookeeper 
had only broker 2, and there was no leader for the affected partitions. I 
believe the preferred leader for these partitions was 3.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-04-27 Thread Kane Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261189#comment-15261189
 ] 

Kane Kim commented on KAFKA-2729:
-

The same problem with the same symptoms occurred on kafka 0.8.2.1. After a network glitch, brokers fall out of the ISR set with
{noformat}
Cached zkVersion [5] not equal to that in zookeeper, skip updating ISR
{noformat}
The broker never recovers from this state until a restart.





[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-02-28 Thread Michal Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171252#comment-15171252
 ] 

Michal Harish commented on KAFKA-2729:
--

Hit this on Kafka 0.8.2.2 as well



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-02-23 Thread Petri Lehtinen (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158771#comment-15158771
 ] 

Petri Lehtinen commented on KAFKA-2729:
---

This happened to me (again) a few days ago on 0.9.0.0 on a cluster of 2 kafka 
nodes.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-02-17 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151423#comment-15151423
 ] 

James Cheng commented on KAFKA-2729:


We ran into the same issue today, when running 0.9.0.0.

{code}
[2016-02-17 22:49:52,638] INFO Partition [the.topic.name,22] on broker 2: Cached zkVersion [5] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}




[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-02-04 Thread Elias Levy (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133716#comment-15133716
 ] 

Elias Levy commented on KAFKA-2729:
---

Had the same issue happen here while testing a 5 node Kafka cluster with a 3 node ZK ensemble on Kubernetes on AWS. After running for a while, broker 2 started showing the "Cached zkVersion [29] not equal to that in zookeeper, skip updating ISR" error message for all the partitions it leads. For those partitions it is the only in-sync replica. That has caused the Samza jobs I was running to stop.



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2016-01-13 Thread Andres Gomez Ferrer (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095901#comment-15095901
 ] 

Andres Gomez Ferrer commented on KAFKA-2729:


Our kafka cluster hit the same issue too:

{code}
[2016-01-12 01:16:15,907] INFO Partition [__consumer_offsets,10] on broker 0: Shrinking ISR for partition [__consumer_offsets,10] from 0,1 to 0 (kafka.cluster.Partition)
[2016-01-12 01:16:15,909] INFO Partition [__consumer_offsets,10] on broker 0: Cached zkVersion [3240] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-01-12 01:16:15,909] INFO Partition [__consumer_offsets,45] on broker 0: Shrinking ISR for partition [__consumer_offsets,45] from 0,1 to 0 (kafka.cluster.Partition)
[2016-01-12 01:16:15,911] INFO Partition [__consumer_offsets,45] on broker 0: Cached zkVersion [3192] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2016-01-12 01:16:15,911] INFO Partition [__consumer_offsets,24] on broker 0: Shrinking ISR for partition [__consumer_offsets,24] from 0,1 to 0 (kafka.cluster.Partition)
[2016-01-12 01:16:15,912] INFO Partition [__consumer_offsets,24] on broker 0: Cached zkVersion [3233] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{code}

Kafka version is 0.8.2.2



[jira] [Commented] (KAFKA-2729) Cached zkVersion not equal to that in zookeeper, broker not recovering.

2015-11-24 Thread Iskandarov Eduard (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025008#comment-15025008
 ] 

Iskandarov Eduard commented on KAFKA-2729:
--

Our kafka cluster hit the same issue:
{noformat}
kafka2 1448388319:093 [2015-11-24 21:05:19,387] INFO Partition [dstat_wc_cpl_log,13] on broker 2: Shrinking ISR for partition [dstat_wc_cpl_log,13] from 2,1 to 2 (kafka.cluster.Partition)
kafka2 1448388319:094 [2015-11-24 21:05:19,404] INFO Partition [dstat_wc_cpl_log,13] on broker 2: Cached zkVersion [332] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
{noformat}

We use confluent.io's kafka distribution.
Kafka version is 0.8.2.2.
