[ 
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880817#comment-15880817
 ] 

Jun Rao commented on KAFKA-2729:
--------------------------------

[~prasincs], if the controller is partitioned off from the other brokers and
ZK, the expected flow is the following: (1) the ZK server detects that the old
controller's session has expired; (2) the controller path is removed by ZK;
(3) a new controller is elected and changes leaders/ISRs; (4) the network comes
back on the old controller; (5) the old controller receives the ZK session
expiration event; (6) the old controller stops performing controller duties and
resigns. Note that the old controller doesn't really know that it's no longer
the controller until step (5). The gap we have now is that step (6) is not done
in a timely fashion.
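
The six steps above can be traced with a small toy model. This is only a
sketch for illustration, not Kafka or ZooKeeper code: the names Broker, ToyZk,
stale_controllers and run_flow are hypothetical, and the model tracks nothing
beyond who owns the controller path and who still believes it is the
controller.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Broker:
    """Hypothetical broker; only tracks its own view of controllership."""
    broker_id: int
    believes_controller: bool = False
    saw_expiration: bool = False

    def on_session_expired(self) -> None:
        # Step (5): the expiration event arrives only after reconnecting.
        self.saw_expiration = True

    def resign(self) -> None:
        # Step (6): stop acting as controller.
        self.believes_controller = False


class ToyZk:
    """Hypothetical stand-in for ZooKeeper; tracks only the controller path."""

    def __init__(self) -> None:
        self.controller_id: Optional[int] = None

    def elect(self, broker: Broker) -> None:
        if self.controller_id is not None:
            raise RuntimeError("controller path already exists")
        self.controller_id = broker.broker_id
        broker.believes_controller = True

    def expire_session(self, broker: Broker) -> None:
        # Steps (1)-(2): server-side expiry removes the controller path.
        # The partitioned broker is not notified at this point.
        if self.controller_id == broker.broker_id:
            self.controller_id = None


def stale_controllers(zk: ToyZk, brokers: List[Broker]) -> List[int]:
    """Brokers that believe they are controller but do not own the path."""
    return [b.broker_id for b in brokers
            if b.believes_controller and b.broker_id != zk.controller_id]


def run_flow() -> List[Tuple[str, List[int]]]:
    old, new = Broker(1), Broker(2)
    brokers = [old, new]
    zk = ToyZk()
    zk.elect(old)  # initial state: broker 1 is the controller

    timeline = []
    zk.expire_session(old)                       # steps (1) and (2)
    timeline.append(("1-2", stale_controllers(zk, brokers)))
    zk.elect(new)                                # step (3)
    timeline.append(("3", stale_controllers(zk, brokers)))
    # Step (4): the network heals; nothing changes in the old broker's view.
    timeline.append(("4", stale_controllers(zk, brokers)))
    old.on_session_expired()                     # step (5)
    timeline.append(("5", stale_controllers(zk, brokers)))
    old.resign()                                 # step (6)
    timeline.append(("6", stale_controllers(zk, brokers)))
    return timeline


if __name__ == "__main__":
    for step, stale in run_flow():
        print(step, stale)
```

In this model broker 1 stays in the stale list through step (5) and only
leaves it at step (6), which is the window the gap refers to.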

Are you deploying Kafka within a single data center? What kind of network
partitions are you seeing? Typically, we expect network partitions to be rare
within the same data center. If there are short network glitches, one temporary
fix is to increase the ZK session timeout to accommodate them until the network
issue is fixed.
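
As a sketch of that temporary fix, the session timeout is set on each broker
through zookeeper.session.timeout.ms in server.properties. The value below is
purely illustrative and is not a recommendation; pick one that covers the
glitches you actually observe.

```properties
# server.properties (broker side); 30000 is an illustrative value only.
# Note: the ZK server bounds the negotiated timeout by its own
# minSessionTimeout/maxSessionTimeout settings, so a very large value
# here may be capped.
zookeeper.session.timeout.ms=30000
```

A longer timeout also means a genuinely failed broker takes longer to be
detected, which is why this is only a stopgap.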

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other,
> we started seeing a large number of under-replicated partitions. The zookeeper
> cluster recovered; however, we continued to see a large number of
> under-replicated partitions. Two brokers in the kafka cluster were showing
> this in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for 
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition 
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] 
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers
> only recovered after a restart. Our own investigation yielded nothing, so I
> was hoping you could shed some light on this issue. It may be related to
> https://issues.apache.org/jira/browse/KAFKA-1382 ; however, we're using
> 0.8.2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
