[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880817#comment-15880817 ]
Jun Rao commented on KAFKA-2729:
--------------------------------

[~prasincs], if the controller is partitioned off from the other brokers and ZK, the expected flow is the following:

(1) the ZK server detects that the old controller's session has expired;
(2) the controller path is removed by ZK;
(3) a new controller is elected and changes leaders/ISRs;
(4) the network comes back on the old controller;
(5) the old controller receives the ZK session expiration event;
(6) the old controller stops doing the controller work and resigns.

Note that the old controller doesn't really know that it's no longer the controller until step (5). The gap we have now is that step (6) is not done in a timely fashion.

Are you deploying Kafka in the same data center? What kind of network partitions are you seeing? Typically, we expect network partitions to be rare within the same data center. If there are short network glitches, one temporary fix is to increase the ZK session timeout to accommodate them until the network issue is fixed.

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other,
> we started seeing a large number of under-replicated partitions. The zookeeper
> cluster recovered, however we continued to see a large number of
> under-replicated partitions. Two brokers in the kafka cluster were showing this
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition
> [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for
> partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5
> (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition
> [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66]
> not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only
> recovered after a restart. Our own investigation yielded nothing; I was hoping
> you could shed some light on this issue. Possibly it's related to
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using
> 0.8.2.1.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
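
The six-step flow in the comment can be sketched as a toy timeline. This is only an illustration of the ordering, not Kafka or ZooKeeper code: the `Broker` and `Zk` classes, their fields, and `simulate` are all hypothetical names invented for this sketch. It records, after each step, who ZK considers the controller and whether the old controller still believes it holds that role.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Broker:
    """Hypothetical stand-in for a broker's local view of controllership."""
    broker_id: int
    believes_controller: bool = False


@dataclass
class Zk:
    """Hypothetical stand-in for the ZK controller path (not a real ZK client)."""
    controller_id: Optional[int] = None
    expired_sessions: Set[int] = field(default_factory=set)

    def expire_session(self, broker_id: int) -> None:
        # Steps (1) and (2): the session expires and the controller path
        # owned by that session is removed.
        self.expired_sessions.add(broker_id)
        if self.controller_id == broker_id:
            self.controller_id = None

    def elect(self, broker: Broker) -> None:
        # Step (3): a broker claims the controller path if it is empty.
        if self.controller_id is None:
            self.controller_id = broker.broker_id
            broker.believes_controller = True


def simulate() -> List[Tuple[str, Optional[int], bool]]:
    zk = Zk()
    old, new = Broker(1), Broker(2)
    zk.elect(old)
    timeline: List[Tuple[str, Optional[int], bool]] = []

    def record(step: str) -> None:
        timeline.append((step, zk.controller_id, old.believes_controller))

    record("start")
    zk.expire_session(old.broker_id)  # (1), (2): old controller is unaware
    record("expired")
    zk.elect(new)                     # (3): new controller takes over
    record("elected")
    # (4) the network returns; (5) the old controller learns of the expiration
    notified = old.broker_id in zk.expired_sessions
    record("notified")
    if notified:
        old.believes_controller = False  # (6): old controller resigns
    record("resigned")
    return timeline


if __name__ == "__main__":
    for step, zk_controller, old_believes in simulate():
        print(step, zk_controller, old_believes)
```

In this model, every row between "expired" and "notified" has ZK pointing away from broker 1 while broker 1 still believes it is the controller. That window is what the comment describes: the old controller cannot know better before step (5), and any delay in step (6) extends it.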