[ https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865486#comment-15865486 ]
Prasanna Gautam commented on KAFKA-2729:
----------------------------------------

This is still reproducible in Kafka 0.10.1.1 when the Kafka brokers are partitioned from each other and zookeeper briefly disconnects from the brokers and then comes back. In that situation the brokers get stuck comparing the cached zkVersion and are unable to expand the ISR. The code in Partition.scala does not seem to handle error conditions other than the stale zkVersion. In addition to skipping the current loop iteration, I think it should reconnect to zookeeper to refresh the current state and version. Here is a suggestion for doing this: performing the update asynchronously does not break the flow, and the state can be brought up to date. zkVersion may not be the only thing that needs updating here.

{code}
val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch, newIsr.map(r => r.brokerId).toList, zkVersion)
val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkUtils, topic, partitionId,
  newLeaderAndIsr, controllerEpoch, zkVersion)
if (updateSucceeded) {
  replicaManager.recordIsrChange(new TopicAndPartition(topic, partitionId))
  inSyncReplicas = newIsr
  zkVersion = newVersion
  trace("ISR updated to [%s] and zkVersion updated to [%d]".format(newIsr.mkString(","), zkVersion))
} else {
  info("Cached zkVersion [%d] not equal to that in zookeeper, skip updating ISR".format(zkVersion))
  // Proposed helper: asynchronously re-read the partition state from zookeeper
  // so the cached zkVersion catches up instead of staying stale forever
  zkVersion = asyncUpdateTopicPartitionVersion(topic, partitionId)
}
{code}

> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-2729
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2729
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Danil Serdyuchenko
>
> After a small network wobble where zookeeper nodes couldn't reach each other,
> we started seeing a large number of under-replicated partitions.
> The zookeeper cluster recovered, however we continued to see a large number of
> under-replicated partitions. Two brokers in the kafka cluster were showing this
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only
> recovered after a restart. Our own investigation yielded nothing; I was hoping
> you could shed some light on this issue. Possibly it's related to
> https://issues.apache.org/jira/browse/KAFKA-1382 , however we're using
> 0.8.2.1.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
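
For reference, `asyncUpdateTopicPartitionVersion` in the comment above is a hypothetical helper, not an existing Kafka method. A minimal sketch of what it might look like against the 0.10-era `ZkUtils` internals (the path helper and `readDataMaybeNull` are assumed to be available in scope, as in Partition.scala; names and exact signatures are assumptions, not taken from the Kafka source):

{code}
// Hypothetical sketch only: re-reads the partition's leader-and-ISR znode
// so the cached zkVersion can catch up with zookeeper after a version
// conflict. zkUtils, topic and partitionId are assumed in scope.
import scala.concurrent.{ExecutionContext, Future}

def asyncRefreshZkVersion(topic: String, partitionId: Int)
                         (implicit ec: ExecutionContext): Future[Int] = Future {
  // readDataMaybeNull returns the znode data plus a Stat; the Stat's
  // version field is the zkVersion the broker should be caching.
  val path = zkUtils.getTopicPartitionLeaderAndIsrPath(topic, partitionId)
  val (_, stat) = zkUtils.readDataMaybeNull(path)
  stat.getVersion
}
{code}

Since the sketch returns a `Future[Int]`, the caller would update the cached `zkVersion` in a completion callback (e.g. `asyncRefreshZkVersion(topic, partitionId).foreach(v => zkVersion = v)`) rather than assigning it directly as in the snippet above, so the ISR maintenance path is never blocked on zookeeper.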