[
https://issues.apache.org/jira/browse/KAFKA-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865486#comment-15865486
]
Prasanna Gautam commented on KAFKA-2729:
----------------------------------------
This is still reproducible in Kafka 0.10.1.1 when Kafka brokers are partitioned
from each other and ZooKeeper is briefly disconnected from the brokers and then
comes back. This leaves the brokers stuck on the Cached zkVersion comparison
and unable to expand the ISR.
The code in Partition.scala does not seem to handle enough error conditions
beyond the stale zkVersion. In addition to skipping the update in the current
loop, I think it should reconnect to ZooKeeper to refresh the current state
and version.
Here's a suggestion for doing this. Doing it asynchronously doesn't break the
flow, and it lets you update the state. zkVersion may not be the only thing to
update here.
{code}
val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch,
  newIsr.map(r => r.brokerId).toList, zkVersion)
val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkUtils, topic, partitionId,
  newLeaderAndIsr, controllerEpoch, zkVersion)
if (updateSucceeded) {
  replicaManager.recordIsrChange(new TopicAndPartition(topic, partitionId))
  inSyncReplicas = newIsr
  zkVersion = newVersion
  trace("ISR updated to [%s] and zkVersion updated to [%d]".format(newIsr.mkString(","), zkVersion))
} else {
  info("Cached zkVersion [%d] not equal to that in zookeeper, skip updating ISR".format(zkVersion))
  // Suggested change: refresh the cached version from ZooKeeper
  // (asyncUpdateTopicPartitionVersion is the proposed helper)
  zkVersion = asyncUpdateTopicPartitionVersion(topic, partitionId)
}
{code}
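To make the intent concrete, here is a minimal, self-contained sketch of the
"refresh the cached version on mismatch" idea. It does not use Kafka or
ZooKeeper: VersionedStore and PartitionCache are hypothetical stand-ins for
the znode and for the cached fields in Partition.scala, and the refresh is
synchronous for simplicity rather than asynchronous as suggested above.
Whether a blind refresh is safe in the real broker is not settled by this
sketch.

```scala
// Hypothetical in-memory stand-in for a ZooKeeper znode: data plus a
// version that increments on every successful write.
final class VersionedStore(initialIsr: List[Int]) {
  private var isr: List[Int] = initialIsr
  private var version: Int = 0

  // Conditional write: succeeds only if the caller's expected version matches.
  def conditionalUpdate(newIsr: List[Int], expectedVersion: Int): (Boolean, Int) =
    synchronized {
      if (expectedVersion == version) {
        isr = newIsr
        version += 1
        (true, version)
      } else {
        (false, -1)
      }
    }

  // Unconditional read of the current data and version.
  def read(): (List[Int], Int) = synchronized((isr, version))
}

// Hypothetical partition-side cache mirroring inSyncReplicas and zkVersion.
final class PartitionCache(store: VersionedStore) {
  private val (initialIsr, initialVersion) = store.read()
  var inSyncReplicas: List[Int] = initialIsr
  var zkVersion: Int = initialVersion

  // Attempt the ISR update. On a version mismatch, re-read the store so the
  // cache is no longer stale and a later attempt can succeed.
  def tryUpdateIsr(newIsr: List[Int]): Boolean = {
    val (updateSucceeded, newVersion) = store.conditionalUpdate(newIsr, zkVersion)
    if (updateSucceeded) {
      inSyncReplicas = newIsr
      zkVersion = newVersion
    } else {
      val (currentIsr, currentVersion) = store.read()
      inSyncReplicas = currentIsr
      zkVersion = currentVersion
    }
    updateSucceeded
  }
}
```

Without the refresh in the else branch, every later call would keep comparing
against the same stale version and fail indefinitely, which matches the
behaviour described in this issue.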
> Cached zkVersion not equal to that in zookeeper, broker not recovering.
> -----------------------------------------------------------------------
>
> Key: KAFKA-2729
> URL: https://issues.apache.org/jira/browse/KAFKA-2729
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.2.1
> Reporter: Danil Serdyuchenko
>
> After a small network wobble where ZooKeeper nodes couldn't reach each other,
> we started seeing a large number of under-replicated partitions. The ZooKeeper
> cluster recovered; however, we continued to see a large number of
> under-replicated partitions. Two brokers in the Kafka cluster were showing this
> in the logs:
> {code}
> [2015-10-27 11:36:00,888] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Shrinking ISR for partition [__samza_checkpoint_event-creation_1,3] from 6,5 to 5 (kafka.cluster.Partition)
> [2015-10-27 11:36:00,891] INFO Partition [__samza_checkpoint_event-creation_1,3] on broker 5: Cached zkVersion [66] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> {code}
> This happened for all of the topics on the affected brokers. Both brokers only
> recovered after a restart. Our own investigation yielded nothing; I was hoping
> you could shed some light on this issue, and possibly on whether it's related
> to https://issues.apache.org/jira/browse/KAFKA-1382 , although we're using
> 0.8.2.1.