[ https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237824#comment-15237824 ]
Flavio Junqueira commented on KAFKA-3042:
-----------------------------------------

hey [~wushujames]

bq. you said that broker 3 failed to release leadership to broker 4 because broker 4 was offline

it is actually broker the one that failed to release leadership.

bq. What is the correct behavior for that scenario?

The behavior isn't incorrect in the following sense. We cannot completely prevent a single broker from being partitioned from the other replicas. If this broker is the leader before the partition, then it may remain in this state for some time. In the meantime, the other replicas may form a new ISR and make progress independently. But, very importantly, the partitioned broker won't be able to commit anything on its own, assuming that the minimum ISR is at least two.

In the scenario we are discussing, we don't have a network partition, but the behavior is equivalent: broker 1 will remain the leader until it is able to follow successfully. The part that is bad is that broker 1 isn't partitioned away, it is talking to other controllers, and the broker should be brought back into a state in which it can make progress with that partition and others that are equally stuck. The bottom line is that it is safe, but we clearly want the broker up and making progress with those partitions.

Let me point out that from the logs, it looks like you have unclean leader election enabled because of this log message:

{noformat}
[2016-04-09 00:40:50,911] WARN [OfflinePartitionLeaderSelector]: No broker in ISR is alive for [tec1.en2.frontend.syncPing,7]. Elect leader 4 from live brokers 4. There's potential data loss. (kafka.controller.OfflinePartitionLeaderSelector)
{noformat}

and no minimum ISR set:

{noformat}
[2016-04-09 00:56:53,009] WARN [Controller 5]: Cannot remove replica 1 from ISR of partition [tec1.en2.frontend.syncPing,7] since it is not in the ISR. Leader = 4 ; ISR = List(4)
{noformat}

Those options can cause some data loss.
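
To make the safety argument above a bit more concrete, here is a toy Python model. It is only a sketch under stated assumptions, not Kafka code: `PartitionState`, `conditional_update` and `can_commit` are hypothetical names, the version counter stands in for the zkVersion mentioned in the issue, and the commit rule is simplified to "ISR size must be at least the minimum ISR".

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    """Hypothetical stand-in for the partition state kept in ZooKeeper."""
    leader: int
    isr: tuple
    version: int = 0

    def conditional_update(self, leader, isr, expected_version):
        """Apply the write only if the caller's cached version is current."""
        if expected_version != self.version:
            return False  # stale writer: the update is skipped
        self.leader = leader
        self.isr = tuple(isr)
        self.version += 1
        return True


def can_commit(isr, min_isr):
    """Simplified rule: a leader may commit only with enough in-sync replicas."""
    return len(isr) >= min_isr


# Broker 1 is leader and caches the version it last saw.
state = PartitionState(leader=1, isr=(1, 4))
cached_version_on_broker_1 = state.version

# The controller moves leadership to broker 4, which bumps the version.
controller_ok = state.conditional_update(4, (4,), state.version)

# Broker 1 still believes it is the leader and tries to shrink the ISR to itself.
stale_ok = state.conditional_update(1, (1,), cached_version_on_broker_1)

print(controller_ok, stale_ok, state.leader, state.isr)
print(can_commit((1,), min_isr=2), can_commit((1,), min_isr=1))
```

In this model the stale write from broker 1 is rejected, which is the analogue of the "Cached zkVersion ... not equal to that in zookeeper, skip updating ISR" message, and a broker whose ISR is only itself cannot commit when the minimum ISR is two, but can when it is one.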

> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7
> centos 6.4
>            Reporter: Jiahongchao
>         Attachments: controller.log, server.log.2016-03-23-01,
> state-change.log
>
>
> sometimes one broker may repeatly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker consider itself as the leader in fact it's
> a follower.
> So after several failed tries, it need to find out who is the leader

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)