[jira] [Commented] (KAFKA-3042) updateIsr should stop after failed several times due to zkVersion issue

Flavio Junqueira (JIRA) Tue, 12 Apr 2016 12:34:49 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237824#comment-15237824
 ]


Flavio Junqueira commented on KAFKA-3042:
-----------------------------------------

hey [~wushujames]

bq. you said that broker 3 failed to release leadership to broker 4 because 
broker 4 was offline

It is actually broker 1 that failed to release leadership.

bq. What is the correct behavior for that scenario?

The behavior isn't incorrect in the following sense. We cannot completely 
prevent a single broker from being partitioned from the other replicas. If this 
broker is the leader before the partition, then it may remain the leader for 
some time. In the meantime, the other replicas may form a new ISR and make 
progress independently. Importantly, though, the partitioned broker won't be 
able to commit anything on its own, assuming that the minimum ISR is at least 
two.
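
To make the safety argument concrete, here is a minimal sketch of the commit rule as I understand it. It is not Kafka's actual code: the class name, the canCommit helper, and the broker ids are made up for illustration. A write counts as committed only when every replica in the leader's view of the ISR has it and that ISR holds at least the minimum number of replicas.

```java
import java.util.Set;

public class MinIsrSketch {
    // Hypothetical helper, not a Kafka API: a write is committed only if the
    // leader's ISR has at least minIsr members AND all of them acknowledged it.
    static boolean canCommit(Set<Integer> isr, Set<Integer> acked, int minIsr) {
        return isr.size() >= minIsr && acked.containsAll(isr);
    }

    public static void main(String[] args) {
        int minIsr = 2; // assumed setting, i.e. "minimum ISR is at least two"

        // Stale leader (broker 1) still caches ISR {1, 4}, but no follower fetches from it.
        System.out.println(canCommit(Set.of(1, 4), Set.of(1), minIsr)); // false

        // Stale leader after shrinking its ISR to itself: below the minimum.
        System.out.println(canCommit(Set.of(1), Set.of(1), minIsr));    // false

        // The other replicas form a new ISR {4, 5} and replicate among themselves.
        System.out.println(canCommit(Set.of(4, 5), Set.of(4, 5), minIsr)); // true
    }
}
```

In both stale-leader cases nothing gets committed, so the isolated broker cannot diverge from the new ISR, which is the sense in which the behavior is safe.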

In the scenario we are discussing, we don't have a network partition, but the 
behavior is equivalent: broker 1 will remain the leader until it is able to 
follow successfully. The bad part is that broker 1 isn't partitioned away: 
it is talking to other controllers, and the broker should be brought back into 
a state in which it can make progress with that partition and others that are 
equally stuck. The bottom line is that it is safe, but we clearly want the 
broker up and making progress with those partitions.

Let me point out that from the logs, it looks like you have unclean leader 
election enabled because of this log message:

{noformat}
[2016-04-09 00:40:50,911] WARN [OfflinePartitionLeaderSelector]: No broker in 
ISR is alive for [tec1.en2.frontend.syncPing,7]. 
Elect leader 4 from live brokers 4. There's potential data loss. 
(kafka.controller.OfflinePartitionLeaderSelector)
{noformat} 

and no minimum ISR set:

{noformat}
[2016-04-09 00:56:53,009] WARN [Controller 5]: Cannot remove replica 1 from ISR 
of partition [tec1.en2.frontend.syncPing,7]
since it is not in the ISR. Leader = 4 ; ISR = List(4) 
{noformat}

Those options can cause some data loss.
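
For reference, a sketch of the settings that would close both gaps. The values are illustrative and I am assuming a replication factor of at least 3; adjust to your setup. Both can be set in the broker config or overridden per topic.

```properties
# Never elect a leader from outside the ISR (avoids the "potential data loss" election above)
unclean.leader.election.enable=false
# Require at least two in-sync replicas before a write is accepted
min.insync.replicas=2
```

Note that min.insync.replicas is only enforced for producers that request acknowledgement from all in-sync replicas (acks=-1).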

> updateIsr should stop after failed several times due to zkVersion issue
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-3042
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3042
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>         Environment: jdk 1.7
> centos 6.4
>            Reporter: Jiahongchao
>         Attachments: controller.log, server.log.2016-03-23-01, 
> state-change.log
>
>
> sometimes one broker may repeatedly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker considers itself the leader when in fact 
> it's a follower.
> So after several failed tries, it needs to find out who the leader is



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
