[ 
https://issues.apache.org/jira/browse/KAFKA-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236420#comment-17236420
 ] 

Jose Armando Garcia Sancio edited comment on KAFKA-9672 at 11/20/20, 8:14 PM:
------------------------------------------------------------------------------

Based on my observations here: 
https://issues.apache.org/jira/browse/KAFKA-9672?focusedCommentId=17236416&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17236416

Solution 1: I think the ideal solution is to never allow the ISR to be a 
superset of the replica set. Unfortunately, this is not easy to do with how the 
controller implementation manages writes to ZK.

Solution 2: Another solution is to allow the ISR to be a superset of the 
replica set but also allow the Leader to remove replicas from the ISR if they 
are not in the replica set.
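
A rough sketch of what solution 2 could look like on the Leader side 
(hypothetical method name, not the actual Partition/controller code): tolerate 
an ISR read from ZK that is a superset of the replica set, but drop the stray 
ids the next time the Leader updates the ISR, instead of failing a lookup for 
a replica that no longer exists.

{code:scala}
// Hypothetical sketch of solution 2, not the actual kafka.cluster.Partition code.
// The ISR loaded from ZK may contain ids that are no longer assigned to the
// partition (e.g. broker 0 after reassignment); the Leader filters them out
// before acting on or writing back the ISR.
def isrWithoutStrayReplicas(isrFromZk: Set[Int], assignedReplicas: Set[Int]): Set[Int] = {
  val strayReplicas = isrFromZk -- assignedReplicas
  if (strayReplicas.nonEmpty)
    println(s"Removing $strayReplicas from ISR $isrFromZk: not in the replica set $assignedReplicas")
  isrFromZk -- strayReplicas
}
{code}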


> Dead brokers in ISR cause isr-expiration to fail with exception
> ---------------------------------------------------------------
>
>                 Key: KAFKA-9672
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9672
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.0, 2.4.1
>            Reporter: Ivan Yurchenko
>            Assignee: Jose Armando Garcia Sancio
>            Priority: Major
>
> We're running Kafka 2.4 and facing a pretty strange situation.
>  Let's say there were three brokers in the cluster: 0, 1, and 2. Then:
>  1. Broker 3 was added.
>  2. Partitions were reassigned from broker 0 to broker 3.
>  3. Broker 0 was shut down (not gracefully) and removed from the cluster.
>  4. We see the following state in ZooKeeper:
> {code:java}
> ls /brokers/ids
> [1, 2, 3]
> get /brokers/topics/foo
> {"version":2,"partitions":{"0":[2,1,3]},"adding_replicas":{},"removing_replicas":{}}
> get /brokers/topics/foo/partitions/0/state
> {"controller_epoch":123,"leader":1,"version":1,"leader_epoch":42,"isr":[0,2,3,1]}
> {code}
> It means the dead broker 0 remains in the partition's ISR. A big share of 
> the partitions in the cluster have this issue.
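> One way to confirm which partitions are affected (an illustrative Scala 
> sketch using the two znode payloads above, not an existing Kafka tool) is to 
> compare each partition's ISR against its replica assignment:
> {code:scala}
> // Illustrative only: compares the ISR from .../partitions/0/state with the
> // assignment from /brokers/topics/foo and reports ISR members that are no
> // longer part of the replica set.
> object StrayIsrCheck {
>   def strayIsrMembers(assignedReplicas: Set[Int], isr: Set[Int]): Set[Int] =
>     isr -- assignedReplicas
>   def main(args: Array[String]): Unit = {
>     val assignedReplicas = Set(2, 1, 3)    // "0":[2,1,3] from /brokers/topics/foo
>     val isr = Set(0, 2, 3, 1)              // "isr":[0,2,3,1] from the state znode
>     println(s"ISR members outside the replica set: ${strayIsrMembers(assignedReplicas, isr)}")
>   }
> }
> {code}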
> This is actually causing errors:
> {code:java}
> Uncaught exception in scheduled task 'isr-expiration' 
> (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.ReplicaNotAvailableException: Replica with id 
> 12 is not available on broker 17
> {code}
> It means that, effectively, the {{isr-expiration}} task is not working anymore.
> I have a suspicion that this was introduced by [this commit (line 
> selected)|https://github.com/apache/kafka/commit/57baa4079d9fc14103411f790b9a025c9f2146a4#diff-5450baca03f57b9f2030f93a480e6969R856]
> Unfortunately, I haven't been able to reproduce this in isolation.
> Any hints about how to reproduce (so I can write a patch) or mitigate the 
> issue on a running cluster are welcome.
> Generally, I assume that not throwing {{ReplicaNotAvailableException}} for a 
> dead (i.e. non-existent) broker, and instead considering it out-of-sync and 
> removing it from the ISR, should fix the problem.
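> As a sketch of that idea (hypothetical names, not the actual ISR-shrink code 
> in {{Partition}}), the lag check could simply classify an ISR member with no 
> known replica state as out-of-sync instead of throwing:
> {code:scala}
> // Illustrative only: an ISR member for which the leader has no replica state
> // (e.g. the removed broker 0) is reported as out-of-sync, so the normal ISR
> // shrink removes it instead of raising ReplicaNotAvailableException.
> def outOfSyncIsrMembers(isr: Set[Int],
>                         lastCaughtUpTimeMs: Map[Int, Long],
>                         nowMs: Long,
>                         replicaLagTimeMaxMs: Long): Set[Int] =
>   isr.filter { replicaId =>
>     lastCaughtUpTimeMs.get(replicaId) match {
>       case Some(ts) => nowMs - ts > replicaLagTimeMaxMs // usual lag-based check
>       case None     => true // unknown/dead replica: treat as out of sync
>     }
>   }
> {code}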
>  



