[ 
https://issues.apache.org/jira/browse/KAFKA-9672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236416#comment-17236416
 ] 

Jose Armando Garcia Sancio commented on KAFKA-9672:
---------------------------------------------------

I was not able to reproduce this issue, but looking at the code and the trace of 
messages sent by the controller, this is what I think is happening.

Assuming that the initial partition assignment and state is:
{code:java}
Replicas: 0, 1, 2
ISR: 0, 1, 2
Leader: 0
LeaderEpoch: 1{code}
This state is replicated to all of the replicas (0, 1, 2) using LeaderAndIsr 
requests.
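
To make the discussion below easier to follow, here is a minimal sketch of the 
per-partition state that the controller propagates in LeaderAndIsr requests. The 
class and field names are illustrative, not Kafka's internal types:
{code:java}
import java.util.List;

// Simplified model of the per-partition state replicated via LeaderAndIsr requests.
final class PartitionState {
    final List<Integer> replicas;    // full replica set
    final List<Integer> adding;      // replicas being added by a reassignment
    final List<Integer> removing;    // replicas being removed by a reassignment
    final List<Integer> isr;         // in-sync replicas
    final int leader;                // broker id of the current leader
    final int leaderEpoch;           // bumped on leadership changes

    PartitionState(List<Integer> replicas, List<Integer> adding, List<Integer> removing,
                   List<Integer> isr, int leader, int leaderEpoch) {
        this.replicas = replicas;
        this.adding = adding;
        this.removing = removing;
        this.isr = isr;
        this.leader = leader;
        this.leaderEpoch = leaderEpoch;
    }
}

// The initial state above would be:
// new PartitionState(List.of(0, 1, 2), List.of(), List.of(), List.of(0, 1, 2), 0, 1);
{code}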

When the user attempts to perform a reassignment that replaces replica 0 with 
replica 3, the controller bumps the leader epoch and updates the assignment info:
{code:java}
Replicas: 0, 1, 2, 3
Adding: 3
Removing: 0
ISR: 0, 1, 2
Leader: 0
LeaderEpoch: 2{code}
This state is replicated to all of the replicas (0, 1, 2, 3) using LeaderAndIsr 
requests.

The system roughly stays in this state until all of the target replicas have 
joined the ISR. Once all of the target replicas have joined the ISR, the 
controller wants to perform the following flow:

1 - The controller moves the leader if necessary (i.e. the current leader is not 
in the new replica set) and stops the leader from letting "removing" replicas 
join the ISR.

The second requirement (stopping the leader from adding "removing" replicas back 
to the ISR) is accomplished by bumping the leader epoch and only sending the new 
leader epoch to the target replicas (1, 2, 3). Unfortunately, due to how the 
controller is implemented, this is done by deleting the "removing" replicas from 
the in-memory state without modifying the ISR state. At this point we have the 
following ZK state:
{code:java}
Replicas: 0, 1, 2, 3
Adding:
Removing: 0
ISR: 0, 1, 2, 3
Leader: 1
LeaderEpoch: 3{code}
but the following LeaderAndIsr requests are sent to replicas 1, 2, 3
{code:java}
Replicas: 1, 2, 3
Adding:
Removing:
ISR: 0, 1, 2, 3
Leader: 1
LeaderEpoch: 3{code}
This works because replica 0 will have a stale leader epoch, which means that 
its Fetch requests will be rejected by the (new) leader.
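
As a rough illustration of this fencing (the class and method here are 
hypothetical; only the {{Errors}} codes are real Kafka error codes), the leader 
compares the leader epoch carried in a follower's Fetch request against its own 
current epoch and rejects stale fetchers:
{code:java}
import org.apache.kafka.common.protocol.Errors;

// Sketch of leader-epoch fencing: replica 0 still fetches with leader epoch 2
// while the new leader is at epoch 3, so its fetches are fenced off.
final class FetchFencing {
    static Errors validateFetchEpoch(int requestLeaderEpoch, int currentLeaderEpoch) {
        if (requestLeaderEpoch < currentLeaderEpoch) {
            return Errors.FENCED_LEADER_EPOCH;   // stale follower, e.g. replica 0
        } else if (requestLeaderEpoch > currentLeaderEpoch) {
            return Errors.UNKNOWN_LEADER_EPOCH;  // follower is ahead of the leader
        } else {
            return Errors.NONE;
        }
    }
}
{code}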

2 - The controller removes replica 0 from the ISR by updating ZK and sending 
the appropriate LeaderAndIsr requests.

3 - The controller removes replica 0 from the replica set by updating ZK and 
sending the appropriate LeaderAndIsr requests.
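
In terms of the {{PartitionState}} sketch above, steps 2 and 3 would roughly look 
like the following. This is illustrative only; whether the leader epoch is bumped 
at these steps is left out of the sketch:
{code:java}
import java.util.List;
import java.util.stream.Collectors;

final class ReassignmentCompletion {
    // Step 2: drop the "removing" replica (0) from the ISR.
    static PartitionState removeFromIsr(PartitionState s, int removed) {
        List<Integer> newIsr =
            s.isr.stream().filter(r -> r != removed).collect(Collectors.toList());
        return new PartitionState(s.replicas, s.adding, s.removing, newIsr,
                                  s.leader, s.leaderEpoch);
    }

    // Step 3: drop the "removing" replica (0) from the replica set and clear
    // the adding/removing lists, completing the reassignment.
    static PartitionState removeFromReplicas(PartitionState s, int removed) {
        List<Integer> newReplicas =
            s.replicas.stream().filter(r -> r != removed).collect(Collectors.toList());
        return new PartitionState(newReplicas, List.of(), List.of(), s.isr,
                                  s.leader, s.leaderEpoch);
    }
}
{code}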

 

Conclusion

If this flow executes to completion, everything is okay. The problem is what 
happens if steps 2 and 3 don't get to execute. I am unable to reproduce this 
with tests or by walking the code, but if steps 2 and 3 don't execute while the 
controller stays alive, there is a flow where the controller persists the 
following state to ZK:
{code:java}
Replicas: 1, 2, 3
Adding:
Removing:
ISR: 0, 1, 2, 3
Leader: 1
LeaderEpoch: 3{code}
This causes the reassignment flow to terminate with the system stuck in this 
state. The state is persisted at this line in the controller code:

https://github.com/apache/kafka/blob/43fd630d80332f2b3b3512a712100825a8417704/core/src/main/scala/kafka/controller/KafkaController.scala#L728
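
For completeness, here is a minimal sketch of the inconsistency that this 
persisted state introduces: the ISR is no longer a subset of the replica set, 
which appears to be what the {{isr-expiration}} task later trips over on the 
leader with {{ReplicaNotAvailableException}}. The names below are illustrative, 
not Kafka code:
{code:java}
import java.util.List;

final class StuckReassignmentCheck {
    // True when the ISR references a replica id that is not in the replica set,
    // which is the state left behind when steps 2 and 3 never execute.
    static boolean isrNotSubsetOfReplicas(List<Integer> replicas, List<Integer> isr) {
        return isr.stream().anyMatch(r -> !replicas.contains(r));
    }
}

// For the persisted state above:
//   isrNotSubsetOfReplicas(List.of(1, 2, 3), List.of(0, 1, 2, 3)) == true
{code}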

> Dead brokers in ISR cause isr-expiration to fail with exception
> ---------------------------------------------------------------
>
>                 Key: KAFKA-9672
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9672
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.0, 2.4.1
>            Reporter: Ivan Yurchenko
>            Assignee: Jose Armando Garcia Sancio
>            Priority: Major
>
> We're running Kafka 2.4 and facing a pretty strange situation.
>  Let's say there were three brokers in the cluster 0, 1, and 2. Then:
>  1. Broker 3 was added.
>  2. Partitions were reassigned from broker 0 to broker 3.
>  3. Broker 0 was shut down (not gracefully) and removed from the cluster.
>  4. We see the following state in ZooKeeper:
> {code:java}
> ls /brokers/ids
> [1, 2, 3]
> get /brokers/topics/foo
> {"version":2,"partitions":{"0":[2,1,3]},"adding_replicas":{},"removing_replicas":{}}
> get /brokers/topics/foo/partitions/0/state
> {"controller_epoch":123,"leader":1,"version":1,"leader_epoch":42,"isr":[0,2,3,1]}
> {code}
> It means that the dead broker 0 remains in the partition's ISR. A big share of 
> the partitions in the cluster have this issue.
> This is actually causing errors:
> {code:java}
> Uncaught exception in scheduled task 'isr-expiration' 
> (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.ReplicaNotAvailableException: Replica with id 
> 12 is not available on broker 17
> {code}
> It means that effectively the {{isr-expiration}} task is not working any more.
> I have a suspicion that this was introduced by [this commit (line 
> selected)|https://github.com/apache/kafka/commit/57baa4079d9fc14103411f790b9a025c9f2146a4#diff-5450baca03f57b9f2030f93a480e6969R856]
> Unfortunately, I haven't been able to reproduce this in isolation.
> Any hints about how to reproduce (so I can write a patch) or mitigate the 
> issue on a running cluster are welcome.
> Generally, I assume that not throwing {{ReplicaNotAvailableException}} on a 
> dead (i.e. non-existent) broker, considering it out-of-sync and removing it 
> from the ISR should fix the problem.
>  



