[
https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266652#comment-16266652
]
Mickael Maison commented on KAFKA-1120:
---------------------------------------
We've hit this issue ([KAFKA-3944]) a number of times in our 0.10.2 clusters.
When a broker is restarted too quickly, it can receive a batch of StopReplica
requests from the controller, which is still processing the ControlledShutdown
request. As [~wushujames] has mentioned, before 0.11 there is unfortunately no
real way, apart from waiting "long enough", to be sure the controller has
finished processing the ControlledShutdown request.
As far as I can tell, this issue should still be present in 1.0.0 (I haven't
had a chance to try reproducing it yet). If so, the solution suggested by
[~onurkaraman], using the broker's session id to prevent the broker from
processing requests for the previous generation, should work.
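To make the idea concrete, here is a rough sketch of such a generation check on
the broker side. It is not Kafka code: the class and method names are
hypothetical, and it assumes each broker registration can be mapped to a
monotonically increasing generation number (raw session ids may not be ordered,
so the number would have to be derived from the registration in some way). The
controller is assumed to stamp every request with the generation it believes
the broker is in.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch, not part of Kafka: tracks the broker's current
 * registration generation and rejects controller requests that were
 * issued for an earlier one.
 */
final class BrokerEpochGuard {
    // Generation of the broker's current registration (assumed monotonic).
    private final AtomicLong currentEpoch;

    BrokerEpochGuard(long initialEpoch) {
        this.currentEpoch = new AtomicLong(initialEpoch);
    }

    /** Called after the broker re-registers; the generation never moves backwards. */
    void onReRegistration(long newEpoch) {
        currentEpoch.accumulateAndGet(newEpoch, Math::max);
    }

    /**
     * Returns true if a request stamped with requestEpoch should be processed,
     * false if it targets a previous generation and should be dropped.
     */
    boolean shouldProcess(long requestEpoch) {
        return requestEpoch >= currentEpoch.get();
    }
}
```

With a check like this, a StopReplica request that the controller generated
while handling the ControlledShutdown of generation N would be ignored by the
restarted broker, which is already in generation N+1.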
> Controller could miss a broker state change
> --------------------------------------------
>
> Key: KAFKA-1120
> URL: https://issues.apache.org/jira/browse/KAFKA-1120
> Project: Kafka
> Issue Type: Sub-task
> Components: core
> Affects Versions: 0.8.1
> Reporter: Jun Rao
> Labels: reliability
> Fix For: 1.1.0
>
>
> When the controller is in the middle of processing a task (e.g., preferred
> leader election, broker change), it holds a controller lock. During this
> time, a broker could have de-registered and re-registered itself in ZK. After
> the controller finishes processing the current task, it will start processing
> the logic in the broker change listener. However, it will see no broker
> change and therefore won't do anything to the restarted broker. This broker
> will be in a weird state since the controller doesn't inform it to become the
> leader of any partition. Yet, the cached metadata in other brokers could
> still list that broker as the leader for some partitions. Client requests
> routed to that broker will then get a TopicOrPartitionNotExistException. This
> broker will continue to be in this bad state until it's restarted again.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)