[ 
https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277146#comment-16277146
 ] 

Ramnatthan Alagappan edited comment on KAFKA-1120 at 12/4/17 5:46 PM:
----------------------------------------------------------------------

I ran into this issue and have a reproducible setup irrespective of the number 
of partitions or nodes. [~onurkaraman]'s analysis in comment 
[#comment-16113645] is correct. The root cause is that the shut-down broker 
restarts and re-registers with ZK within a short interval. When the broker 
shuts down, ZK delivers a callback for the deletion of the broker. Before 
ZKClient can reestablish the callback (by issuing a stat call), the broker 
registers with ZK again. By the time ZKClient reads /brokers/ids from ZK, the 
restarted broker already appears there again. As a result, the broker appears 
both in curBrokerIds and liveOrShuttingDownBrokerIds, so newBrokerIds is 
empty and the controller does nothing for the restarted broker. 
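
The race above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not the controller's actual code: FakeZk, its methods, on_broker_change and dead_broker_ids are hypothetical names, and the one-shot watch is modeled as a simple flag. Only curBrokerIds, liveOrShuttingDownBrokerIds and newBrokerIds (written here in snake_case) come from the comment.

```python
class FakeZk:
    """Hypothetical stand-in for /brokers/ids with a one-shot child watch."""

    def __init__(self, broker_ids):
        self.children = set(broker_ids)
        self.watch_armed = True
        self.pending_events = 0

    def _changed(self):
        # A one-shot watch fires once, then stays silent until re-armed.
        if self.watch_armed:
            self.watch_armed = False
            self.pending_events += 1

    def deregister(self, broker_id):
        self.children.discard(broker_id)
        self._changed()

    def register(self, broker_id):
        self.children.add(broker_id)
        self._changed()

    def get_children_and_rearm(self):
        # Reading the children is what re-arms the watch.
        self.watch_armed = True
        return set(self.children)


def on_broker_change(zk, live_or_shutting_down_broker_ids):
    """Assumed shape of the listener: diff ZK against the cached view."""
    cur_broker_ids = zk.get_children_and_rearm()
    new_broker_ids = cur_broker_ids - live_or_shutting_down_broker_ids
    dead_broker_ids = live_or_shutting_down_broker_ids - cur_broker_ids
    return new_broker_ids, dead_broker_ids


zk = FakeZk({1, 2, 3})
controller_view = {1, 2, 3}  # the controller's cached broker set

zk.deregister(2)  # broker 2 shuts down: the watch fires and is disarmed
zk.register(2)    # broker 2 is back before the watch is re-armed: no event

events = zk.pending_events
new_ids, dead_ids = on_broker_change(zk, controller_view)
print(events, new_ids, dead_ids)  # 1 set() set()
```

A single event is delivered, but by the time the listener reads the children both differences are empty, so the bounce of broker 2 is invisible to a listener that only compares sets of ids.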


was (Author: ramanala):
I ran into this issue and have a reproducible setup irrespective of the number 
of partitions or nodes. [~onurkaraman]'s analysis in comment @  
[#comment-16113645] is correct. The root cause is that the shutdown broker 
restarts and registers with ZK in a short interval of time. During this time, 
ZK delivers a callback for deletion of the broker. Before ZKClient can 
reestablish the callback (by issuing a stat call), the broker registers with 
ZK. By the time ZKClient gets the /brokers/ids node from ZK, the shutdown 
broker also appears in /brokers/ids. With this, the shutdown broker appears 
both in curBrokerIds and liveOrShuttingDownBrokerIds, causing newBrokerIds to 
be empty, which causes this problem. 

> Controller could miss a broker state change 
> --------------------------------------------
>
>                 Key: KAFKA-1120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1120
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Jun Rao
>            Assignee: Mickael Maison
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> When the controller is in the middle of processing a task (e.g., preferred 
> leader election, broker change), it holds a controller lock. During this 
> time, a broker could have de-registered and re-registered itself in ZK. After 
> the controller finishes processing the current task, it will start processing 
> the logic in the broker change listener. However, it will see no broker 
> change and therefore won't do anything to the restarted broker. This broker 
> will be in a weird state since the controller doesn't inform it to become the 
> leader of any partition. Yet, the cached metadata in other brokers could 
> still list that broker as the leader for some partitions. Client requests 
> routed to that broker will then get a TopicOrPartitionNotExistException. This 
> broker will continue to be in this bad state until it's restarted again.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
