[ 
https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277309#comment-16277309
 ] 

Ramnatthan Alagappan commented on KAFKA-1120:
---------------------------------------------

My way of consistently reproducing this issue may not be entirely correct or 
desirable, but here are the steps. The key problem is the broker registering 
with ZK before two events have happened on the controller:

1. ZkClient re-registering the callback for /brokers/ids (the callback was 
   initially fired for the deletion of the broker).
2. ZkClient fetching the children of /brokers/ids to check what has changed.

Both happen in the fireChildChangedEvents function in the ZkClient library. To 
trigger the issue consistently, I patched ZkClient with a sleep inside this 
function:

    Thread.sleep(5000);  exists(path);  List<String> children = getChildren(path);

With this added delay, the deletion handler sleeps before re-registering the 
callback. If I restart the shut-down broker during this sleep, the restarted 
broker never appears in newBrokerIds. 
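
To make the ordering concrete, here is a small single-threaded sketch of the 
race. It is not Kafka or ZkClient code: the class MissedBrokerRestart, the 
simulate method, and the use of plain sets in place of ZK and the controller 
cache are all hypothetical, and I am assuming that newBrokerIds is computed as 
"children currently listed minus brokers the controller already considers 
live".

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class MissedBrokerRestart {

    /** Brokers that are listed in ZK but absent from the controller's cache (assumed logic). */
    static Set<Integer> newBrokerIds(Set<Integer> listed, Set<Integer> cached) {
        Set<Integer> added = new TreeSet<>(listed);
        added.removeAll(cached);
        return added;
    }

    /**
     * Replays one shutdown/restart of broker 2 and returns what the controller
     * would see as newBrokerIds. restartDuringDelay models the restart landing
     * inside the Thread.sleep(5000) window, before getChildren(path) runs.
     */
    static Set<Integer> simulate(boolean restartDuringDelay) {
        // Hypothetical stand-ins for /brokers/ids and the controller's live-broker cache.
        Set<Integer> zkChildren = new TreeSet<>(Arrays.asList(1, 2));
        Set<Integer> cached = new TreeSet<>(Arrays.asList(1, 2));

        zkChildren.remove(2);                 // broker 2 shuts down; the one-shot callback fires

        if (restartDuringDelay) {
            zkChildren.add(2);                // broker 2 re-registers before the controller reads
        }

        // Controller side: re-register the callback, then read the children.
        Set<Integer> listed = new TreeSet<>(zkChildren);
        Set<Integer> dead = new TreeSet<>(cached);
        dead.removeAll(listed);
        cached.removeAll(dead);               // drop brokers whose deletion was actually observed

        if (!restartDuringDelay) {
            zkChildren.add(2);                // restart after the read: a second event is delivered
            listed = new TreeSet<>(zkChildren);
        }
        return newBrokerIds(listed, cached);
    }

    public static void main(String[] args) {
        System.out.println("restart during delay: newBrokerIds=" + simulate(true));
        System.out.println("restart after read:   newBrokerIds=" + simulate(false));
    }
}
```

In the first case the listing read by the controller is identical to its 
cache, so neither a dead nor a new broker is detected; in the second case the 
deletion is observed first and the restarted broker shows up as new.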

> Controller could miss a broker state change 
> --------------------------------------------
>
>                 Key: KAFKA-1120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1120
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Jun Rao
>            Assignee: Mickael Maison
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> When the controller is in the middle of processing a task (e.g., preferred 
> leader election, broker change), it holds a controller lock. During this 
> time, a broker could have de-registered and re-registered itself in ZK. After 
> the controller finishes processing the current task, it will start processing 
> the logic in the broker change listener. However, it will see no broker 
> change and therefore won't do anything to the restarted broker. This broker 
> will be in a weird state since the controller doesn't inform it to become the 
> leader of any partition. Yet, the cached metadata in other brokers could 
> still list that broker as the leader for some partitions. Client requests 
> routed to that broker will then get a TopicOrPartitionNotExistException. This 
> broker will continue to be in this bad state until it's restarted again.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
