[ 
https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107857#comment-16107857
 ] 

James Cheng edited comment on KAFKA-1120 at 7/31/17 7:53 PM:
-------------------------------------------------------------

[~noslowerdna] [~junrao],

I retested this with Kafka 0.11. The problem still exists.

I followed the steps from my 24/Feb/17 22:57 comment. I ran it maybe 10 times 
in a row. Every single time, the broker that I restarted came back up and did 
not take leadership for any partitions. In addition, it only became a follower 
for about half the partitions.

The fact that it became follower for half the partitions shows that the 
controller is at least aware that the broker exists (that is, the controller 
successfully saw the broker come back online). But the controller didn't tell 
the broker to follow all the partitions that it should have.



was (Author: wushujames):
Hi,

I retested this with Kafka 0.11. The problem still exists.

I followed the steps from my 24/Feb/17 22:57 comment. I ran it maybe 10 times 
in a row. Every single time, the broker that I restarted came back up and did 
not take leadership for any partitions. In addition, it only became a follower 
for about half the partitions.

The fact that it became follower for half the partitions shows that the 
controller is at least aware that the broker exists (that is, the controller 
successfully saw the broker come back online). But the controller didn't tell 
the broker to follow all the partitions that it should have.


> Controller could miss a broker state change 
> --------------------------------------------
>
>                 Key: KAFKA-1120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1120
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Jun Rao
>              Labels: reliability
>
> When the controller is in the middle of processing a task (e.g., preferred 
> leader election, broker change), it holds a controller lock. During this 
> time, a broker could have de-registered and re-registered itself in ZK. After 
> the controller finishes processing the current task, it will start processing 
> the logic in the broker change listener. However, it will see no broker 
> change and therefore won't do anything to the restarted broker. This broker 
> will be in a weird state since the controller doesn't inform it to become the 
> leader of any partition. Yet, the cached metadata in other brokers could 
> still list that broker as the leader for some partitions. Client requests 
> routed to that broker will then get a TopicOrPartitionNotExistException. This 
> broker will continue to be in this bad state until it's restarted again.
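
The race in the description above can be illustrated with a small, self-contained 
sketch. It is not Kafka's controller code. It assumes a hypothetical model in 
which each registered broker ID maps to a "registration stamp" that changes 
whenever the broker re-registers (something like the creation zxid of its 
ephemeral znode could play that role). The class and method names are made up 
for illustration. The sketch shows why a diff on broker IDs alone sees nothing 
after a quick bounce, while a diff that also compares stamps would notice it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from Kafka.
public class BrokerBounceSketch {

    /** Broker IDs that are present in exactly one of the two views. */
    public static Set<Integer> changedByIdOnly(Map<Integer, Long> cached,
                                               Map<Integer, Long> current) {
        Set<Integer> changed = new TreeSet<>();
        for (Integer id : cached.keySet()) {
            if (!current.containsKey(id)) changed.add(id);   // broker went away
        }
        for (Integer id : current.keySet()) {
            if (!cached.containsKey(id)) changed.add(id);    // broker appeared
        }
        return changed;
    }

    /** As above, plus IDs whose registration stamp differs between the views. */
    public static Set<Integer> changedByIdAndStamp(Map<Integer, Long> cached,
                                                   Map<Integer, Long> current) {
        Set<Integer> changed = changedByIdOnly(cached, current);
        for (Map.Entry<Integer, Long> e : current.entrySet()) {
            Long old = cached.get(e.getKey());
            if (old != null && !old.equals(e.getValue())) {
                changed.add(e.getKey());                     // same ID, new registration
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // The controller's cached view before it takes the lock.
        Map<Integer, Long> cached = new HashMap<>();
        cached.put(1, 100L);
        cached.put(2, 101L);
        cached.put(3, 102L);

        // While the lock is held, broker 2 de-registers and re-registers.
        Map<Integer, Long> current = new HashMap<>(cached);
        current.remove(2);
        current.put(2, 250L);

        System.out.println("id-only diff:  " + changedByIdOnly(cached, current));
        System.out.println("id+stamp diff: " + changedByIdAndStamp(cached, current));
    }
}
```

With the values above, the ID-only diff prints `[]` and the ID-plus-stamp diff 
prints `[2]`. Whether comparing stamps is the right fix for the controller is a 
separate question; the sketch only shows how the bounce becomes invisible.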



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
