[ https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331093#comment-16331093 ]
Jeff Widman edited comment on KAFKA-1120 at 1/18/18 7:58 PM:
-------------------------------------------------------------

The issue description says "the broker will be in this weird state until it is restarted."

Could this also be fixed by simply forcing a controller re-election by removing the /controller znode, since the new controller will re-identify the leaders? In some scenarios, that seems like it might be a lighter-weight solution.

I understand this does not fix the root cause in the code, but I just want to be sure I understand what options I have if we hit this in an emergency situation.


was (Author: jeffwidman):
The issue description says "the broker will be in this weird state until it is restarted."

Couldn't this also be fixed by simply forcing a controller re-election? Since it will re-identify the leaders?

> Controller could miss a broker state change
> --------------------------------------------
>
>                 Key: KAFKA-1120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1120
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Jun Rao
>            Assignee: Mickael Maison
>            Priority: Major
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> When the controller is in the middle of processing a task (e.g., preferred
> leader election, broker change), it holds a controller lock. During this
> time, a broker could have de-registered and re-registered itself in ZK. After
> the controller finishes processing the current task, it will start processing
> the logic in the broker change listener. However, it will see no broker
> change and therefore won't do anything to the restarted broker. This broker
> will be in a weird state since the controller doesn't inform it to become the
> leader of any partition. Yet, the cached metadata in other brokers could
> still list that broker as the leader for some partitions. Client requests
> routed to that broker will then get a TopicOrPartitionNotExistException. This
> broker will continue to be in this bad state until it's restarted again.
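
To make the race in the issue description concrete, here is a toy model in Python. It is only a sketch and not Kafka's controller code: FakeZk, IdOnlyController, ZxidAwareController and simulate_bounce are hypothetical names invented for illustration. It assumes the broker change listener compares only the set of registered broker ids, which is how the description reads. The second controller compares a per-registration creation id (ZooKeeper records a creation zxid for every znode) as one conceivable way to notice the bounce; that is an assumption for illustration, not necessarily how the issue was fixed.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FakeZk:
    """Toy stand-in for the broker registrations kept in ZooKeeper."""
    next_zxid: int = 1
    # broker id -> creation zxid of its current registration
    brokers: Dict[int, int] = field(default_factory=dict)

    def register(self, broker_id: int) -> None:
        # Every new registration gets a fresh, strictly increasing zxid.
        self.brokers[broker_id] = self.next_zxid
        self.next_zxid += 1

    def deregister(self, broker_id: int) -> None:
        del self.brokers[broker_id]


class IdOnlyController:
    """Hypothetical controller that tracks only the set of broker ids."""

    def __init__(self, zk: FakeZk) -> None:
        self.zk = zk
        self.known: Set[int] = set(zk.brokers)

    def on_broker_change(self) -> Set[int]:
        """Return the broker ids this controller would (re)initialize."""
        current = set(self.zk.brokers)
        new_brokers = current - self.known
        self.known = current
        return new_brokers


class ZxidAwareController:
    """Hypothetical controller that also tracks each registration's zxid."""

    def __init__(self, zk: FakeZk) -> None:
        self.zk = zk
        self.known: Dict[int, int] = dict(zk.brokers)

    def on_broker_change(self) -> Set[int]:
        """Return broker ids that are new or whose registration changed."""
        current = dict(self.zk.brokers)
        changed = {b for b, zxid in current.items() if self.known.get(b) != zxid}
        self.known = current
        return changed


def simulate_bounce(controller_cls) -> Set[int]:
    """Bounce broker 2 while the controller is busy, then run the listener."""
    zk = FakeZk()
    for broker_id in (1, 2, 3):
        zk.register(broker_id)
    controller = controller_cls(zk)

    # The controller is holding its lock for another task; meanwhile
    # broker 2 de-registers and re-registers.
    zk.deregister(2)
    zk.register(2)

    # The lock is released and the queued broker change callback runs.
    return controller.on_broker_change()


if __name__ == "__main__":
    print("id-only controller re-initializes:", simulate_bounce(IdOnlyController))
    print("zxid-aware controller re-initializes:", simulate_bounce(ZxidAwareController))
```

In this model the id-only controller returns an empty set, so the bounced broker is never told which partitions it leads, which matches the "weird state" described above. Whether removing the /controller znode clears that state in a real cluster is the open question raised in the comment; the model does not attempt to answer it.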