[ https://issues.apache.org/jira/browse/KAFKA-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442343#comment-16442343 ]

Manikumar commented on KAFKA-5813:
----------------------------------

This might have been fixed in the async ZK controller changes.

> Unexpected unclean leader election due to leader/controller's unusual event
> handling order
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5813
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5813
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 0.10.2.1
>            Reporter: Allen Wang
>            Priority: Minor
>
> We experienced an unexpected unclean leader election after a network glitch
> on the leader of a partition. We have replication factor 2.
> Here is the sequence of events gathered from various logs:
> 1. ZK session timeout happens for the leader of the partition
> 2. A new ZK session is established for the leader
> 3. The leader removes the follower from the ISR (which might be caused by
> replication delay due to the network problem) and updates the ISR in ZK
> 4. The controller processes the BrokerChangeListener event from step 1,
> in which the leader appears to be offline
> 5. Because the ISR in ZK has already been updated by the leader to remove the
> follower, the controller makes an unclean leader election
> 6. The controller processes the second BrokerChangeListener event, from
> step 2, and marks the broker online again
> It seems to me that step 4 happens too late. If it happened right after step
> 1, it would be a clean leader election and hopefully the producer would
> immediately switch to the new leader, thus avoiding a consumer offset reset.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)