[ https://issues.apache.org/jira/browse/KAFKA-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124325#comment-16124325 ]
Pierre Mage commented on KAFKA-5074:
------------------------------------

Running 0.11.0 and observing similar behaviour.

Sequence of events recorded in logs:
1. ZooKeeper session expires
2. Kafka controller stops broker 0
3. Kafka re-registers broker 0 in ZooKeeper
4. Leader cache [mytopic,29] -> (Leader:2,ISR:2,0,LeaderEpoch:0,ControllerEpoch:1)
5. Invoking state change to OfflineReplica for replicas [Topic=mytopic,Partition=29,Replica=0]
6. Retaining last ISR 0 of partition [mytopic,29] since unclean leader election is disabled
7. New leader and ISR for partition [mytopic,29] is {"leader":-1,"leader_epoch":4,"isr":[0]}
8. Not sending request (type=StopReplicaRequest...) to broker 0, since it is offline
9. Invoking state change to OnlineReplica for replicas [Topic=mytopic,Partition=29,Replica=0]
10. Cycle of failing preferred leader elections starts

OfflinePartitionLeaderSelector is not called as the partition's state is still OnlinePartition.

{code}
ERROR Controller 2 epoch 4 encountered error while electing leader for partition [mytopic,29] due to: Preferred replica 2 for partition [mytopic,29] is either not alive or not in the isr. Current leader and ISR [{"leader":-1,"leader_epoch":4,"isr":[0]}].
ERROR Controller 2 epoch 4 initiated state change for partition [mytopic,29] from OnlinePartition to OnlinePartition failed
{code}

> Transition to OnlinePartition without preferred leader in ISR fails
> -------------------------------------------------------------------
>
>                 Key: KAFKA-5074
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5074
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.9.0.0
>            Reporter: Dustin Cote
>
> Running 0.9.0.0, the controller can get into a state where it is no longer
> able to elect a leader for an Offline partition.
> It's unclear how this state is first achieved, but in the steady state,
> this happens:
> -There are partitions with a leader of -1
> -The Controller repeatedly attempts a preferred leader election for these
> partitions
> -The preferred leader election fails because the only replica in the ISR is
> not the preferred leader
> The log cycle looks like this:
> {code}
> [2017-04-12 18:00:18,891] INFO [Controller 8]: Starting preferred replica leader election for partitions topic,1
> [2017-04-12 18:00:18,891] INFO [Partition state machine on Controller 8]: Invoking state change to OnlinePartition for partitions topic,1
> [2017-04-12 18:00:18,892] INFO [PreferredReplicaPartitionLeaderSelector]: Current leader -1 for partition [topic,1] is not the preferred replica. Trigerring preferred replica leader election (kafka.controller.PreferredReplicaPartitionLeaderSelector)
> [2017-04-12 18:00:18,893] WARN [Controller 8]: Partition [topic,1] failed to complete preferred replica leader election. Leader is -1 (kafka.controller.KafkaController)
> {code}
> It's not clear if this would happen on versions later than 0.9.0.0.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
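
For anyone following the thread, the stuck cycle can be illustrated with a small toy model. This is only a sketch and is not Kafka's controller code: the names `PartitionState`, `ElectionError`, `elect_preferred_leader` and `elect_offline_leader` are hypothetical, and the replica assignment `[2, 0]` is an assumption inferred from the log lines in the comment (preferred replica 2, ISR containing 0). It shows why a preferred replica election can never succeed while the ISR holds only a non-preferred replica, whereas an election over any live ISR member would pick broker 0.

```python
from dataclasses import dataclass
from typing import List, Set

NO_LEADER = -1  # leader value seen in the logs when a partition has no leader


@dataclass(frozen=True)
class PartitionState:
    """Simplified leader/ISR record (hypothetical, for illustration only)."""
    leader: int
    isr: List[int]
    leader_epoch: int


class ElectionError(Exception):
    """Raised when no acceptable leader can be chosen."""


def elect_preferred_leader(assigned: List[int], state: PartitionState,
                           live: Set[int]) -> PartitionState:
    # The preferred replica is the first broker in the assignment list.
    preferred = assigned[0]
    if preferred in live and preferred in state.isr:
        return PartitionState(preferred, list(state.isr), state.leader_epoch + 1)
    # Mirrors the wording of the error in the logs: only this one replica
    # is considered, so the election fails even if another ISR member is alive.
    raise ElectionError(
        f"Preferred replica {preferred} is either not alive or not in the isr")


def elect_offline_leader(assigned: List[int], state: PartitionState,
                         live: Set[int]) -> PartitionState:
    # Assumed simplification of an offline-partition election with unclean
    # election disabled: take the first assigned replica that is live and in ISR.
    candidates = [r for r in assigned if r in live and r in state.isr]
    if not candidates:
        raise ElectionError("No live replica in the ISR")
    return PartitionState(candidates[0], list(state.isr), state.leader_epoch + 1)


if __name__ == "__main__":
    # State from step 7 of the comment: leader -1, ISR [0], leader epoch 4.
    stuck = PartitionState(leader=NO_LEADER, isr=[0], leader_epoch=4)
    live_brokers = {0, 2}  # assumption: broker 0 has re-registered
    try:
        elect_preferred_leader([2, 0], stuck, live_brokers)
    except ElectionError as err:
        print("preferred election failed:", err)
    print("offline election result:", elect_offline_leader([2, 0], stuck, live_brokers))
```

In this model, every retry of the preferred election hits the same failure because neither the ISR nor the leader changes between attempts, which matches the repeating log cycle reported above.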