Hi all,

We had an outage last week that I think we could have prevented, and I'd like some feedback on the idea.
tl;dr: When a partition leader writes an updated ISR, it should also record its current log-end-offset. On leader election, if there are no live replicas in the ISR, a replica with that same log-end-offset should be preferred before unclean leader election is considered.

Details and use case:

We have a 5-node Kafka 1.0.0 cluster (since upgraded to 1.1.0) with unclean leader election disabled. Well-configured topics have replication factor 3 and min.insync.replicas 2, with producers setting acks=all.

On Monday our cloud provider suffered a hardware failure, causing a partial outage of network connectivity to disk storage. Broker 5's storage was on the orphaned side of the network partition. At the very start of the incident, broker 5 dropped the followers on brokers 1 and 4 out of the ISR for the partitions it was leading. Its connections to brokers 2 and 3 and to ZooKeeper stayed up, including to the controller on broker 3. Broker 5 went offline entirely a few moments later and stayed down, with its disk state inaccessible, for several hours.

We had configured multiple partitions with broker 5 as their leader and followers on brokers 1 and 4. Before the incident those partitions had ISR {5,1,4}, which shrank to {5} before broker 5 disappeared, leaving us with no eligible replicas to become leader. The only ways to bring those partitions back up were to recover broker 5's up-to-date disk state or to enable unclean leader election. Had we lost one follower, then the other, and then the leader, enabling unclean leader election would have carried a 50% risk of message loss. In the end we decided the lowest-risk option was to enable unclean leader election on the affected topics, force a controller election, watch the partitions come back up, and then disable unclean election again.

I think there's a safer recovery path that Kafka could support: the leader should also record its current log-end-offset when it writes an updated ISR.
If the controller determines that it can't promote a replica from the ISR, it should next look for a live replica whose log-end-offset matches the recorded value. Only if that step also fails should it consider unclean leader election. For our failure case, at least, this would have allowed a clean, automatic recovery.

Has this idea been considered before? Does it have fatal flaws?

Thanks,

--
Jack Foy <j...@hiya.com>
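P.S. To make the proposed election order concrete, here's a rough sketch of the decision logic. This is illustrative Java, not actual Kafka internals; all names and signatures are mine, and I'm assuming the leader's log-end-offset recorded at the last ISR write is available to the controller:

```java
import java.util.*;

// Hypothetical sketch of the proposed leader-election fallback.
// Order: (1) first live assigned replica in the ISR; (2) first live
// replica whose log-end-offset equals the offset the leader recorded
// when it last wrote the ISR; (3) otherwise, unclean election territory.
class LeaderElector {
    enum Outcome { CLEAN_FROM_ISR, CLEAN_FROM_LEO_MATCH, NEEDS_UNCLEAN }

    static final class Result {
        final Outcome outcome;
        final OptionalInt leader; // broker id; empty if NEEDS_UNCLEAN
        Result(Outcome o, OptionalInt l) { outcome = o; leader = l; }
    }

    static Result electLeader(List<Integer> assignedReplicas,
                              Set<Integer> isr,
                              Set<Integer> liveBrokers,
                              long recordedLeaderLeo,
                              Map<Integer, Long> replicaLeo) {
        // 1. Normal clean election: a live replica still in the ISR.
        for (int b : assignedReplicas)
            if (isr.contains(b) && liveBrokers.contains(b))
                return new Result(Outcome.CLEAN_FROM_ISR, OptionalInt.of(b));
        // 2. Proposed fallback: a live replica whose LEO matches the
        //    log-end-offset recorded alongside the last ISR write.
        for (int b : assignedReplicas)
            if (liveBrokers.contains(b)
                    && replicaLeo.getOrDefault(b, -1L) == recordedLeaderLeo)
                return new Result(Outcome.CLEAN_FROM_LEO_MATCH, OptionalInt.of(b));
        // 3. Only now would unclean election (or operator action) apply.
        return new Result(Outcome.NEEDS_UNCLEAN, OptionalInt.empty());
    }
}
```

In our incident this would have found broker 1 or 4 at step 2 (whichever had caught up to the recorded offset before being dropped from the ISR) and recovered cleanly without any unclean election.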