Hi all,

We had an outage last week that I think we could have prevented, and I'd
like to get some feedback on the idea.

tl;dr:

When a partition leader writes an updated ISR, it should also record its current
log-end-offset. On leader election, if there are no live replicas in
the ISR, then a
replica with this same log-end-offset should be preferred before considering
unclean leader election.

Details and use case:

We have a 5-node Kafka 1.0.0 cluster (since upgraded to 1.1.0) with unclean
leader election disabled. Well-configured topics have replication factor 3 and
min.insync.replicas 2, with producers setting acks=all.

On Monday our cloud provider suffered hardware failure, causing a partial
outage on network connectivity to disk storage. Broker 5's storage was on the
orphaned side of the network partition.

At the very start of the incident, broker 5 dropped all followers on brokers 1
and 4 out of the ISR for partitions it was leading. Its connections to brokers
2 and 3 and to Zookeeper stayed up, including to the controller on broker 3.
Broker 5 went offline entirely a few moments later, and stayed down with disk
state inaccessible for several hours.

We had configured multiple partitions with broker 5 as their leader and
followers on brokers 1 and 4. Before the incident those partitions had ISR
{5,1,4}, which shrank to {5} before broker 5 disappeared - leaving us with no
eligible replicas to become leader.

The only ways to bring these partitions back up were to either recover broker
5's up-to-date disk state, or to enable unclean leader election. Had we lost
one follower, then the other, and then the leader, enabling unclean leader
election would have carried 50% risk of message loss.

In the end, we decided that the lowest-risk option was to enable unclean leader
election on the affected topics, force a controller election, watch the
partitions come back up, and disable unclean election.

I think there's a safer recovery path that Kafka could support:

The leader should also record its current log-end-offset when it writes an
updated ISR. If the controller determines that it can't promote a replica from
the ISR, it should next look for a replica that has that same log-end-offset.
Only if that step also fails should it then consider unclean leader election.

For our failure case, at least, this would have allowed a clean and automatic
recovery. Has this idea been considered before? Does it have fatal flaws?

Thanks,

--
Jack Foy <j...@hiya.com>

Reply via email to