Hey! I’m seeing behaviour I didn’t expect, and I’d like some expert opinions on 
it.

(TL;DR: min.insync.replicas=2 and acks=all aren’t enough to tolerate losing a 
leader transparently and safely, but maybe they could be)

Let’s say we have a topic with one partition, replication factor 3.
Leader 0, Followers 1,2
I have min.insync.replicas=2, and my producer has acks=all

Ok, so let’s say there’s an incident, and Broker 2 dies (or is unable to 
replicate).
That’s still fine, and the producer keeps going.

Then Broker 1 dies (or is unable to replicate).
Now the ISR list will just be Broker 0, and production will stop because of 
min.insync.replicas and acks=all.

Now, Broker 0 dies.
The partition is now offline, which is expected.
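
To make the sequence above concrete, here’s a rough toy model in Python. It’s only a sketch of my understanding, not Kafka code: the Partition class and its method names are made up for illustration, and it only models acks=all producers.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Partition:
    """Toy model of one partition (hypothetical, not a Kafka API)."""
    leader: Optional[int]
    isr: Set[int]
    min_insync_replicas: int = 2

    def accepts_acks_all(self) -> bool:
        # acks=all produces need a live leader and enough in-sync replicas.
        return self.leader is not None and len(self.isr) >= self.min_insync_replicas

    def broker_down(self, broker: int) -> None:
        if broker != self.leader:
            self.isr.discard(broker)  # a follower falls out of the ISR
            return
        survivors = self.isr - {broker}
        if survivors:
            self.leader = min(survivors)  # clean election from the ISR
            self.isr = survivors
        else:
            self.leader = None  # offline; the last ISR member stays listed


p = Partition(leader=0, isr={0, 1, 2})
p.broker_down(2)
print(p.accepts_acks_all())  # True: ISR is {0, 1}
p.broker_down(1)
print(p.accepts_acks_all())  # False: ISR is {0}
p.broker_down(0)
print(p.leader, p.isr)       # None {0}
```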

Let’s say Broker 1 comes back before Broker 0 (or maybe Broker 0 is never 
coming back).
There’s no reason now why Broker 1 can’t assume leadership. Due to acks=all 
nothing was produced while it was gone, but it currently can’t know that. When 
it comes back, it sees it’s not in the list of ISRs and assumes it may have 
missed something.

The only way to get out of this situation is to enable unclean leader 
election (unclean.leader.election.enable=true). In this case, if Broker 1 
assumed leadership that’d be fine, but if Broker 2 assumed leadership then 
there’d be a consumer offset reset, since Broker 1 has messages Broker 2 doesn’t.
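
Here’s a small Python sketch of why the outcome depends on which broker wins. The log contents and the lost_on_election helper are made up for illustration, and it assumes every broker’s log is a prefix of the longest one.

```python
from typing import Dict, List


def lost_on_election(logs: Dict[int, List[str]], new_leader: int) -> List[str]:
    """Messages truncated away if new_leader wins (hypothetical helper).

    Assumes every log is a prefix of the longest log.
    """
    longest = max(logs.values(), key=len)
    return longest[len(logs[new_leader]):]


# Broker 2 stopped replicating after m2. Broker 1 kept up until production
# stopped, so its log matches Broker 0's.
logs = {
    0: ["m0", "m1", "m2", "m3", "m4"],
    1: ["m0", "m1", "m2", "m3", "m4"],
    2: ["m0", "m1", "m2"],
}
print(lost_on_election(logs, new_leader=1))  # []
print(lost_on_election(logs, new_leader=2))  # ['m3', 'm4']
```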

So that’s the setup.
So, really all that min.insync.replicas=2 gets me is less Availability earlier, 
and if Broker 0 comes back, then consistency.
But if it doesn’t, I might lose consistency anyway.
Let me know if I’m wrong about something in there.

Proposal:
What we need to track is whether a broker is missing messages. We obviously 
don’t want to spam ZK each time we get a message, and we don’t want to change 
what ISR means because so many things (including acks=all) rely on that.
So what if we added (for each partition) a list of “out of sync” replicas (OSRs)?
Basically when a broker gets removed from the ISR list, the first time (and 
only the first time) a produce is accepted after that, it’s added to the OSRs.

That way if Broker 1 comes back it can see that Broker 0 was the last leader, 
but that Broker 0 never accepted any messages Broker 1 doesn’t already have 
before it died. So Broker 1 is now safe to take leadership without having to 
enable unclean leader election, and can start syncing to Broker 2. Once Broker 2 
has caught up on what it missed, we’re back within min.insync.replicas and 
things become available again.
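
Here’s a rough Python sketch of the bookkeeping I have in mind. Everything in it is hypothetical (the class, the osr field, the metadata_writes counter standing in for ZK updates), and it only models acks=all producers, so treat it as an illustration of the idea rather than a design.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ProposedPartitionState:
    """Hypothetical state for the OSR idea; not an existing Kafka structure."""
    replicas: Set[int]
    isr: Set[int]
    min_insync_replicas: int = 2
    osr: Set[int] = field(default_factory=set)  # proposed "out of sync" list
    metadata_writes: int = 0                    # stand-in for ZK updates

    def remove_from_isr(self, broker: int) -> None:
        self.isr.discard(broker)  # leaving the ISR alone does not mark a broker OSR

    def produce_acks_all(self) -> bool:
        if len(self.isr) < self.min_insync_replicas:
            return False  # rejected, so nobody falls further behind
        newly_behind = self.replicas - self.isr - self.osr
        if newly_behind:
            # Only the first accepted produce after a removal costs a write.
            self.osr |= newly_behind
            self.metadata_writes += 1
        return True

    def may_lead_cleanly(self, broker: int) -> bool:
        return broker in self.replicas and broker not in self.osr


s = ProposedPartitionState(replicas={0, 1, 2}, isr={0, 1, 2})
s.remove_from_isr(2)
s.produce_acks_all()   # accepted: Broker 2 is now known to be missing data
s.produce_acks_all()   # accepted: no further metadata write needed
s.remove_from_isr(1)
s.produce_acks_all()   # rejected: ISR is below min.insync.replicas
print(s.may_lead_cleanly(1), s.may_lead_cleanly(2), s.metadata_writes)  # True False 1
```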

There are still some edge cases in here.
Like maybe instead of having things be implicit (“I’m not in this list, so I 
must be ok”) we’d have explicit ISR and UpToDateR lists, and only take a broker 
out of the UpToDateRs when we receive a message while it’s not in the ISR.

Or like what if Broker 1 suffered corruption or data loss? It probably 
shouldn’t be able to just start up and lead, so maybe instead of a broker list 
there’s a kind of high-water mark we set to the current offset when we remove a 
broker from the ISR, and then clear when we get a new message.
So they can see on startup “I’m in sync if I have THIS message, which I do, so 
I’m the leader now”
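
A rough Python sketch of that marker variant, again with made-up names (SyncMarkers and its methods are hypothetical). It assumes offsets are plain integers and that a broker can report the last offset it has on disk at startup.

```python
from typing import Dict


class SyncMarkers:
    """Hypothetical per-partition markers; not an existing Kafka structure."""

    def __init__(self) -> None:
        self._markers: Dict[int, int] = {}

    def on_isr_removal(self, broker: int, leader_last_offset: int) -> None:
        # Remember the last offset the leader had when the broker left the ISR.
        self._markers[broker] = leader_last_offset

    def on_accepted_produce(self) -> None:
        # Any new message means every broker outside the ISR is now behind.
        self._markers.clear()

    def may_lead_cleanly(self, broker: int, local_last_offset: int) -> bool:
        marker = self._markers.get(broker)
        return marker is not None and local_last_offset >= marker


m = SyncMarkers()
m.on_isr_removal(2, leader_last_offset=2)
m.on_accepted_produce()                    # Broker 2's marker is cleared
m.on_isr_removal(1, leader_last_offset=4)  # nothing is accepted after this
print(m.may_lead_cleanly(1, local_last_offset=4))  # True
print(m.may_lead_cleanly(1, local_last_offset=1))  # False: Broker 1 lost data
print(m.may_lead_cleanly(2, local_last_offset=2))  # False: marker was cleared
```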

Basically I just wanted to know if this is ridiculous, or if I’m 
misunderstanding, or if I should make a KIP or what happens here.
Because right now it feels like setting min.insync.replicas=2 doesn’t actually 
give me much in a scenario like I outlined above.
With min.insync.replicas=1 producers would think they succeeded, and then the 
messages would be lost if Broker 0 went offline.
But if I need to enable unclean leader election anyway to recover, then 
those messages might be lost anyway, no?

Thanks for reading!
