Re: [DISCUSS] KIP-501 Avoid out-of-sync or offline partitions when follower fetch requests are not processed in time

Satish Duggana Sun, 27 Jun 2021 04:01:36 -0700

Hi Dhruvil,
Thanks for looking into the KIP and providing your comments.

There are two problems in the scenario raised in this KIP:


a) The leader is slow and is not available for reads or writes.
b) The leader causes the followers to fall out of sync, which makes
the partitions unavailable.

(a) should be detected and mitigated so that the broker can either
serve as a leader again or be replaced with a different node if it
continues to have issues.

(b) will cause the partition to go under minimum ISR and eventually
take that partition offline if the leader goes down. In this case,
users have to enable unclean leader election to make the partition
available again. This may cause data loss, depending on the replica
chosen as the leader. This is what several folks (including us) have
observed in their production environments.
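
To make the sequence in (b) concrete, here is a minimal toy model. It
is only a sketch, not Kafka code: the Partition class and its methods
are hypothetical, and the two flags merely stand in for the
min.insync.replicas and unclean.leader.election.enable settings.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical toy model of one partition's replication state."""
    leader: int
    replicas: list
    isr: set
    min_isr: int = 2                 # stands in for min.insync.replicas
    unclean_election: bool = False   # stands in for unclean.leader.election.enable

    def shrink_isr(self, lagging):
        # The leader drops followers it considers lagging; it never drops itself.
        self.isr -= set(lagging) - {self.leader}

    def under_min_isr(self):
        return len(self.isr) < self.min_isr

    def leader_fails(self):
        # Simplification: the failed leader leaves the ISR immediately.
        failed = self.leader
        self.isr.discard(failed)
        candidates = sorted(self.isr)
        if not candidates and self.unclean_election:
            # Any surviving replica may be chosen, so data loss is possible.
            candidates = sorted(r for r in self.replicas if r != failed)
        self.leader = candidates[0] if candidates else None
        return self.leader  # None means the partition is offline


# A slow leader (broker 1) fails to serve fetches, so it removes 2 and 3.
clean = Partition(leader=1, replicas=[1, 2, 3], isr={1, 2, 3})
clean.shrink_isr([2, 3])
below_min = clean.under_min_isr()      # True
clean_result = clean.leader_fails()    # None: partition goes offline

unclean = Partition(leader=1, replicas=[1, 2, 3], isr={1, 2, 3},
                    unclean_election=True)
unclean.shrink_isr([2, 3])
unclean_result = unclean.leader_fails()  # 2: available, but may lose data
```

With unclean election disabled, the partition has no eligible leader
once broker 1 goes down; enabling it restores availability at the
cost of possibly electing a replica that is missing data.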

Solution (1) in the KIP addresses (b): it avoids offline partitions by
not removing the replicas from the ISR. This allows the partition to
remain available if leadership moves to one of the other replicas in
the ISR.

Solution (2) in the KIP extends solution (1) by relinquishing the
leadership and allowing one of the other in-sync replicas to become
the leader.
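
The difference between the two solutions can be sketched as a small
decision function. This is an illustration only: next_state, the
leader_is_slow flag, and the relinquish flag are hypothetical names,
how the leader detects its own slowness is left out, and keeping the
old leader in the ISR after it steps down is an assumption.

```python
def next_state(leader, isr, lagging, leader_is_slow, relinquish=False):
    """Return (new_leader, new_isr) for one ISR evaluation round.

    Hypothetical sketch; not the KIP's actual implementation.
    """
    if not leader_is_slow:
        # Ordinary behaviour: followers that really lag are removed.
        return leader, set(isr) - (set(lagging) - {leader})

    # Solution (1): the delay is the leader's fault, so keep the ISR intact.
    new_isr = set(isr)
    new_leader = leader

    # Solution (2): additionally hand leadership to another ISR member.
    if relinquish:
        others = sorted(new_isr - {leader})
        if others:
            new_leader = others[0]
    return new_leader, new_isr


shrunk = next_state(1, {1, 2, 3}, {2, 3}, leader_is_slow=False)
kept = next_state(1, {1, 2, 3}, {2, 3}, leader_is_slow=True)
moved = next_state(1, {1, 2, 3}, {2, 3}, leader_is_slow=True, relinquish=True)
```

In this sketch, shrunk leaves only broker 1 in the ISR, kept leaves
the ISR untouched with broker 1 still leading, and moved leaves the
ISR untouched while broker 2 takes over leadership.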

~Satish.
