[ https://issues.apache.org/jira/browse/KAFKA-10853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249376#comment-17249376 ]
Guozhang Wang commented on KAFKA-10853:
---------------------------------------

[~ambroff] I'm wondering: in this case, where the admin determines that L will not recover from the failure any time soon and hence needs to be replaced, some manual intervention by the admin is needed anyway. At the very least he/she needs to shut down the broker to initiate the leader migration. Could we then just issue an admin request enforcing a leader migration (one not restricted by the unclean.leader.election config) instead of letting Kafka reassign leadership itself?

> Replication protocol deficiencies with workloads requiring high durability guarantees
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10853
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10853
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Kyle Ambroff-Kao
>            Priority: Major
>
> *tl;dr: The definition of ISR and the consistency model from the perspective of the producer seem a bit out of sync.*
>
> We have many systems in production that trade off availability in order to provide stronger consistency guarantees. Most of these configurations look like this:
>
> Topic configuration:
>  * replication factor 3
>  * min.insync.replicas=2
>  * unclean.leader.election.enable=false
>
> Producer configuration:
>  * acks=all
>
> Broker configuration:
>  * replica.lag.time.max.ms=10000
>
> The goal here is to reduce the chance of ever dropping a message that the leader has acknowledged to the producer.
>
> This works great, except that we've found some situations in production where we are forced to enable unclean leader election to recover, which we never want to do. These situations all seem totally avoidable with some small tweaks to the replication protocol.
>
> *A scenario we've seen many times*
>
> The following sequence of events is in time order. Consider a replica set for a topic-partition TP with leader L and replicas R1 and R2, where all three replicas are in the ISR.
>  # The producer sends ProduceRequest R with acks=all, containing a message batch, to the leader L.
>  # L receives R and appends the batch it contains to the active segment of TP, but does not ack to the producer yet because the request was acks=all.
>  # A storage fault occurs on L which makes all IOPS take a very long time but doesn't cause a hard failure.
>  # R1 and R2 send follower fetch requests to L, which are indefinitely delayed by the storage fault on L.
>  # 10 seconds after appending the batch to the log, L shrinks the ISR, removing R1 and R2. This is because a replica is considered in sync only if it is at most replica.lag.time.max.ms milliseconds behind the log append time of the leader end offset, and here the leader end offset is a message that has not been replicated yet.
>
> The storage fault in step 3 could easily be another kind of fault. Say, for example, that L is partitioned from R1 and R2 but not from ZooKeeper or the producer.
>
> The producer never receives acknowledgement of the ProduceRequest because the min.insync.replicas constraint was never satisfied. So in terms of data consistency, everything is working fine.
>
> The problem is recovering from this situation. If the fault on L is not a temporary blip, then L needs to be replaced. But since L shrank the ISR, the only way that leadership can move to either R1 or R2 is to set unclean.leader.election.enable=true.
>
> This works, but it is a potentially unsafe way to recover and move leadership. It would be better to have other options.
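>
> One concrete shape for such an option (a sketch, assuming the electLeaders admin API from KIP-460 that shipped in 2.4): an operator could trigger an unclean election for just the affected partition, rather than flipping unclean.leader.election.enable for the whole topic. The bootstrap address, topic name, and partition below are placeholders:
>
> {code:java}
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.common.ElectionType;
> import org.apache.kafka.common.TopicPartition;
>
> public class ForceLeaderElection {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "broker-1:9092"); // placeholder address
>
>         try (AdminClient admin = AdminClient.create(props)) {
>             // Elect a leader for TP from outside the current ISR. Unlike
>             // setting unclean.leader.election.enable=true on the topic,
>             // this is scoped to one partition and happens only when the
>             // operator explicitly asks for it.
>             admin.electLeaders(
>                     ElectionType.UNCLEAN,
>                     Collections.singleton(new TopicPartition("tp", 0)))
>                  .all()
>                  .get();
>         }
>     }
> }
> {code}
>
> This is still an unclean election, so in general it can lose acknowledged data; the argument in this scenario is that here it would not, because nothing beyond the followers' offsets was ever acknowledged.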
> *Recovery could be automatic in this scenario.*
>
> If you think about it, from the perspective of the producer the write was never acknowledged, and therefore L, R1 and R2 are actually in sync. So it should be totally safe for leadership to transition to either R1 or R2.
>
> It seems that the producer and the leader don't have fully compatible definitions of what it means for the replica set to be in sync. If the leader L used different rules for defining the ISR, it could allow self-healing in this and similar scenarios, since the ISR would not shrink.
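>
> As a sketch of what such a different rule might look like (illustrative pseudocode only, not Kafka's actual replica-management code; all names below are invented for the example): measure follower lag against the high watermark, i.e. data actually acknowledged to producers, rather than against the leader's log end offset.
>
> {code:java}
> // Hypothetical ISR-shrink check; all names are illustrative.
> interface FollowerView {
>     // Last time this follower was caught up to the leader's log end offset.
>     long lastCaughtUpToLogEndOffsetMs();
>     // Last time this follower was caught up to the high watermark.
>     long lastCaughtUpToHighWatermarkMs();
> }
>
> class IsrShrinkRule {
>     // Current rule (simplified): shrink if the follower has not reached the
>     // leader's log END offset within replica.lag.time.max.ms. An appended but
>     // unacknowledged batch on a faulty leader can therefore push healthy
>     // followers out of the ISR.
>     static boolean currentRule(FollowerView f, long nowMs, long maxLagMs) {
>         return nowMs - f.lastCaughtUpToLogEndOffsetMs() > maxLagMs;
>     }
>
>     // Sketched alternative: shrink only if the follower lags behind data the
>     // producer could have seen acknowledged. In the scenario above, R1 and R2
>     // are caught up to the high watermark, so the ISR would not shrink and
>     // leadership could move to either of them cleanly.
>     static boolean alternativeRule(FollowerView f, long nowMs, long maxLagMs) {
>         return nowMs - f.lastCaughtUpToHighWatermarkMs() > maxLagMs;
>     }
> }
> {code}
>
> (A rule like this raises its own questions, e.g. a genuinely dead follower would then hold the high watermark back indefinitely, so it is only meant to make the idea concrete.)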