Jason Gustafson created KAFKA-6880:
--------------------------------------
Summary: Zombie replicas must be fenced
Key: KAFKA-6880
URL: https://issues.apache.org/jira/browse/KAFKA-6880
Project: Kafka
Issue Type: Improvement
Reporter: Jason Gustafson
Assignee: Jason Gustafson
Let's say we have three replicas for a partition: 1, 2, and 3.
In epoch 0, broker 1 is the leader and writes up to offset 50. Broker 2
replicates up to offset 50, but broker 3 is a little behind at offset 40. The
high watermark is 40.
Suppose that broker 2's zk session expires, but it fails to detect the
expiration or fails to reestablish a session (e.g. due to a bug like
KAFKA-6879), and it continues fetching from broker 1.
For whatever reason, broker 3 is elected the leader for epoch 1 beginning at
offset 40. Broker 1 detects the leader change and truncates its log to offset
40. Some new data is appended up to offset 60, which is fully replicated to
broker 1. Broker 2 continues fetching from broker 1 at offset 50, but gets
NOT_LEADER_FOR_PARTITION errors.
After some time, broker 1 becomes the leader again for epoch 2. Broker 1 begins
at offset 60. Broker 2 is now able to fetch at offset 50 and append the last 10
records in order to catch up. However, because it did not observe the leader
changes, it never saw the need to truncate its log. Hence offsets 40-49 still
reflect the uncommitted changes from epoch 0. Neither KIP-101 nor KIP-279 can
fix this because the tail of the log is correct.
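To make the divergence concrete, the per-offset leader epochs after broker 2
catches up look roughly like this (reconstructed from the steps above):

{noformat}
broker 1 (leader): 0-39 -> epoch 0 | 40-59 -> epoch 1
broker 2 (zombie): 0-39 -> epoch 0 | 40-49 -> epoch 0 (stale) | 50-59 -> epoch 1
{noformat}

Both logs end with the same epoch 1 records, so an epoch-based truncation check
sees a matching tail and finds nothing to repair, while offsets 40-49 silently
disagree.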
The basic problem is that zombie replicas are not fenced properly by the leader
epoch. We actually observed a sequence roughly like this after a broker had
partially deadlocked from KAFKA-6879. We should add the leader epoch to fetch
requests so that the leader can fence the zombie replicas.
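For illustration, here is a minimal sketch of the leader-side check, assuming
a hypothetical epoch field in the fetch request and hypothetical error codes
(the real fetch handling path is of course more involved):

{code:java}
// Sketch only: the follower attaches the leader epoch it believes is current,
// and the leader rejects any fetch whose epoch does not match its own.
public class FetchEpochValidator {

    enum Error { NONE, FENCED_LEADER_EPOCH, UNKNOWN_LEADER_EPOCH }

    private final int currentLeaderEpoch;

    FetchEpochValidator(int currentLeaderEpoch) {
        this.currentLeaderEpoch = currentLeaderEpoch;
    }

    Error validate(int fetcherEpoch) {
        if (fetcherEpoch < currentLeaderEpoch)
            return Error.FENCED_LEADER_EPOCH;   // zombie: missed a leader change
        if (fetcherEpoch > currentLeaderEpoch)
            return Error.UNKNOWN_LEADER_EPOCH;  // this broker's own state is stale
        return Error.NONE;
    }

    public static void main(String[] args) {
        // Epoch 2 from the scenario: broker 1 leads again, while broker 2
        // still believes epoch 0.
        FetchEpochValidator leader = new FetchEpochValidator(2);
        System.out.println(leader.validate(0)); // FENCED_LEADER_EPOCH
        System.out.println(leader.validate(2)); // NONE
    }
}
{code}

With a check like this, broker 2's epoch 0 fetch in epoch 2 would be rejected
instead of silently served, forcing it to refresh metadata and reconcile its
log before fetching again.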
A related problem is that we currently allow such zombie replicas to be added
to the ISR even if they are in an offline state. The problem is that the
controller will never elect them, so being part of the ISR does not provide
the intended availability guarantee. This would also be fixed by verifying the
replica's leader epoch in fetch requests.
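Along the same lines, a rough sketch of an epoch-aware ISR expansion check
(hypothetical model and names, not the actual broker code):

{code:java}
import java.util.HashSet;
import java.util.Set;

// Sketch only: a replica whose last fetch carried a stale epoch is treated as
// a potential zombie and never re-added to the ISR, regardless of its offset.
public class IsrExpansion {

    private final int currentLeaderEpoch;
    private final long highWatermark;
    private final Set<Integer> isr = new HashSet<>();

    IsrExpansion(int currentLeaderEpoch, long highWatermark) {
        this.currentLeaderEpoch = currentLeaderEpoch;
        this.highWatermark = highWatermark;
    }

    /** State observed from a follower's most recent fetch request. */
    record FollowerState(int brokerId, int lastFetchEpoch, long logEndOffset) {}

    boolean maybeExpandIsr(FollowerState follower) {
        // A stale epoch marks a potential zombie (e.g. an expired zk session
        // it never noticed); keeping it out preserves the ISR's meaning.
        if (follower.lastFetchEpoch() != currentLeaderEpoch)
            return false;
        // The usual catch-up condition still applies.
        if (follower.logEndOffset() < highWatermark)
            return false;
        return isr.add(follower.brokerId());
    }

    public static void main(String[] args) {
        IsrExpansion leader = new IsrExpansion(2, 60L);
        // Broker 2 has caught up to offset 60 but last fetched with epoch 0:
        System.out.println(leader.maybeExpandIsr(new FollowerState(2, 0, 60L))); // false
        System.out.println(leader.maybeExpandIsr(new FollowerState(2, 2, 60L))); // true
    }
}
{code}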
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)