Jason Gustafson created KAFKA-6880:
--------------------------------------

             Summary: Zombie replicas must be fenced
                 Key: KAFKA-6880
                 URL: https://issues.apache.org/jira/browse/KAFKA-6880
             Project: Kafka
          Issue Type: Improvement
            Reporter: Jason Gustafson
            Assignee: Jason Gustafson


Let's say we have three replicas for a partition: 1, 2 ,and 3.

In epoch 0, broker 1 is the leader and writes up to offset 50. Broker 2 
replicates up to offset 50, but broker 3 is a little behind at offset 40. The 
high watermark is 40. 

Suppose that broker 2 has a zk session expiration event, but fails to detect it 
or fails to reestablish a session (e.g. due to a bug like KAFKA-6879), and it 
continues fetching from broker 1.

For whatever reason, broker 3 is elected the leader for epoch 1 beginning at 
offset 40. Broker 1 detects the leader change and truncates its log to offset 
40. Some new data is appended up to offset 60, which is fully replicated to 
broker 2. Broker 2 continues fetching from broker 1 at offset 50, but gets 
NOT_LEADER_FOR_PARTITION errors.

After some time, broker 1 becomes the leader again for epoch 2. Broker 1 begins 
at offset 60. Broker 2 is now able to fetch at offset 50 and append the last 10 
records in order to catch up. However, because it did not observed the leader 
changes, it never saw the need to truncate its log. Hence offsets 40-49 still 
reflect the uncommitted changes from epoch 0. Neither KIP-101 nor KIP-279 can 
fix this because the tail of the log is correct.

The basic problem is that zombie replicas are not fenced properly by the leader 
epoch. We actually observed a sequence roughly like this after a broker had 
partially deadlocked from KAFKA-6879. We should add this to fetch requests so 
that the leader can fence the zombie replicas.

A related problem is that we currently allow such zombie replicas to be added 
to the ISR even if they are in an offline state. The problem is that the 
controller will never elect them, so being part of the ISR does not give the 
guarantee availability that is intended. This would also be fixed by verifying 
replica leader epoch in fetch requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to