[jira] [Created] (KAFKA-4414) Unexpected "Halting because log truncation is not allowed"

Meyer Kizner (JIRA) Tue, 15 Nov 2016 15:22:56 -0800

Meyer Kizner created KAFKA-4414:
-----------------------------------

             Summary: Unexpected "Halting because log truncation is not allowed"
                 Key: KAFKA-4414
                 URL: https://issues.apache.org/jira/browse/KAFKA-4414
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.9.0.1
            Reporter: Meyer Kizner



Our Kafka installation runs with unclean leader election disabled, so a 
follower halts when it finds that its log end offset for a topic partition is 
ahead of the leader's. Two of our brokers halted today for this reason. After 
much time spent digging through the logs, I believe the following timeline is 
a plausible explanation of what occurred.

* B1, B2, and B3 are replicas of a topic, all in the ISR. B2 is currently the 
leader, but B1 is the preferred leader. The controller runs on B3.
* B1 fails, but the controller does not detect the failure immediately.
* B2 receives a message from a producer and B3 fetches it to stay up to date. 
B2 has not yet committed the message (it is still pending), because B1 is 
still in the ISR but is down and so has not acknowledged it.
* The controller triggers a preferred leader election, making B1 the leader, 
and notifies all replicas.
* Very shortly afterwards (~200ms), B1's broker registration in ZooKeeper 
expires, so the controller reassigns B2 to be leader again and notifies all 
replicas.
* Because B3 is the controller, while B2 is on another box, B3 hears about both 
of these events before B2 hears about either. B3 truncates its log to the high 
water mark (before the pending message) and resumes fetching from B2.
* B3 fetches the pending message from B2 again.
* B2 learns that it has been displaced and then reelected, and truncates its 
log to the high water mark, before the pending message.
* The next time B3 tries to fetch from B2, it sees that B2 is missing the 
pending message and halts.
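
The timeline above can be replayed with a toy model. This is only a minimal 
sketch, not Kafka code: the Replica class, its methods, and the message names 
are hypothetical. It assumes that a replica truncates its log to its own high 
water mark when it processes a leader change, and that a follower halts when 
its log end offset is ahead of the leader's.

```python
from dataclasses import dataclass, field


class LogTruncationNotAllowed(Exception):
    """Stands in for the broker halting (unclean leader election disabled)."""


@dataclass
class Replica:
    # Hypothetical model of one replica's log; not Kafka's actual classes.
    name: str
    log: list = field(default_factory=list)
    high_watermark: int = 0

    @property
    def log_end_offset(self):
        return len(self.log)

    def truncate_to_high_watermark(self):
        # Assumed behaviour when a replica processes a leader change.
        del self.log[self.high_watermark:]

    def fetch_from(self, leader):
        # A follower that is ahead of its leader would have to truncate
        # committed-looking data, so it halts instead.
        if self.log_end_offset > leader.log_end_offset:
            raise LogTruncationNotAllowed(
                f"{self.name} is at offset {self.log_end_offset}, "
                f"leader {leader.name} is at {leader.log_end_offset}"
            )
        self.log.extend(leader.log[self.log_end_offset:])


def replay_timeline():
    # B1 is already down, so only B2 (leader) and B3 (controller) are modelled.
    b2 = Replica("B2", ["m0"], high_watermark=1)
    b3 = Replica("B3", ["m0"], high_watermark=1)

    b2.log.append("pending")         # producer write; B1 never acks it
    b3.fetch_from(b2)                # B3 replicates the pending message

    b3.truncate_to_high_watermark()  # B3 hears about both leader changes first
    b3.fetch_from(b2)                # B3 fetches the pending message again

    b2.truncate_to_high_watermark()  # B2 hears late and drops the message

    try:
        b3.fetch_from(b2)            # B3 is now ahead of its leader
    except LogTruncationNotAllowed as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(replay_timeline())
```

In this model the halt depends only on B3 processing both leader changes and 
re-fetching before B2 processes either; if B2 truncated first, B3's second 
fetch would simply find nothing new.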

In this case, there was no data loss or inconsistency. I haven't fully thought 
through whether either is possible in general, but both seem likely, 
especially if multiple producers had been writing to this topic.

I'm not completely certain about this timeline, but this sequence of events 
appears to at least be possible. Looking a bit through the controller code, 
there doesn't seem to be anything that forces {{LeaderAndIsrRequest}}s to be 
sent in a particular order. If someone with more knowledge of the code base 
believes this is incorrect, I'd be happy to post the logs and/or do some more 
digging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
