[
https://issues.apache.org/jira/browse/KAFKA-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404404#comment-15404404
]
Jun Rao commented on KAFKA-1211:
--------------------------------
[~fpj], for #1 and #2, there are a couple of scenarios that this proposal can fix.
a. The first one is what's described in the original jira. Currently, when the
follower truncates its log, it can truncate some previously committed messages.
If the follower then immediately becomes the leader, those previously committed
messages are lost. This is rare, but when it happens, it's bad. The proposal
fixes this case by preventing the follower from unnecessarily truncating
previously committed messages.
b. Another issue is that a portion of the log may not match across replicas in
certain failure cases. This can happen when unclean leader election is enabled.
However, even with unclean leader election disabled, a mismatch can still occur
when messages are lost due to a power outage (see KAFKA-3919). The proposal
fixes this issue by making sure that the replicas are always identical (see the
sketch below).
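To make a. and b. concrete, here is a rough sketch in Java (not actual Kafka
code; the Leader interface, method names, and bookkeeping here are hypothetical)
of how a follower could use the leader generation to pick a truncation point,
instead of unconditionally discarding everything back to its high watermark:
{code:java}
import java.util.NavigableMap;
import java.util.TreeMap;

class GenerationAwareTruncation {

    interface Leader {
        // The leader's end offset for the given generation: the start offset of
        // the next generation the leader knows about, or its log end offset.
        long endOffsetForGeneration(int generation);
    }

    // First offset of each leader generation, as recorded locally on this follower.
    private final NavigableMap<Integer, Long> generationStartOffsets = new TreeMap<>();
    private long logEndOffset;

    // Truncate only the part of the local log that the new leader does not have
    // for our latest generation, rather than truncating to the high watermark.
    long truncationOffset(Leader leader) {
        int latestGeneration = generationStartOffsets.lastKey();
        long leaderEndOffset = leader.endOffsetForGeneration(latestGeneration);
        return Math.min(logEndOffset, leaderEndOffset);
    }
}
{code}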
For #3, the controller increases the leader generation every time the leader
changes. The latest leader generation is persisted in ZK.
For #4, putting the leader generation in the segment file name is another
possibility. One concern I had with that approach is dealing with compacted
topics. After compaction, it's possible that only a small number of messages
(or even just a single message) is left in a particular generation. Putting the
generation id in the segment file name would then force us to have tiny
segments, which is not ideal. As for the race condition, we can avoid it even
with a separate checkpoint file. The sequencing will be (see the sketch below):
(1) broker receives a LeaderAndIsrRequest to become leader;
(2) broker stops fetching from the current leader;
(3) no new writes can happen to this replica at this point;
(4) broker writes the new leader generation and log end offset to the checkpoint file;
(5) broker marks the replica as leader;
(6) new writes can happen to this replica now.
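The same sequencing, expressed as an illustrative Java sketch (all types and
method names below are hypothetical, not actual Kafka code):
{code:java}
class BecomeLeaderSequence {

    interface Fetcher { void stop(); }
    interface Checkpoint { void write(int leaderGeneration, long logEndOffset); }
    interface Partition {
        Fetcher fetcher();
        long logEndOffset();
        void markAsLeader(int leaderGeneration);
    }

    private final Checkpoint checkpoint;

    BecomeLeaderSequence(Checkpoint checkpoint) {
        this.checkpoint = checkpoint;
    }

    // (1) called when the broker receives a LeaderAndIsrRequest to become leader
    void becomeLeader(Partition partition, int newLeaderGeneration) {
        // (2) stop fetching from the current leader
        partition.fetcher().stop();
        // (3) from here on, no new writes can reach this replica: it is no longer
        //     fetching as a follower and is not yet accepting writes as a leader
        // (4) persist the new leader generation and the current log end offset
        checkpoint.write(newLeaderGeneration, partition.logEndOffset());
        // (5) mark the replica as leader
        partition.markAsLeader(newLeaderGeneration);
        // (6) new writes can be accepted for this replica now
    }
}
{code}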
For #5, it depends on who becomes the new leader in that case. If A becomes the
new leader (generation 3), then B and C will remove m1 and m2 and copy m3 and
m4 over from A. If B becomes the new leader, A will remove m3 and m4 and copy
m1 and m2 over from B. In either case, the replicas will be identical.
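A tiny worked model of this example (illustrative Java only; the leader
generation tags on m1..m4 are assumed for the sketch, and the reconciliation is
a simplified stand-in for the actual protocol: the follower keeps its longest
prefix that matches the new leader, truncates the rest, and copies the
remainder from the leader):
{code:java}
import java.util.ArrayList;
import java.util.List;

class ReconcileExample {

    record Entry(int generation, String message) {}

    static List<Entry> reconcile(List<Entry> leaderLog, List<Entry> followerLog) {
        int match = 0;
        while (match < leaderLog.size() && match < followerLog.size()
                && leaderLog.get(match).equals(followerLog.get(match))) {
            match++;
        }
        // Truncate the follower's divergent suffix, then fetch the rest from the leader.
        List<Entry> result = new ArrayList<>(followerLog.subList(0, match));
        result.addAll(leaderLog.subList(match, leaderLog.size()));
        return result;
    }

    public static void main(String[] args) {
        List<Entry> a = List.of(new Entry(2, "m3"), new Entry(2, "m4"));
        List<Entry> b = List.of(new Entry(1, "m1"), new Entry(1, "m2"));

        // If A becomes the leader, B ends up with m3 and m4 only.
        System.out.println(reconcile(a, b));
        // If B becomes the leader, A ends up with m1 and m2 only.
        System.out.println(reconcile(b, a));
    }
}
{code}
In either direction the follower's log ends up identical to the new leader's.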
> Hold the produce request with ack > 1 in purgatory until replicas' HW is
> larger than the produce offset
> --------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-1211
> URL: https://issues.apache.org/jira/browse/KAFKA-1211
> Project: Kafka
> Issue Type: Bug
> Reporter: Guozhang Wang
> Assignee: Guozhang Wang
> Fix For: 0.11.0.0
>
>
> Today during leader failover we have a window of weakness while the
> followers truncate their data before fetching from the new leader, i.e., the
> number of in-sync replicas is just 1. If during this time the leader also
> fails, then produce requests with ack > 1 that have already been responded to
> will still be lost. To avoid this scenario we would prefer to hold the produce
> request in purgatory until the replicas' HW is larger than the produce offset,
> instead of just their log end offsets.
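A minimal sketch of the completion check proposed in the summary above
(illustrative only; the Replica interface and names are hypothetical, not
Kafka's actual classes):
{code:java}
import java.util.List;

class ProduceCompletionCheck {

    interface Replica {
        long highWatermark();
    }

    // The produce request can leave purgatory only once every required replica's
    // HW has advanced past the last produced offset, not merely its log end offset.
    static boolean canComplete(List<Replica> requiredReplicas, long lastProducedOffset) {
        return requiredReplicas.stream()
                .allMatch(r -> r.highWatermark() > lastProducedOffset);
    }
}
{code}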
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)