[ https://issues.apache.org/jira/browse/KAFKA-7415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Gustafson updated KAFKA-7415:
-----------------------------------
    Fix Version/s: 1.1.2

> OffsetsForLeaderEpoch may incorrectly respond with undefined epoch causing truncation to HW
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7415
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7415
>             Project: Kafka
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 2.0.0
>            Reporter: Anna Povzner
>            Assignee: Jason Gustafson
>            Priority: Major
>             Fix For: 1.1.2, 2.0.1, 2.1.0
>
>
> If the follower's last appended epoch is ahead of the leader's last appended
> epoch, the OffsetsForLeaderEpoch response will incorrectly send
> (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET), and the follower will truncate to
> HW. This may lead to data loss in some rare cases where two back-to-back
> leader elections happen (failure of one leader, followed by quick re-election
> of the next leader due to preferred leader election, so that all replicas are
> still in the ISR, and then failure of the 3rd leader).
>
> The bug is in LeaderEpochFileCache.endOffsetFor(), which returns
> (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET) if the requested leader epoch is
> ahead of the last leader epoch in the cache. The method should return (last
> leader epoch in the cache, LEO) in this scenario.
>
> We don't create an entry in the leader epoch cache until a message is
> appended with the new leader epoch. Every append to the log calls
> LeaderEpochFileCache.assign(). However, it would be much cleaner if
> `makeLeader` created an entry in the cache as soon as the replica becomes the
> leader, which would fix the bug. If the leader never appends any messages and
> the next leader epoch starts at the same offset, we already have
> clearAndFlushLatest(), which clears entries with start offsets greater than or
> equal to the passed offset. LeaderEpochFileCache.assign() could be merged
> with clearAndFlushLatest() so that assigning a new epoch also clears cache
> entries with offsets greater than or equal to the new epoch's start offset,
> and the two methods no longer need to be called separately.
>
> Here is an example of a scenario where the issue leads to data loss.
> Suppose we have three replicas: r1, r2, and r3. Initially, the ISR consists
> of (r1, r2, r3) and the leader is r1. The data up to offset 10 has been
> committed to the ISR. Here is the initial state:
> {code:java}
> Leader: r1
> leader epoch: 0
> ISR(r1, r2, r3)
> r1: [hw=10, leo=10]
> r2: [hw=8, leo=10]
> r3: [hw=5, leo=10]
> {code}
> Replica 1 fails and leaves the ISR, which makes Replica 2 the new leader with
> leader epoch = 1. The leader appends a batch, but it is not yet replicated to
> the followers.
> {code:java}
> Leader: r2
> leader epoch: 1
> ISR(r2, r3)
> r1: [hw=10, leo=10]
> r2: [hw=8, leo=11]
> r3: [hw=5, leo=10]
> {code}
> Replica 3 is elected leader (due to preferred leader election), with leader
> epoch 2, before it has a chance to truncate.
> {code:java}
> Leader: r3
> leader epoch: 2
> ISR(r2, r3)
> r1: [hw=10, leo=10]
> r2: [hw=8, leo=11]
> r3: [hw=5, leo=10]
> {code}
> Replica 2 sends OffsetsForLeaderEpoch(leader epoch = 1) to Replica 3. Replica
> 3 incorrectly replies with UNDEFINED_EPOCH_OFFSET, and Replica 2 truncates to
> HW. If Replica 3 fails before Replica 2 re-fetches the data, this may lead to
> data loss.
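
The lookup behavior described in the report can be illustrated with a small, self-contained sketch. It is not Kafka's actual LeaderEpochFileCache: the class EpochCacheSketch, its fields, the `fixed` flag, and the use of -1 for both sentinels are assumptions made for illustration. It also assumes that epoch 0 started at offset 0. The sketch models Replica 3's cache at the point where it has become leader for epoch 2 but has not yet appended anything, so the cache still ends at epoch 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a leader epoch cache (not Kafka's class).
class EpochCacheSketch {
    static final int UNDEFINED_EPOCH = -1;          // sentinel value assumed for this sketch
    static final long UNDEFINED_EPOCH_OFFSET = -1L; // sentinel value assumed for this sketch

    // Each entry is {epoch, startOffset}, kept in increasing epoch order.
    private final List<long[]> entries = new ArrayList<>();
    private final long logEndOffset;

    EpochCacheSketch(long logEndOffset) {
        this.logEndOffset = logEndOffset;
    }

    // Record the first offset appended under a new epoch; older epochs are ignored.
    void assign(int epoch, long startOffset) {
        if (entries.isEmpty() || entries.get(entries.size() - 1)[0] < epoch) {
            entries.add(new long[] {epoch, startOffset});
        }
    }

    // Returns {epoch, endOffset}. With fixed == false, a requested epoch ahead of
    // the cache yields the undefined pair (the reported bug). With fixed == true,
    // it yields (last epoch in the cache, LEO), as the report proposes.
    long[] endOffsetFor(int requestedEpoch, boolean fixed) {
        long[] undefined = {UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET};
        if (requestedEpoch == UNDEFINED_EPOCH || entries.isEmpty()) {
            return undefined;
        }
        long[] earliest = entries.get(0);
        long[] latest = entries.get(entries.size() - 1);
        if (requestedEpoch == latest[0]) {
            return new long[] {requestedEpoch, logEndOffset};
        }
        if (requestedEpoch > latest[0]) {
            return fixed ? new long[] {latest[0], logEndOffset} : undefined;
        }
        if (requestedEpoch < earliest[0]) {
            return undefined;
        }
        // Requested epoch lies inside the cached range: its end offset is the
        // start offset of the first later epoch, reported with the largest
        // cached epoch that is not greater than the requested one.
        long floorEpoch = earliest[0];
        for (long[] entry : entries) {
            if (entry[0] > requestedEpoch) {
                return new long[] {floorEpoch, entry[1]};
            }
            floorEpoch = entry[0];
        }
        return undefined; // not reached: the latest epoch is greater than the request
    }

    public static void main(String[] args) {
        // Replica 3 from the scenario: leo=10, only epoch 0 in its cache.
        EpochCacheSketch r3 = new EpochCacheSketch(10L);
        r3.assign(0, 0L); // assumed start offset for epoch 0

        long[] buggy = r3.endOffsetFor(1, false);
        long[] fixed = r3.endOffsetFor(1, true);
        System.out.println("buggy: (" + buggy[0] + ", " + buggy[1] + ")"); // buggy: (-1, -1)
        System.out.println("fixed: (" + fixed[0] + ", " + fixed[1] + ")"); // fixed: (0, 10)
    }
}
```

In this sketch the request for epoch 1 gets the undefined pair under the old behavior, which is the case where the follower falls back to truncating to its HW. Under the proposed behavior the reply is (0, 10), which gives the follower a concrete epoch and offset to reconcile against its own log.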