[jira] [Commented] (KAFKA-6361) Fast leader fail over can lead to log divergence between leader and follower

2019-08-20 Thread Stanislav Kozlovski (Jira)


[ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911080#comment-16911080 ]

Stanislav Kozlovski commented on KAFKA-6361:


[~hachikuji] [~apovzner] does this affect every version prior to 2.0.0? I'm asking 
because I'd like to fill out the `Affected Versions` field - it's pretty useful 
when searching through JIRA.

> Fast leader fail over can lead to log divergence between leader and follower
> 
>
> Key: KAFKA-6361
> URL: https://issues.apache.org/jira/browse/KAFKA-6361
> Project: Kafka
> Issue Type: Bug
> Reporter: Jason Gustafson
> Assignee: Anna Povzner
> Priority: Major
> Labels: reliability
> Fix For: 2.0.0
>
>
> We have observed an edge case in the replication failover logic which can 
> cause a replica to permanently fall out of sync with the leader or, in the 
> worst case, actually have localized divergence between logs. This occurs in 
> spite of the improved truncation logic from KIP-101. 
> Suppose we have brokers A and B. Initially A is the leader in epoch 1. It 
> appends two batches: one in the range (0, 10) and the other in the range (11, 
> 20). The first one successfully replicates to B, but the second one does not. 
> In other words, the logs on the brokers look like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> {code}
> Broker A then has a zk session expiration and broker B is elected with epoch 
> 2. It appends a new batch with offsets (11, n) to its local log. So we now 
> have this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets: [11, n], leader epoch: 2
> {code}
> Normally we expect broker A to truncate to offset 11 on becoming the 
> follower, but before it is able to do so, broker B has its own zk session 
> expiration and broker A again becomes leader, now with epoch 3. It then 
> appends a new entry in the range (21, 30). The updated logs look like this:
> {code}
> Broker A:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets [11, 20], leader epoch: 1
> 2: offsets: [21, 30], leader epoch: 3
> Broker B:
> 0: offsets [0, 10], leader epoch: 1
> 1: offsets: [11, n], leader epoch: 2
> {code}
> Now what happens next depends on the last offset of the batch appended in 
> epoch 2. On becoming follower, broker B will send an OffsetForLeaderEpoch 
> request to broker A with epoch 2. Broker A will respond that epoch 2 ends at 
> offset 21. There are three cases:
> 1) n < 20: In this case, broker B will not do any truncation. It will begin 
> fetching from offset n, which will ultimately cause an out of order offset 
> error because broker A will return the full batch beginning from offset 11 
> which broker B will be unable to append.
> 2) n == 20: Again broker B does not truncate. It will fetch from offset 21 
> and everything will appear fine though the logs have actually diverged.
> 3) n > 20: Broker B will attempt to truncate to offset 21. Since this is in 
> the middle of the batch, it will truncate all the way to offset 10. It can 
> begin fetching from offset 11 and everything is fine.
> The case we have actually seen is the first one. The second one would likely 
> go unnoticed in practice and everything is fine in the third case. To 
> work around the issue, we deleted the active segment on the replica, which 
> allowed it to re-replicate consistently from the leader.
> I'm not sure of the best solution for this scenario. Maybe if the leader isn't 
> aware of an epoch, it should always respond with {{UNDEFINED_EPOCH_OFFSET}} 
> instead of using the offset of the next highest epoch. That would cause the 
> follower to truncate using its high watermark. Or perhaps instead of doing 
> so, it could send another OffsetForLeaderEpoch request at the next previous 
> cached epoch and then truncate using that. 
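The three cases above follow from the pre-KIP-279 truncation rule, which (roughly) truncates 
the follower to the minimum of the end offset reported by the leader and the follower's own 
log end offset. Below is a toy model of that rule using the numbers from the A/B scenario; it 
is a simplified sketch with illustrative names, not the actual broker code.

{code}
// Toy model of the pre-KIP-279 truncation decision described above; NOT Kafka code.
public class DivergenceScenario {

    // Assumed rule: truncate to min(end offset the leader reports for the follower's
    // last epoch, the follower's own log end offset).
    static long truncationOffset(long leaderReportedEndOffset, long followerLogEndOffset) {
        return Math.min(leaderReportedEndOffset, followerLogEndOffset);
    }

    public static void main(String[] args) {
        // Leader A answers "epoch 2 ends at offset 21" (the start of its epoch-3 data).
        long leaderAnswer = 21L;

        // Follower B's last batch in epoch 2 is [11, n], so its log end offset is n + 1.
        for (long n : new long[]{15, 20, 25}) {
            long followerLogEndOffset = n + 1;
            long truncateTo = truncationOffset(leaderAnswer, followerLogEndOffset);
            // n = 15: truncate to 16, a no-op -> case 1, later an out-of-order offset error
            // n = 20: truncate to 21, a no-op -> case 2, silent divergence over [11, 20]
            // n = 25: truncate to 21, mid-batch -> case 3, truncation drops back to offset 10
            System.out.println("n=" + n + " -> truncate to offset " + truncateTo);
        }
    }
}
{code}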



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (KAFKA-6361) Fast leader fail over can lead to log divergence between leader and follower

2018-05-09 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469805#comment-16469805 ]

ASF GitHub Bot commented on KAFKA-6361:
---

junrao closed pull request #4882: KAFKA-6361: Fix log divergence between leader and follower after fast leader fail over
URL: https://github.com/apache/kafka/pull/4882
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/clients/src/main/java/org/apache/kafka/common/protocol/CommonFields.java b/clients/src/main/java/org/apache/kafka/common/protocol/CommonFields.java
index a436dff1d03..7f43caf8696 100644
--- a/clients/src/main/java/org/apache/kafka/common/protocol/CommonFields.java
+++ b/clients/src/main/java/org/apache/kafka/common/protocol/CommonFields.java
@@ -26,6 +26,7 @@
     public static final Field.Int32 PARTITION_ID = new Field.Int32("partition", "Topic partition id");
     public static final Field.Int16 ERROR_CODE = new Field.Int16("error_code", "Response error code");
     public static final Field.NullableStr ERROR_MESSAGE = new Field.NullableStr("error_message", "Response error message");
+    public static final Field.Int32 LEADER_EPOCH = new Field.Int32("leader_epoch", "The epoch");
 
     // Group APIs
     public static final Field.Str GROUP_ID = new Field.Str("group_id", "The unique group identifier");
diff --git a/clients/src/main/java/org/apache/kafka/common/requests/EpochEndOffset.java b/clients/src/main/java/org/apache/kafka/common/requests/EpochEndOffset.java
index 0965e3612d8..ce938aad4f1 100644
--- a/clients/src/main/java/org/apache/kafka/common/requests/EpochEndOffset.java
+++ b/clients/src/main/java/org/apache/kafka/common/requests/EpochEndOffset.java
@@ -20,24 +20,29 @@
 
 import static org.apache.kafka.common.record.RecordBatch.NO_PARTITION_LEADER_EPOCH;
 
+import java.util.Objects;
+
 /**
  * The offset, fetched from a leader, for a particular partition.
  */
 
 public class EpochEndOffset {
     public static final long UNDEFINED_EPOCH_OFFSET = NO_PARTITION_LEADER_EPOCH;
-    public static final int UNDEFINED_EPOCH = -1;
+    public static final int UNDEFINED_EPOCH = NO_PARTITION_LEADER_EPOCH;
 
     private Errors error;
+    private int leaderEpoch;  // introduced in V1
     private long endOffset;
 
-    public EpochEndOffset(Errors error, long endOffset) {
+    public EpochEndOffset(Errors error, int leaderEpoch, long endOffset) {
         this.error = error;
+        this.leaderEpoch = leaderEpoch;
         this.endOffset = endOffset;
     }
 
-    public EpochEndOffset(long endOffset) {
+    public EpochEndOffset(int leaderEpoch, long endOffset) {
         this.error = Errors.NONE;
+        this.leaderEpoch = leaderEpoch;
         this.endOffset = endOffset;
     }
 
@@ -53,10 +58,15 @@ public long endOffset() {
         return endOffset;
     }
 
+    public int leaderEpoch() {
+        return leaderEpoch;
+    }
+
     @Override
     public String toString() {
         return "EpochEndOffset{" +
                 "error=" + error +
+                ", leaderEpoch=" + leaderEpoch +
                 ", endOffset=" + endOffset +
                 '}';
     }
@@ -68,14 +78,13 @@ public boolean equals(Object o) {
 
         EpochEndOffset that = (EpochEndOffset) o;
 
-        if (error != that.error) return false;
-        return endOffset == that.endOffset;
+        return Objects.equals(error, that.error)
+               && Objects.equals(leaderEpoch, that.leaderEpoch)
+               && Objects.equals(endOffset, that.endOffset);
     }
 
     @Override
     public int hashCode() {
-        int result = (int) error.code();
-        result = 31 * result + (int) (endOffset ^ (endOffset >>> 32));
-        return result;
+        return Objects.hash(error, leaderEpoch, endOffset);
     }
 }
diff --git a/clients/src/main/java/org/apache/kafka/common/requests/OffsetsForLeaderEpochRequest.java b/clients/src/main/java/org/apache/kafka/common/requests/OffsetsForLeaderEpochRequest.java
index d0585bed6d5..651416d97a7 100644
--- a/clients/src/main/java/org/apache/kafka/common/requests/OffsetsForLeaderEpochRequest.java
+++ b/clients/src/main/java/org/apache/kafka/common/requests/OffsetsForLeaderEpochRequest.java
@@ -50,8 +50,11 @@
     private static final Schema OFFSET_FOR_LEADER_EPOCH_REQUEST_V0 = new Schema(
             new Field(TOPICS_KEY_NAME, new ArrayOf(OFFSET_FOR_LEADER_EPOCH_REQUEST_TOPIC_V0), "An array of topics to get epochs for"));
 
+    /* v1 request is the same as v0. Per-partition leader epoch has been added to response */
+    private static final Schema OFFSET_FOR_LEADER_EPOCH_REQUEST_V1 = OFFSET_FOR_LEADER_EPOCH_REQUEST_V0;
+
     public static Schema[] sche
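For orientation, here is a small usage sketch of the V1 EpochEndOffset introduced in the 
diff above. The constructor and accessors come from the diff itself; the wrapper class and 
the concrete values (epoch 1 ending at offset 21, as in the JIRA description) are 
illustrative only.

{code}
import org.apache.kafka.common.protocol.Errors;
import org.apache.kafka.common.requests.EpochEndOffset;

public class EpochEndOffsetV1Example {
    public static void main(String[] args) {
        // Leader side: asked about epoch 2, reply with the largest known epoch <= 2
        // (epoch 1 in the JIRA scenario) and that epoch's end offset (21).
        EpochEndOffset answer = new EpochEndOffset(Errors.NONE, 1, 21L);

        // Follower side: V1 exposes both fields, so the follower can tell whether it
        // actually knows the epoch that the returned end offset belongs to.
        System.out.println("epoch=" + answer.leaderEpoch() + ", endOffset=" + answer.endOffset());
    }
}
{code}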

[jira] [Commented] (KAFKA-6361) Fast leader fail over can lead to log divergence between leader and follower

2018-04-16 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439970#comment-16439970 ]

ASF GitHub Bot commented on KAFKA-6361:
---

apovzner opened a new pull request #4882: KAFKA-6361: Fix log divergence between leader and follower after fast leader fail over
URL: https://github.com/apache/kafka/pull/4882
 
 
   WIP - will add a few more unit tests.
   
   Implementation of KIP-279 as described here: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-279%3A+Fix+log+divergence+between+leader+and+follower+after+fast+leader+fail+over
   
   In summary:
   - Added leader_epoch to OFFSET_FOR_LEADER_EPOCH_RESPONSE
   - The leader replies with the pair (largest epoch less than or equal to the 
   requested epoch, the end offset of that epoch)
   - If the follower does not know about the epoch that the leader replied with, it 
   truncates to the end offset of its largest leader epoch below that epoch and sends 
   another OffsetForLeaderEpoch request containing that smaller epoch (see the sketch 
   after this list)
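   A rough sketch of that follower-side loop is below. It is illustrative only, not the 
   PR's code: the EpochEndOffset accessors match the diff in this PR, while the local 
   epoch cache, the method names, and the fallback behaviour are assumptions made for 
   the example.

   {code}
import java.util.NavigableMap;
import java.util.function.IntFunction;

import org.apache.kafka.common.requests.EpochEndOffset;

final class Kip279FollowerSketch {
    // Toy local state: epoch -> end offset of that epoch in the follower's log.
    private final NavigableMap<Integer, Long> localEpochEndOffsets;
    private final long logEndOffset;

    Kip279FollowerSketch(NavigableMap<Integer, Long> localEpochEndOffsets, long logEndOffset) {
        this.localEpochEndOffsets = localEpochEndOffsets;
        this.logEndOffset = logEndOffset;
    }

    /**
     * leaderLookup stands in for one OffsetsForLeaderEpoch round trip: for the requested
     * epoch, the leader returns the largest epoch <= that epoch together with its end offset.
     */
    long resolveTruncationOffset(IntFunction<EpochEndOffset> leaderLookup) {
        int requestedEpoch = localEpochEndOffsets.lastKey();   // start from the latest local epoch
        long truncateTo = logEndOffset;
        while (true) {
            EpochEndOffset answer = leaderLookup.apply(requestedEpoch);
            if (localEpochEndOffsets.containsKey(answer.leaderEpoch())) {
                // The follower knows the epoch the leader replied with: truncate to the
                // smaller of the leader's end offset and what is left locally (assumed,
                // in the spirit of KIP-101).
                return Math.min(answer.endOffset(), truncateTo);
            }
            // Unknown epoch: truncate to the end offset of the largest local epoch below it,
            // then send another OffsetForLeaderEpoch request for that epoch.
            Integer previousEpoch = localEpochEndOffsets.lowerKey(answer.leaderEpoch());
            if (previousEpoch == null)
                return 0L;                                     // toy fallback: no common epoch at all
            truncateTo = localEpochEndOffsets.get(previousEpoch);
            requestedEpoch = previousEpoch;
        }
    }
}
   {code}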
   
   Added integration test 
   EpochDrivenReplicationProtocolAcceptanceTest.logsShouldNotDivergeOnUncleanLeaderElections, 
   which performs 3 fast leader changes with unclean leader election enabled and 
   min ISR of 1. The test failed before the fix was implemented.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (KAFKA-6361) Fast leader fail over can lead to log divergence between leader and follower

2017-12-14 Thread Jun Rao (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291812#comment-16291812 ]

Jun Rao commented on KAFKA-6361:


[~hachikuji], thanks for the info. The problem in the description can indeed 
happen if the first leader change is due to preferred leader election, in which 
case the ISR won't change.

To address this issue, we could send another OffsetForLeaderEpoch request with the 
previous leader epoch, as you suggested. This may require multiple rounds of 
OffsetForLeaderEpoch requests. Another way is to change the OffsetForLeaderEpoch 
request to send a sequence of (leader epoch, start offset) pairs for the epochs 
between the follower's HW and LEO. On the leader side, we find the longest 
consecutive sequence of leader epochs whose start offsets match the leader's, and 
we then return the end offset of the last matching leader epoch.
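A rough sketch of that leader-side matching step, assuming both sides can expose their 
epoch caches as (epoch, start offset) pairs; all names here are made up for illustration 
and this is not meant as a proposal-quality implementation.

{code}
import java.util.List;
import java.util.NavigableMap;
import java.util.OptionalLong;
import java.util.function.IntToLongFunction;

final class LongestMatchingEpochPrefix {
    static final class EpochStartOffset {
        final int epoch;
        final long startOffset;
        EpochStartOffset(int epoch, long startOffset) { this.epoch = epoch; this.startOffset = startOffset; }
    }

    /**
     * The follower sends the (epoch, start offset) pairs it has between its HW and LEO;
     * the leader finds the longest consecutive prefix whose start offsets match its own
     * epoch cache and returns the end offset of the last matching epoch.
     */
    static OptionalLong matchingEndOffset(List<EpochStartOffset> followerEpochs,
                                          NavigableMap<Integer, Long> leaderEpochStartOffsets,
                                          IntToLongFunction leaderEndOffsetForEpoch) {
        Integer lastMatchingEpoch = null;
        for (EpochStartOffset e : followerEpochs) {
            Long leaderStart = leaderEpochStartOffsets.get(e.epoch);
            if (leaderStart == null || leaderStart != e.startOffset)
                break;                               // first mismatch ends the matching prefix
            lastMatchingEpoch = e.epoch;
        }
        if (lastMatchingEpoch == null)
            return OptionalLong.empty();             // nothing matches; the follower falls back (e.g. to its HW)
        return OptionalLong.of(leaderEndOffsetForEpoch.applyAsLong(lastMatchingEpoch));
    }
}
{code}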

The above approach doesn't fully fix the issue for a compacted topic. When all 
messages for a leader epoch are removed, we may lose that leader epoch, so the 
leader epochs on the follower and the leader may not match perfectly. One way to 
address this is to preserve the offset of the first message in a leader epoch 
during log cleaning. That can probably be done separately since it rarely causes 
problems.


[jira] [Commented] (KAFKA-6361) Fast leader fail over can lead to log divergence between leader and follower

2017-12-13 Thread Jason Gustafson (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290268#comment-16290268 ]

Jason Gustafson commented on KAFKA-6361:


Unclean leader election was disabled. It may not have been a session expiration 
that caused B to become leader (I assumed this, but it's not clear from the logs, 
and I haven't seen the controller logs yet). In any case, when broker B took over, 
broker A was still in the ISR. Broker B appended the entry as described above and 
then attempted to shrink the ISR, but it failed to do so because of an invalid 
cached zk version. Broker A had already become leader again at that point.



[jira] [Commented] (KAFKA-6361) Fast leader fail over can lead to log divergence between leader and follower

2017-12-13 Thread Jun Rao (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290252#comment-16290252 ]

Jun Rao commented on KAFKA-6361:


Was unclean leader election enabled in this case? When broker B takes over as the 
new leader because broker A's ZK session has expired, the controller is supposed 
to also shrink the ISR to just {B}. If unclean leader election is disabled, then 
when broker B's ZK session expires, broker A can't take over as the new leader 
since it's not in the ISR.

In KIP-101, we didn't solve the problem of log divergence when an unclean leader 
election occurs 
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation#KIP-101-AlterReplicationProtocoltouseLeaderEpochratherthanHighWatermarkforTruncation-Appendix(a):PossibilityforDivergentLogswithLeaderEpochs&UncleanLeaderElection).
Solving that problem requires more thought, especially with compacted topics, 
where certain leader epochs in the middle could have been fully garbage collected.
