[ https://issues.apache.org/jira/browse/KAFKA-15376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785956#comment-17785956 ]
Kamal Chandraprakash edited comment on KAFKA-15376 at 11/14/23 4:29 PM:
------------------------------------------------------------------------

[~divijvaidya] The [example|https://github.com/apache/kafka/pull/13561#discussion_r1293286722] provided in the discussion is misleading. Let's split it into two cases to make it easier to follow. Assume there are two replicas, Broker A and Broker B, for partition tp0.

*Case-1*

Both replicas A and B are in sync on startup and both hold leader-epoch 0. Then the brokers start going down in ping-pong fashion. Each broker ends up holding the following epochs in its leader-epoch-checkpoint file:

A: 0, 2, 4, 6, 8
B: 0, 1, 3, 5, 7

Since these are unclean leader elections, the logs of Broker A and B might have diverged. As long as either of them is online, it continues to serve all the records according to its leader-epoch-checkpoint file. Once both brokers are back online, the follower truncates its log up to the largest common log-prefix offset so that the logs no longer diverge between leader and follower.

In this case, we continue to serve the data from remote storage, since both replicas hold LE0 and no segments will be removed due to leader-epoch-cache truncation. Note that the approach taken here is similar to the local log, where each broker serves the log it has until the replicas sync with each other.

*Case-2*

Replicas A and B are out of sync on startup, and the follower does not hold leader-epoch 0. Assume that Broker A is the leader and B is the follower, holding no data for the partition (empty disk). When Broker A goes down, the partition goes offline, B is elected as the unclean leader, and the log-end-offset of the partition is reset back to 0.
From the example provided in the discussion:

At T1, Broker A:
{code:java}
-----------------------------
leader-epoch | start-offset |
-----------------------------
      0              0
      1            180
      2            400
-----------------------------
{code}
At T2, on Broker B, the start offset is reset back to 0. (Note that the leader does not interact with remote storage to find the next offset; this is a trade-off between availability and durability.)
{code:java}
-----------------------------
leader-epoch | start-offset |
-----------------------------
      3              0
      4            780
      6            900
      7            990
-----------------------------
{code}
Now, if we retain the data for both lineages and ping-pong the brokers, we will serve diverged data back to the client for the same fetch offset, depending on which broker is online. Once the replicas start to interact with each other, they truncate the remote data themselves based on the current leader-epoch lineage.

The example provided in the discussion applies only to case-2, where the replicas have never interacted with each other even once.
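The divergence in case-2 can be illustrated by resolving the same fetch offset against both checkpoint lineages from the tables above. The sketch below is illustrative only (the class and method names are hypothetical, not Kafka's actual leader-epoch cache API): it picks the epoch whose start offset is the largest one at or below the fetch offset, and shows that the two brokers would serve offset 800 from different epochs.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Kafka's real LeaderEpochFileCache: a lineage maps
// each leader epoch to the start offset of that epoch, ordered by epoch.
public class EpochLineageDemo {
    // Returns the epoch that covers the given offset: the last entry whose
    // start offset is <= offset (start offsets ascend along with epochs).
    static Integer epochForOffset(TreeMap<Integer, Long> lineage, long offset) {
        Integer result = null;
        for (Map.Entry<Integer, Long> e : lineage.entrySet()) {
            if (e.getValue() <= offset) result = e.getKey();
            else break;
        }
        return result;
    }

    public static void main(String[] args) {
        // Broker A's checkpoint at T1 (from the table above)
        TreeMap<Integer, Long> brokerA = new TreeMap<>(Map.of(0, 0L, 1, 180L, 2, 400L));
        // Broker B's checkpoint at T2, after the unclean election reset the
        // log-end-offset back to 0 (from the table above)
        TreeMap<Integer, Long> brokerB = new TreeMap<>(Map.of(3, 0L, 4, 780L, 6, 900L, 7, 990L));

        long fetchOffset = 800L;
        // The same fetch offset resolves to different epochs on the two
        // lineages, so the records served depend on which broker is online.
        System.out.println("A serves epoch " + epochForOffset(brokerA, fetchOffset)); // epoch 2
        System.out.println("B serves epoch " + epochForOffset(brokerB, fetchOffset)); // epoch 4
    }
}
```

Because the two lineages share no common epoch at all, there is no common log-prefix offset to truncate to, which is why this case behaves differently from case-1.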
> Explore options of removing data earlier to the current leader's leader epoch
> lineage for topics enabled with tiered storage.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-15376
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15376
>             Project: Kafka
>          Issue Type: Task
>          Components: core
>            Reporter: Satish Duggana
>            Priority: Major
>             Fix For: 3.7.0
>
> Followup on the discussion thread:
> [https://github.com/apache/kafka/pull/13561#discussion_r1288778006]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)