[ https://issues.apache.org/jira/browse/KAFKA-15376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785956#comment-17785956 ]

Kamal Chandraprakash edited comment on KAFKA-15376 at 11/14/23 4:29 PM:
------------------------------------------------------------------------

[~divijvaidya] 

The [example|https://github.com/apache/kafka/pull/13561#discussion_r1293286722] 
provided in the discussion is misleading. Let's split it into two cases to make 
it easier to follow:

Assume that there are two replicas, Broker A and Broker B, for partition tp0:

*Case-1*
Both replicas A and B are in sync on startup and hold leader-epoch 0. Then the 
brokers start to go down in a ping-pong fashion. Each broker will hold the 
following epochs in its leader-epoch-checkpoint file:

A: 0, 2, 4, 6, 8
B: 0, 1, 3, 5, 7
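
Purely for illustration (the start offsets below are made up), Broker A's 
leader-epoch-checkpoint file would then look roughly like this: a version line, 
an entry count, and one "epoch start-offset" pair per line:
{code:java}
0
5
0 0
2 150
4 320
6 500
8 640
{code}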

Since this is unclean leader election, the logs of Broker A and B might have 
diverged. As long as either of them is online, it continues to serve all the 
records according to its leader-epoch-checkpoint file. Once both brokers are 
online, the follower truncates its log up to the largest common log-prefix 
offset so that the leader's and follower's logs no longer diverge. In this case, 
we continue to serve the data from remote storage, as no segments will be 
removed due to leader-epoch-cache truncation since both of them hold LE0.

Note that the approach taken here is similar to the local log: each broker 
serves the log it has until the replicas sync with each other.
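
To make the truncation step concrete, here is a minimal sketch (not Kafka's 
actual code; the class, method, and offsets below are made up for illustration) 
of how the truncation point, i.e. the end of the largest common log prefix, 
could be derived from two epoch lineages:
{code:java}
import java.util.List;

// Illustrative only, not Kafka's implementation. Models a leader-epoch
// checkpoint as a list of (epoch, startOffset) entries and computes the
// offset where two lineages stop agreeing (the largest common log prefix).
public class EpochLineage {

    record EpochEntry(int epoch, long startOffset) { }

    // The follower would truncate its log back to the returned offset.
    static long divergenceOffset(List<EpochEntry> leader,
                                 List<EpochEntry> follower,
                                 long followerLogEndOffset) {
        int i = 0;
        while (i < leader.size() && i < follower.size()
                && leader.get(i).equals(follower.get(i))) {
            i++;
        }
        // Boundary where each lineage moves past the common prefix.
        long leaderBoundary = i < leader.size() ? leader.get(i).startOffset() : Long.MAX_VALUE;
        long followerBoundary = i < follower.size() ? follower.get(i).startOffset() : Long.MAX_VALUE;
        return Math.min(followerLogEndOffset, Math.min(leaderBoundary, followerBoundary));
    }

    public static void main(String[] args) {
        // Case-1 style: both lineages share epoch 0, then diverge.
        List<EpochEntry> a = List.of(new EpochEntry(0, 0), new EpochEntry(2, 150), new EpochEntry(4, 320));
        List<EpochEntry> b = List.of(new EpochEntry(0, 0), new EpochEntry(1, 120), new EpochEntry(3, 260));
        // Only (epoch 0, offset 0) is common, so B truncates back to offset 120.
        System.out.println(divergenceOffset(a, b, 400)); // 120
    }
}
{code}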

*Case-2*
The replicas A and B are out of sync on startup and the follower does not hold 
leader-epoch 0. Assume that Broker A is the leader and B is the follower with 
no data for the partition (empty disk). When Broker A goes down, the partition 
goes offline; once B is elected as the unclean leader, the log-end-offset of 
the partition is reset back to 0.

From the example provided in the discussion:

At T1, Broker A
{code:java}
-----------------------------
leader-epoch | start-offset |
-----------------------------
     0              0
     1              180
     2              400
----------------------------- {code}
At T2, on Broker B, the start offset is reset back to 0. (Note that the leader 
does not interact with remote storage to find the next offset; this is a 
trade-off between availability and durability.)
{code:java}
-----------------------------
leader-epoch | start-offset |
-----------------------------
     3              0
     4              780
     6              900
     7              990                                         
----------------------------- {code}
Now, if we retain the data for both lineages and keep bouncing the brokers in a 
ping-pong fashion, we will serve diverged data back to the client for the same 
fetch offset, depending on which broker is online. Once the replicas start to 
interact with each other, they truncate the remote data themselves based on the 
current leader-epoch lineage.
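
As a rough illustration of that lineage-based cleanup (a sketch, not the actual 
RemoteLogManager code; the types and values are stand-ins), the check below 
keeps a remote segment only if every epoch it carries is part of the current 
leader-epoch lineage, so the stale pre-election segments in case-2 become 
eligible for deletion instead of being served:
{code:java}
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only. A remote segment carries the leader epochs it contains
// (epoch -> start offset); the current lineage is the broker's
// leader-epoch-checkpoint content.
public class RemoteLineageCheck {

    static boolean isWithinLineage(NavigableMap<Integer, Long> segmentEpochs,
                                   NavigableMap<Integer, Long> currentLineage) {
        for (Map.Entry<Integer, Long> entry : segmentEpochs.entrySet()) {
            Long lineageStart = currentLineage.get(entry.getKey());
            // Epoch unknown to the current lineage, or segment data starts
            // before the lineage says this epoch began -> stale lineage.
            if (lineageStart == null || entry.getValue() < lineageStart) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Current lineage after the unclean election in case-2.
        NavigableMap<Integer, Long> lineage =
                new TreeMap<>(Map.of(3, 0L, 4, 780L, 6, 900L, 7, 990L));
        // Segment uploaded by the old leader under epochs 0 and 1.
        NavigableMap<Integer, Long> staleSegment = new TreeMap<>(Map.of(0, 0L, 1, 180L));
        // Segment uploaded under the new lineage.
        NavigableMap<Integer, Long> freshSegment = new TreeMap<>(Map.of(4, 780L));

        System.out.println(isWithinLineage(staleSegment, lineage)); // false -> eligible for deletion
        System.out.println(isWithinLineage(freshSegment, lineage)); // true  -> keep serving
    }
}
{code}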

The example provided in the discussion applies only to case-2, where the 
replicas have never interacted with each other.



> Explore options of removing data earlier to the current leader's leader epoch 
> lineage for topics enabled with tiered storage.
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-15376
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15376
>             Project: Kafka
>          Issue Type: Task
>          Components: core
>            Reporter: Satish Duggana
>            Priority: Major
>             Fix For: 3.7.0
>
>
> Followup on the discussion thread:
> [https://github.com/apache/kafka/pull/13561#discussion_r1288778006]
>  


