Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
> > I am ok with adding a note in the KIP but the note should say that it
> > has an elevated risk for this scenario due to increased probability of
> > having an aggressive local cleanup with Tiered Storage.

I would like to clarify that there is no elevated risk because of this KIP.
This risk already exists with the current protocol, where an empty follower
for a tiered-storage-enabled topic catches up with the leader by replicating
data beginning from the leader's local log start offset. In the current KIP,
the follower instead starts from the last tiered offset, which is usually
well ahead of the leader's local log start offset and can never be behind
it. Hence this offset is more likely to be available in the leader's local
log than the local log start offset in this rare scenario. The risk is
therefore actually lower when the follower starts from the last tiered
offset instead of the local log start offset.

Abhijeet.

On Wed, Jul 24, 2024 at 7:26 PM Divij Vaidya wrote:
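The availability argument above can be illustrated with a toy sketch (all numbers and names are illustrative, not Kafka code): because a local segment is deleted only after it has been tiered, the last tiered offset can never be behind the local log start offset, so it is the safer bootstrap point when aggressive local cleanup races the follower.

```python
# Toy illustration of the bootstrap comparison above (illustrative numbers;
# not Kafka code). Local segments are deleted only after upload, so the
# leader's local-log-start-offset never exceeds last-tiered-offset + 1.

leader = {"local_log_start": 50, "last_tiered": 120}

old_start = leader["local_log_start"]  # current protocol: local log start offset
new_start = leader["last_tiered"]      # this KIP: last tiered offset

# Aggressive local cleanup runs before the follower's first fetch arrives.
leader["local_log_start"] = 100

def still_local(offset):
    """True if the leader can still serve `offset` from its local log."""
    return offset >= leader["local_log_start"]

assert not still_local(old_start)  # offset 50 is gone: Out of Range, retry loop
assert still_local(new_start)      # offset 120 survives: fetch succeeds
```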
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
The difference between the two scenarios you mentioned is that with Tiered
Storage the chances of hitting this scenario increase, since a user is
likely to have an aggressive setting for local disk data cleanup, which
would not be the case for empty followers catching up in a non-tiered
storage world.

I am ok with adding a note in the KIP, but the note should say that it has
an elevated risk for this scenario due to the increased probability of
having an aggressive local cleanup with Tiered Storage.

--
Divij Vaidya

On Wed, Jul 24, 2024 at 1:22 PM Abhijeet Kumar wrote:
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi Divij,

The rare scenario we are discussing is similar to an empty follower trying
to catch up with the leader for a topic that does not have tiered storage
enabled. Consider the following steps:

1. Follower requests offset 0 from the leader.
2. Offset 0 is no longer valid on the leader because its log start offset
   is 10, so the leader throws an Out of Range error.
3. Follower fetches the earliest offset from the leader, gets 10, and
   resets its fetch offset to 10.
4. Follower requests offset 10 from the leader, but the previous log start
   offset (10) has been deleted from the leader and the new log start
   offset is 15. Hence the leader throws an Out of Range error, and the
   follower goes back to step 3.

Even in this scenario, theoretically, the follower may never be able to
catch up. Since this is an existing problem that affects even regular
replication for topics without tiered storage, should we take this up
separately?
I can add a small note in the KIP saying that this follower-catchup
behavior is similar to the scenario when tiered storage is not enabled.

Regards,
Abhijeet.

On Tue, Jul 23, 2024 at 4:49 PM Divij Vaidya wrote:
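The retry loop in steps 1-4 above can be sketched with a toy simulation (helper names are illustrative, not Kafka code). The follower escapes the loop only on a round where the leader's log start offset has not advanced past the follower's freshly reset fetch offset; if local deletion always outpaces the follower, the loop never terminates, which is the theoretical starvation case under discussion.

```python
import itertools

# Toy model of the follower catch-up loop described above (steps 1-4).
# All names are illustrative; this is not Kafka code. Each round compresses
# one fetch attempt plus one earliest-offset lookup into a single step.

def catch_up(leader_log_start_offsets, max_rounds=100):
    """`leader_log_start_offsets` yields the leader's log start offset as
    observed on each follower fetch. Returns the fetch offset at which the
    follower's fetch finally succeeded, or None if it never caught up."""
    fetch_offset = 0
    for _ in range(max_rounds):
        log_start = next(leader_log_start_offsets)
        if fetch_offset >= log_start:   # offset still in the leader's log:
            return fetch_offset         # fetch succeeds, replication proceeds
        fetch_offset = log_start        # Out of Range -> reset to earliest
    return None

# Deletion eventually pauses: the follower catches up at offset 15.
assert catch_up(iter([10, 15, 15])) == 15

# Deletion always outpaces the follower: never terminates (bounded here
# by max_rounds), i.e. the theoretical starvation scenario.
assert catch_up(itertools.count(start=10, step=5), max_rounds=50) is None
```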
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Thank you for your response Abhijeet. You have understood the scenario
correctly. For the purpose of discussion, please consider the latter case
where offset 11 is not available on the leader anymore (it got cleaned up
locally since the last tiered offset is 15). In such a case, you mentioned
the follower will eventually be able to catch up with the leader by
resetting its fetch offset until the offset is available in the leader's
local log. Correct me if I am wrong, but it is not guaranteed that it will
eventually catch up because, theoretically, every time it asks for a newer
fetch offset, the leader may have deleted it locally. I understand that it
is an edge-case scenario which will only happen with configurations for
small segment sizes and aggressive cleaning, but nevertheless it is a
possible scenario.

Do you agree that theoretically it is possible for the follower to loop
such that it is never able to catch up?

We can proceed with the KIP with an understanding that this scenario is
rare and we are willing to accept the risk of it. In such a case, we should
add a detection mechanism for this scenario in the KIP, so that if we
encounter it, the user has a way to detect (and mitigate) it.
Alternatively, we can change the KIP design to ensure that we never
encounter this scenario. Given the rarity of the scenario, I am ok with
having a detection mechanism (a metric?) in place and having this scenario
documented as an acceptable risk in the current design.

--
Divij Vaidya

On Tue, Jul 23, 2024 at 11:55 AM Abhijeet Kumar <abhijeet.cse@gmail.com> wrote:
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi Divij,

It seems there is some confusion about the new protocol for fetching from
the tiered offset. The scenario you are highlighting is where:

Leader's Log Start Offset = 0
Last Tiered Offset = 10

The following is the sequence of events that will happen:

1. Follower requests offset 0 from the leader.
2. Assuming offset 0 is not available locally (to arrive at your scenario),
   the leader returns OffsetMovedToTieredStorageException.
3. Follower fetches the earliest-pending-upload offset and receives 11.
4. Follower builds aux state from [0-10] and sets the fetch offset to 11
   (this step corresponds to step 3 in your email).

At this stage, even if the leader has uploaded more data and the
last-tiered offset has changed (say to 15), it will not matter, because
offset 11 should still be available on the leader, and when the follower
requests data with fetch offset 11, the leader will return a valid
partition data response which the follower can consume to proceed further.
Even if offset 11 is not available anymore, the follower will eventually be
able to catch up with the leader by resetting its fetch offset until the
offset is available in the leader's local log. Once it catches up,
replication on the follower can proceed.

Regards,
Abhijeet.

On Tue, Jul 2, 2024 at 3:55 PM Divij Vaidya wrote:
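The no-cycle property argued above can be sketched as a toy model (class and method names are illustrative, not Kafka's implementation): the fetch offset is captured once from earliest-pending-upload and does not chase later uploads, so an advancing last-tiered offset alone cannot restart the bootstrap.

```python
# Toy model of the fetch path proposed in the KIP (illustrative names only;
# a sketch of the discussion, not Kafka's implementation).

class OffsetMovedToTieredStorage(Exception):
    pass

class Leader:
    def __init__(self, local_log_start, last_tiered):
        self.local_log_start = local_log_start  # earliest locally retained offset
        self.last_tiered = last_tiered          # last offset uploaded to remote

    @property
    def earliest_pending_upload(self):
        return self.last_tiered + 1

    def fetch(self, offset):
        if offset < self.local_log_start:
            raise OffsetMovedToTieredStorage()
        return f"records from {offset}"

def bootstrap_follower(leader):
    fetch_offset = 0
    try:
        return leader.fetch(fetch_offset)
    except OffsetMovedToTieredStorage:
        # Build aux state for [0, last_tiered] from remote metadata, then
        # pin the fetch offset to earliest-pending-upload. The value is
        # captured once; it does not chase later uploads.
        fetch_offset = leader.earliest_pending_upload
    # Tiering advances meanwhile (last_tiered moves 10 -> 15), but that by
    # itself does not invalidate fetch_offset; only local deletion could.
    leader.last_tiered = 15
    return leader.fetch(fetch_offset)

leader = Leader(local_log_start=5, last_tiered=10)
assert bootstrap_follower(leader) == "records from 11"
```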
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Following up on my previous comment: an alternative approach could be to
have an empty follower start replication from the last-tiered-offset
(already available as part of ListOffsets), inclusive. On the leader, we
change the logic (based on a configurable threshold) for when we return
OffsetMovedToTieredStorageException vs. when we fetch from remote and
return data to the follower. As an example, the solution works as follows:

1. Follower asks the leader for fetch offset Y.
2. Leader checks whether (last-tiered-offset - Y > Z), where Z is a
   configured threshold. If true, it returns
   OffsetMovedToTieredStorageException and the follower asks again with
   fetch offset = last-tiered-offset. If false, the leader fetches offset Y
   from remote and returns it to the follower.

The advantages of this approach over the proposed solution are:
1. We won't end up in the cyclic situation mentioned in my previous email.
2. It works with the existing protocol, which returns last-tiered-offset,
   i.e. we won't have to change the protocol to add the new
   Earliest-Pending-Upload-Offset.

The disadvantage of this approach over the proposed solution is:
1. On the leader, we may have to fetch some data from remote to respond to
   the follower. The amount of this data can be controlled via the
   configured value Z, which can be set based on how aggressive the
   upload/archival process is.

--
Divij Vaidya

On Tue, Jul 2, 2024 at 12:25 PM Divij Vaidya wrote:
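The threshold-based leader decision in step 2 above can be sketched as follows (Z and the helper names are illustrative, not an actual Kafka config): the leader either redirects a far-behind follower to last-tiered-offset or pays a bounded remote read of at most Z offsets.

```python
# Sketch of the threshold-based leader decision proposed above (Z and the
# helper names are illustrative; not an actual Kafka config or API).

LAST_TIERED_OFFSET = 100
Z = 20  # configured threshold

def serve_follower_fetch(y):
    """Decide how the leader answers a follower fetch at offset y."""
    if LAST_TIERED_OFFSET - y > Z:
        # Follower is too far behind the tiered portion: tell it to restart
        # from last-tiered-offset instead of serving a long remote read.
        return ("OffsetMovedToTieredStorageException", LAST_TIERED_OFFSET)
    # Close enough: pay a bounded remote read (at most Z offsets behind the
    # last tiered offset) and return data directly.
    return ("records", y)

# Far behind: redirected once to last-tiered-offset; the follower's next
# fetch at offset 100 is then served with data, so no cycle can form.
assert serve_follower_fetch(0) == ("OffsetMovedToTieredStorageException", 100)
assert serve_follower_fetch(100) == ("records", 100)
# Within the threshold: served from remote.
assert serve_follower_fetch(90) == ("records", 90)
```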
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi folks. I am late to the party, but I have a question on the proposal. How are we preventing a situation such as the following:

1. Empty follower asks the leader for offset 0.
2. Leader compares 0 with last-tiered-offset, and responds with 11 (where 10 is last-tiered-offset) and an OffsetMovedToTieredException.
3. Follower builds aux state from [0-10] and sets the fetch offset to 11.
4. But the leader has already uploaded more data, and the new last-tiered-offset is now 15.
5. Go back to step 2.

This could cause a cycle where the replica is stuck trying to reconcile with the leader.

-- Divij Vaidya

On Fri, Apr 26, 2024 at 7:28 AM Abhijeet Kumar wrote:
> Thank you all for your comments. As all the comments in the thread are
> addressed, I am starting a Vote thread for the KIP. Please have a look.
>
> Regards.
> Abhijeet.
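The cycle Divij describes can be sketched as a loop whose termination depends entirely on whether the leader keeps advancing last-tiered-offset between the follower's fetch rounds. A minimal simulation under assumed semantics (the function and the tiering schedule are illustrative, not Kafka APIs):

```python
def bootstrap_follower(tier_schedule):
    """Simulate an empty follower reconciling with the leader.

    tier_schedule: the last-tiered-offset the leader reports on each
    successive fetch round. Each round where fetch_offset is still at or
    below last-tiered-offset, the leader responds with
    OffsetMovedToTiered, and the follower rebuilds aux state and jumps
    to last-tiered-offset + 1. The loop ends when the fetch offset lands
    inside the leader's local log.
    """
    fetch_offset = 0
    rounds = 0
    for last_tiered in tier_schedule:
        rounds += 1
        if fetch_offset > last_tiered:
            break  # fetch offset is past the tiered portion: fetch succeeds
        fetch_offset = last_tiered + 1
    return fetch_offset, rounds

# Leader tiers up to 10, then 15, then 20, then pauses: the follower
# converges on the 4th round. With a schedule that advances past the
# follower's fetch offset on every round, the loop never breaks, which
# is exactly the stuck-reconciliation concern raised above.
offset, rounds = bootstrap_follower([10, 15, 20, 20])
```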
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Thank you all for your comments. As all the comments in the thread are addressed, I am starting a Vote thread for the KIP. Please have a look.

Regards.
Abhijeet.

On Thu, Apr 25, 2024 at 6:08 PM Luke Chen wrote:
> Hi, Abhijeet,
>
> Thanks for the update.
>
> I have no more comments.
>
> Luke
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi, Abhijeet,

Thanks for the update.

I have no more comments.

Luke

On Thu, Apr 25, 2024 at 4:21 AM Jun Rao wrote:
> Hi, Abhijeet,
>
> Thanks for the updated KIP. It looks good to me.
>
> Jun
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi, Abhijeet,

Thanks for the updated KIP. It looks good to me.

Jun

On Mon, Apr 22, 2024 at 12:08 PM Abhijeet Kumar wrote:
> Hi Jun,
>
> Please find my comments inline.
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi Jun,

Please find my comments inline.

On Thu, Apr 18, 2024 at 3:26 AM Jun Rao wrote:
> Hi, Abhijeet,
>
> Thanks for the reply.
>
> 1. I am wondering if we could achieve the same result by just lowering
> local.retention.ms and local.retention.bytes. This also allows the newly
> started follower to build up the local data before serving the consumer
> traffic.

I am not sure I fully followed this. Do you mean we could lower the local.retention (by size and time) so that there is little data on the leader's local storage, and the follower can quickly catch up with the leader?

In that case, we will need to set a small local retention across the brokers in the cluster. It will have the undesired effect of increased remote log fetches for serving consume requests, which can cause degradations. Also, this behaviour (of increased remote fetches) will happen on all brokers at all times, whereas the KIP restricts the behavior to newly bootstrapped brokers, and only until they fully build up their local logs as per the retention defined at the cluster level. (Deprioritization of the broker could help reduce the impact even further.)

> 2. Have you updated the KIP?

The KIP has been updated now.

> Thanks,
>
> Jun
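Christo's summary of the proposal's scope can be sketched as a small decision function. This is an illustrative sketch only (names and parameters are assumptions, not Kafka internals): only an empty replica of a tiered topic takes the new last-tiered-offset path; a replica that merely fell out of ISR but still has local data keeps today's protocol.

```python
def choose_fetch_start(tiered_enabled, has_local_segments,
                       local_log_end_offset,
                       earliest_pending_upload_offset,
                       leader_log_start_offset):
    """Pick where a follower should begin replicating from the leader."""
    if has_local_segments:
        # Replica that fell out of ISR but kept its data: today's
        # behaviour, resume from its own log end. Its state is not wiped.
        return local_log_end_offset
    if tiered_enabled:
        # Brand-new (or disk-failed) replica of a tiered topic: build aux
        # state for the tiered portion, then replicate from just past the
        # last-tiered-offset, shrinking time spent under-replicated.
        return earliest_pending_upload_offset
    # Empty replica of a non-tiered topic: start at the leader's log start.
    return leader_log_start_offset
```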
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi Luke,

Thanks for your comments. Please find my responses inline.

On Tue, Apr 9, 2024 at 2:08 PM Luke Chen wrote:
> Hi Abhijeet,
>
> Thanks for the KIP to improve the tiered storage feature!
>
> Questions:
> 1. We could also get the "pending-upload-offset" and epoch via remote log
> metadata, instead of adding a new API to fetch from the leader. Could you
> explain why you chose the latter approach, instead of the former?

The remote log metadata could be tracking overlapping log segments. The maximum offset across the log segments it may be tracking may not be the last-tiered-offset, because of truncations during unclean leader election. Remote log metadata alone is not sufficient to identify the last-tiered-offset or the pending-upload-offset. Only the leader knows the correct lineage of offsets that is required to identify the "pending-upload-offset". That is the reason we chose to add a new API to fetch this information from the leader.

> 2.
> > We plan to have a follow-up KIP that will address both the
> > deprioritization
> > of these brokers from leadership, as well as
> > for consumption (when fetching from followers is allowed).
>
> I agree with Jun that we might need to make it clear all possible drawbacks
> that could have. So, could we add the drawbacks that Jun mentioned about
> the performance issue when consumer fetch from follower?

Updated the KIP to call this out.

> 3. Could we add "Rejected Alternatives" section to the end of the KIP to
> add some of them?
> Like the "ListOffsetRequest" approach VS "Earliest-Pending-Upload-Offset"
> approach, or getting the "Earliest-Pending-Upload-Offset" from remote log
> metadata... etc.

Added the section on Rejected Alternatives.

> Thanks.
> Luke
>
> On Tue, Apr 9, 2024 at 2:25 PM Abhijeet Kumar wrote:
> > Hi Christo,
> >
> > Please find my comments inline.
> > > > On Fri, Apr 5, 2024 at 12:36 PM Christo Lolov > > wrote: > > > > > Hello Abhijeet and Jun, > > > > > > I have been mulling this KIP over a bit more in recent days! > > > > > > re: Jun > > > > > > I wasn't aware we apply 2.1 and 2.2 for reserving new timestamps - in > > > retrospect it should have been fairly obvious. I would need to go an > > update > > > KIP-1005 myself then, thank you for giving the useful reference! > > > > > > 4. I think Abhijeet wants to rebuild state from latest-tiered-offset > and > > > fetch from latest-tiered-offset + 1 only for new replicas (or replicas > > > which experienced a disk failure) to decrease the time a partition > spends > > > in under-replicated state. In other words, a follower which has just > > fallen > > > out of ISR, but has local data will continue using today's Tiered > Storage > > > replication protocol (i.e. fetching from earliest-local). I further > > believe > > > he has taken this approach so that local state of replicas which have > > just > > > fallen out of ISR isn't forcefully wiped thus leading to situation 1. > > > Abhijeet, have I understood (and summarised) what you are proposing > > > correctly? > > > > > > Yes, your understanding is correct. We want to limit the behavior > changes > > only to new replicas. > > > > > > > 5. I think in today's Tiered Storage we know the leader's > > log-start-offset > > > from the FetchResponse and we can learn its local-log-start-offset from > > the > > > ListOffsets by asking for earliest-local timestamp (-4). But granted, > > this > > > ought to be added as an additional API call in the KIP. > > > > > > > > Yes, I clarified this in my reply to Jun. I will add this missing detail > in > > the KIP. > > > > > > > re: Abhijeet > > > > > > 101. I am still a bit confused as to why you want to include a new > offset > > > (i.e. pending-upload-offset) when you yourself mention that you could > use > > > an already existing offset (i.e. last-tiered-offset + 1). 
In essence, > you > > > end your Motivation with "In this KIP, we will focus only on the > follower > > > fetch protocol using the *last-tiered-offset*" and then in the > following > > > sections you talk about pending-upload-offset. I understand this might > be > > > classified as an implementation detail, but if you introduce a new > offset > > > (i.e. pending-upload-offset) you have to make a change to the > ListOffsets > > > API (i.e. introduce -6) and thus document it in this KIP as such. > > However, > > > the last-tiered-offset ought to already be exposed as part of KIP-1005 > > > (under implementation). Am I misunderstanding something here? > > > > > > > I have tried to clarify this in my reply to Jun. > > > > > The follower needs to build the local data starting from the offset > > > Earliest-Pending-Upload-Offset. Hence it needs the offset and the > > > corresponding leader-epoch. > > > There are two ways to do this: > > >1. We add support in ListOffsetRequest to be able to fetch this > offset > > > (and leader epoch) from the leader. > > >2. Or, fetch the tiered-offset (which is already supported). From > this > >
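Abhijeet's argument about offset lineage can be made concrete with a toy example. This is a sketch under assumed data shapes (tuples, not the RemoteLogMetadataManager API): after an unclean leader election, segments uploaded by a fenced leader can overlap or extend past the current leader's log lineage, so a naive maximum over remote segment metadata overshoots the true last-tiered-offset, while filtering by the current leader-epoch chain (which only the leader reliably knows) does not. Real lineage checks also compare segment offset ranges against epoch boundaries; that detail is omitted here.

```python
# Each uploaded segment: (start_offset, end_offset, leader_epoch).
# The epoch-2 segment was uploaded by a leader that was later fenced;
# an unclean election truncated the log back to offset 200, and the new
# leader continues at epoch 3.
segments = [
    (0, 99, 1),
    (100, 199, 1),
    (150, 260, 2),  # stale upload from the fenced epoch-2 leader
]

# Current leader's epoch lineage: epoch -> start offset. Epoch 2 is gone.
leader_epoch_chain = {1: 0, 3: 200}

# Naive answer from metadata alone: 260, past the current lineage, so a
# follower trusting it would skip real data in [200, 260] of epoch 3.
naive_last_tiered = max(end for _start, end, _epoch in segments)

# Lineage-aware answer: only count segments from epochs in the chain.
lineage_last_tiered = max(end for _start, end, epoch in segments
                          if epoch in leader_epoch_chain)
```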
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi, Abhijeet, Thanks for the reply. 1. I am wondering if we could achieve the same result by just lowering local.retention.ms and local.retention.bytes. This also allows the newly started follower to build up the local data before serving the consumer traffic. 2. Have you updated the KIP? Thanks, Jun On Tue, Apr 9, 2024 at 3:36 AM Satish Duggana wrote: > +1 to Jun for adding the consumer fetching from a follower scenario > also to the existing section that talked about the drawback when a > node built with last-tiered-offset has become a leader. As Abhijeet > mentioned, we plan to have a follow-up KIP that will address these by > having a deprioritzation of these brokers. The deprioritization of > those brokers can be removed once they catchup until the local log > retention. > > Thanks, > Satish. > > On Tue, 9 Apr 2024 at 14:08, Luke Chen wrote: > > > > Hi Abhijeet, > > > > Thanks for the KIP to improve the tiered storage feature! > > > > Questions: > > 1. We could also get the "pending-upload-offset" and epoch via remote log > > metadata, instead of adding a new API to fetch from the leader. Could you > > explain why you choose the later approach, instead of the former? > > 2. > > > We plan to have a follow-up KIP that will address both the > > deprioritization > > of these brokers from leadership, as well as > > for consumption (when fetching from followers is allowed). > > > > I agree with Jun that we might need to make it clear all possible > drawbacks > > that could have. So, could we add the drawbacks that Jun mentioned about > > the performance issue when consumer fetch from follower? > > > > 3. Could we add "Rejected Alternatives" section to the end of the KIP to > > add some of them? > > Like the "ListOffsetRequest" approach VS "Earliest-Pending-Upload-Offset" > > approach, or getting the "Earliest-Pending-Upload-Offset" from remote log > > metadata... etc. > > > > Thanks. 
> > Luke > > > > > > On Tue, Apr 9, 2024 at 2:25 PM Abhijeet Kumar < > abhijeet.cse@gmail.com> > > wrote: > > > > > Hi Christo, > > > > > > Please find my comments inline. > > > > > > On Fri, Apr 5, 2024 at 12:36 PM Christo Lolov > > > wrote: > > > > > > > Hello Abhijeet and Jun, > > > > > > > > I have been mulling this KIP over a bit more in recent days! > > > > > > > > re: Jun > > > > > > > > I wasn't aware we apply 2.1 and 2.2 for reserving new timestamps - in > > > > retrospect it should have been fairly obvious. I would need to go an > > > update > > > > KIP-1005 myself then, thank you for giving the useful reference! > > > > > > > > 4. I think Abhijeet wants to rebuild state from latest-tiered-offset > and > > > > fetch from latest-tiered-offset + 1 only for new replicas (or > replicas > > > > which experienced a disk failure) to decrease the time a partition > spends > > > > in under-replicated state. In other words, a follower which has just > > > fallen > > > > out of ISR, but has local data will continue using today's Tiered > Storage > > > > replication protocol (i.e. fetching from earliest-local). I further > > > believe > > > > he has taken this approach so that local state of replicas which have > > > just > > > > fallen out of ISR isn't forcefully wiped thus leading to situation 1. > > > > Abhijeet, have I understood (and summarised) what you are proposing > > > > correctly? > > > > > > > > Yes, your understanding is correct. We want to limit the behavior > changes > > > only to new replicas. > > > > > > > > > > 5. I think in today's Tiered Storage we know the leader's > > > log-start-offset > > > > from the FetchResponse and we can learn its local-log-start-offset > from > > > the > > > > ListOffsets by asking for earliest-local timestamp (-4). But granted, > > > this > > > > ought to be added as an additional API call in the KIP. > > > > > > > > > > > Yes, I clarified this in my reply to Jun. I will add this missing > detail in > > > the KIP. 
> > > > > > > > > re: Abhijeet > > > > > > > > 101. I am still a bit confused as to why you want to include a new > offset > > > > (i.e. pending-upload-offset) when you yourself mention that you > could use > > > > an already existing offset (i.e. last-tiered-offset + 1). In > essence, you > > > > end your Motivation with "In this KIP, we will focus only on the > follower > > > > fetch protocol using the *last-tiered-offset*" and then in the > following > > > > sections you talk about pending-upload-offset. I understand this > might be > > > > classified as an implementation detail, but if you introduce a new > offset > > > > (i.e. pending-upload-offset) you have to make a change to the > ListOffsets > > > > API (i.e. introduce -6) and thus document it in this KIP as such. > > > However, > > > > the last-tiered-offset ought to already be exposed as part of > KIP-1005 > > > > (under implementation). Am I misunderstanding something here? > > > > > > > > > > I have tried to clarify this in my reply to Jun. > > > > > > > The follower needs to build the local data starting from the offset Earliest-Pending-Upload-Offset.
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
+1 to Jun for adding the consumer fetching from a follower scenario also to the existing section that talked about the drawback when a node built with last-tiered-offset has become a leader. As Abhijeet mentioned, we plan to have a follow-up KIP that will address these by having a deprioritization of these brokers. The deprioritization of those brokers can be removed once they catch up to the local log retention. Thanks, Satish. On Tue, 9 Apr 2024 at 14:08, Luke Chen wrote: > > Hi Abhijeet, > > Thanks for the KIP to improve the tiered storage feature! > > Questions: > 1. We could also get the "pending-upload-offset" and epoch via remote log > metadata, instead of adding a new API to fetch from the leader. Could you > explain why you choose the later approach, instead of the former? > 2. > > We plan to have a follow-up KIP that will address both the > deprioritization > of these brokers from leadership, as well as > for consumption (when fetching from followers is allowed). > > I agree with Jun that we might need to make it clear all possible drawbacks > that could have. So, could we add the drawbacks that Jun mentioned about > the performance issue when consumer fetch from follower? > > 3. Could we add "Rejected Alternatives" section to the end of the KIP to > add some of them? > Like the "ListOffsetRequest" approach VS "Earliest-Pending-Upload-Offset" > approach, or getting the "Earliest-Pending-Upload-Offset" from remote log > metadata... etc. > > Thanks. > Luke > > > On Tue, Apr 9, 2024 at 2:25 PM Abhijeet Kumar > wrote: > > > Hi Christo, > > > > Please find my comments inline. > > > > On Fri, Apr 5, 2024 at 12:36 PM Christo Lolov > > wrote: > > > > > Hello Abhijeet and Jun, > > > > > > I have been mulling this KIP over a bit more in recent days! > > > > > > re: Jun > > > > > > I wasn't aware we apply 2.1 and 2.2 for reserving new timestamps - in > > > retrospect it should have been fairly obvious.
I would need to go an > > update > > > KIP-1005 myself then, thank you for giving the useful reference! > > > > > > 4. I think Abhijeet wants to rebuild state from latest-tiered-offset and > > > fetch from latest-tiered-offset + 1 only for new replicas (or replicas > > > which experienced a disk failure) to decrease the time a partition spends > > > in under-replicated state. In other words, a follower which has just > > fallen > > > out of ISR, but has local data will continue using today's Tiered Storage > > > replication protocol (i.e. fetching from earliest-local). I further > > believe > > > he has taken this approach so that local state of replicas which have > > just > > > fallen out of ISR isn't forcefully wiped thus leading to situation 1. > > > Abhijeet, have I understood (and summarised) what you are proposing > > > correctly? > > > > > > Yes, your understanding is correct. We want to limit the behavior changes > > only to new replicas. > > > > > > > 5. I think in today's Tiered Storage we know the leader's > > log-start-offset > > > from the FetchResponse and we can learn its local-log-start-offset from > > the > > > ListOffsets by asking for earliest-local timestamp (-4). But granted, > > this > > > ought to be added as an additional API call in the KIP. > > > > > > > > Yes, I clarified this in my reply to Jun. I will add this missing detail in > > the KIP. > > > > > > > re: Abhijeet > > > > > > 101. I am still a bit confused as to why you want to include a new offset > > > (i.e. pending-upload-offset) when you yourself mention that you could use > > > an already existing offset (i.e. last-tiered-offset + 1). In essence, you > > > end your Motivation with "In this KIP, we will focus only on the follower > > > fetch protocol using the *last-tiered-offset*" and then in the following > > > sections you talk about pending-upload-offset. I understand this might be > > > classified as an implementation detail, but if you introduce a new offset > > > (i.e. 
pending-upload-offset) you have to make a change to the ListOffsets > > > API (i.e. introduce -6) and thus document it in this KIP as such. > > However, > > > the last-tiered-offset ought to already be exposed as part of KIP-1005 > > > (under implementation). Am I misunderstanding something here? > > > > > > > I have tried to clarify this in my reply to Jun. > > > > > The follower needs to build the local data starting from the offset > > > Earliest-Pending-Upload-Offset. Hence it needs the offset and the > > > corresponding leader-epoch. > > > There are two ways to do this: > > >1. We add support in ListOffsetRequest to be able to fetch this offset > > > (and leader epoch) from the leader. > > >2. Or, fetch the tiered-offset (which is already supported). From this > > > offset, we can get the Earliest-Pending-Upload-Offset. We can just add 1 > > to > > > the tiered-offset. > > > However, we still need the leader epoch for offset, since there is > > > no guarantee that the leader epoch for Earliest-Pending-Upload-Offset > > will > be the same as the leader epoch for tiered-offset.
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi Abhijeet, Thanks for the KIP to improve the tiered storage feature! Questions: 1. We could also get the "pending-upload-offset" and epoch via remote log metadata, instead of adding a new API to fetch from the leader. Could you explain why you chose the latter approach, instead of the former? 2. > We plan to have a follow-up KIP that will address both the deprioritization of these brokers from leadership, as well as for consumption (when fetching from followers is allowed). I agree with Jun that we might need to make it clear all possible drawbacks this could have. So, could we add the drawbacks that Jun mentioned about the performance issue when consumers fetch from a follower? 3. Could we add a "Rejected Alternatives" section to the end of the KIP to add some of them? Like the "ListOffsetRequest" approach VS "Earliest-Pending-Upload-Offset" approach, or getting the "Earliest-Pending-Upload-Offset" from remote log metadata... etc. Thanks. Luke On Tue, Apr 9, 2024 at 2:25 PM Abhijeet Kumar wrote: > Hi Christo, > > Please find my comments inline. > > On Fri, Apr 5, 2024 at 12:36 PM Christo Lolov > > wrote: > > > Hello Abhijeet and Jun, > > > > > > I have been mulling this KIP over a bit more in recent days! > > > > > > re: Jun > > > > > > I wasn't aware we apply 2.1 and 2.2 for reserving new timestamps - in > > > retrospect it should have been fairly obvious. I would need to go an > > update > > > KIP-1005 myself then, thank you for giving the useful reference! > > > > > > 4. I think Abhijeet wants to rebuild state from latest-tiered-offset and > > > fetch from latest-tiered-offset + 1 only for new replicas (or replicas > > > which experienced a disk failure) to decrease the time a partition spends > > > in under-replicated state. In other words, a follower which has just > > fallen > > > out of ISR, but has local data will continue using today's Tiered Storage > > > replication protocol (i.e. fetching from earliest-local).
I further > believe > > he has taken this approach so that local state of replicas which have > just > > fallen out of ISR isn't forcefully wiped thus leading to situation 1. > > Abhijeet, have I understood (and summarised) what you are proposing > > correctly? > > > > Yes, your understanding is correct. We want to limit the behavior changes > only to new replicas. > > > > 5. I think in today's Tiered Storage we know the leader's > log-start-offset > > from the FetchResponse and we can learn its local-log-start-offset from > the > > ListOffsets by asking for earliest-local timestamp (-4). But granted, > this > > ought to be added as an additional API call in the KIP. > > > > > Yes, I clarified this in my reply to Jun. I will add this missing detail in > the KIP. > > > > re: Abhijeet > > > > 101. I am still a bit confused as to why you want to include a new offset > > (i.e. pending-upload-offset) when you yourself mention that you could use > > an already existing offset (i.e. last-tiered-offset + 1). In essence, you > > end your Motivation with "In this KIP, we will focus only on the follower > > fetch protocol using the *last-tiered-offset*" and then in the following > > sections you talk about pending-upload-offset. I understand this might be > > classified as an implementation detail, but if you introduce a new offset > > (i.e. pending-upload-offset) you have to make a change to the ListOffsets > > API (i.e. introduce -6) and thus document it in this KIP as such. > However, > > the last-tiered-offset ought to already be exposed as part of KIP-1005 > > (under implementation). Am I misunderstanding something here? > > > > I have tried to clarify this in my reply to Jun. > > > The follower needs to build the local data starting from the offset > > Earliest-Pending-Upload-Offset. Hence it needs the offset and the > > corresponding leader-epoch. > > There are two ways to do this: > >1. 
We add support in ListOffsetRequest to be able to fetch this offset > > (and leader epoch) from the leader. > >2. Or, fetch the tiered-offset (which is already supported). From this > > offset, we can get the Earliest-Pending-Upload-Offset. We can just add 1 > to > > the tiered-offset. > > However, we still need the leader epoch for offset, since there is > > no guarantee that the leader epoch for Earliest-Pending-Upload-Offset > will > > be the same as the > > leader epoch for tiered-offset. We may need another API call to the > > leader for this. > > I prefer the first approach. The only problem with the first approach is > > that it introduces one more offset. The second approach avoids this > problem > > but is a little complicated. > > > > > > > Best, > > Christo > > > > On Thu, 4 Apr 2024 at 19:37, Jun Rao wrote: > > > Hi, Abhijeet, > > > > > > Thanks for the KIP. Left a few comments. > > > > > > 1. "A drawback of using the last-tiered-offset is that this new > follower > > > would possess only a limited number of locally stored segments. Should > it > > > ascend to the role of leader, there is a risk of needing to fetch these segments from the remote storage, potentially impacting broker performance."
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi Christo, Please find my comments inline. On Fri, Apr 5, 2024 at 12:36 PM Christo Lolov wrote: > Hello Abhijeet and Jun, > > I have been mulling this KIP over a bit more in recent days! > > re: Jun > > I wasn't aware we apply 2.1 and 2.2 for reserving new timestamps - in > retrospect it should have been fairly obvious. I would need to go an update > KIP-1005 myself then, thank you for giving the useful reference! > > 4. I think Abhijeet wants to rebuild state from latest-tiered-offset and > fetch from latest-tiered-offset + 1 only for new replicas (or replicas > which experienced a disk failure) to decrease the time a partition spends > in under-replicated state. In other words, a follower which has just fallen > out of ISR, but has local data will continue using today's Tiered Storage > replication protocol (i.e. fetching from earliest-local). I further believe > he has taken this approach so that local state of replicas which have just > fallen out of ISR isn't forcefully wiped thus leading to situation 1. > Abhijeet, have I understood (and summarised) what you are proposing > correctly? > > Yes, your understanding is correct. We want to limit the behavior changes only to new replicas. > 5. I think in today's Tiered Storage we know the leader's log-start-offset > from the FetchResponse and we can learn its local-log-start-offset from the > ListOffsets by asking for earliest-local timestamp (-4). But granted, this > ought to be added as an additional API call in the KIP. > > Yes, I clarified this in my reply to Jun. I will add this missing detail in the KIP. > re: Abhijeet > > 101. I am still a bit confused as to why you want to include a new offset > (i.e. pending-upload-offset) when you yourself mention that you could use > an already existing offset (i.e. last-tiered-offset + 1). 
In essence, you > end your Motivation with "In this KIP, we will focus only on the follower > fetch protocol using the *last-tiered-offset*" and then in the following > sections you talk about pending-upload-offset. I understand this might be > classified as an implementation detail, but if you introduce a new offset > (i.e. pending-upload-offset) you have to make a change to the ListOffsets > API (i.e. introduce -6) and thus document it in this KIP as such. However, > the last-tiered-offset ought to already be exposed as part of KIP-1005 > (under implementation). Am I misunderstanding something here? > I have tried to clarify this in my reply to Jun. > The follower needs to build the local data starting from the offset > Earliest-Pending-Upload-Offset. Hence it needs the offset and the > corresponding leader-epoch. > There are two ways to do this: >1. We add support in ListOffsetRequest to be able to fetch this offset > (and leader epoch) from the leader. >2. Or, fetch the tiered-offset (which is already supported). From this > offset, we can get the Earliest-Pending-Upload-Offset. We can just add 1 to > the tiered-offset. > However, we still need the leader epoch for offset, since there is > no guarantee that the leader epoch for Earliest-Pending-Upload-Offset will > be the same as the > leader epoch for tiered-offset. We may need another API call to the > leader for this. > I prefer the first approach. The only problem with the first approach is > that it introduces one more offset. The second approach avoids this problem > but is a little complicated. > > Best, > Christo > > On Thu, 4 Apr 2024 at 19:37, Jun Rao wrote: > > > Hi, Abhijeet, > > > > Thanks for the KIP. Left a few comments. > > > > 1. "A drawback of using the last-tiered-offset is that this new follower > > would possess only a limited number of locally stored segments. 
Should it > > ascend to the role of leader, there is a risk of needing to fetch these > > segments from the remote storage, potentially impacting broker > > performance." > > Since we support consumers fetching from followers, this is a potential > > issue on the follower side too. In theory, it's possible for a segment to > > be tiered immediately after rolling. In that case, there could be very > > little data after last-tiered-offset. It would be useful to think through > > how to address this issue. > > > > 2. ListOffsetsRequest: > > 2.1 Typically, we need to bump up the version of the request if we add a > > new value for timestamp. See > > > > > https://github.com/apache/kafka/pull/10760/files#diff-fac7080d67da905a80126d58fc1745c9a1409de7ef7d093c2ac66a888b134633 > > . > > 2.2 Since this changes the inter broker request protocol, it would be > > useful to have a section on upgrade (e.g. new IBP/metadata.version). > > > > 3. "Instead of fetching Earliest-Pending-Upload-Offset, it could fetch > the > > last-tiered-offset from the leader, and make a separate leader call to > > fetch leader epoch for the following offset." > > Why do we need to make a separate call for the leader epoch? > > ListOffsetsResponse include both the offset and the corresponding epoch.
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi Jun, Thank you for taking the time to review the KIP. Please find my comments inline. On Fri, Apr 5, 2024 at 12:09 AM Jun Rao wrote: > Hi, Abhijeet, > > Thanks for the KIP. Left a few comments. > > 1. "A drawback of using the last-tiered-offset is that this new follower > would possess only a limited number of locally stored segments. Should it > ascend to the role of leader, there is a risk of needing to fetch these > segments from the remote storage, potentially impacting broker > performance." > Since we support consumers fetching from followers, this is a potential > issue on the follower side too. In theory, it's possible for a segment to > be tiered immediately after rolling. In that case, there could be very > little data after last-tiered-offset. It would be useful to think through > how to address this issue. > We plan to have a follow-up KIP that will address both the deprioritization of these brokers from leadership, as well as for consumption (when fetching from followers is allowed). > > 2. ListOffsetsRequest: > 2.1 Typically, we need to bump up the version of the request if we add a > new value for timestamp. See > > https://github.com/apache/kafka/pull/10760/files#diff-fac7080d67da905a80126d58fc1745c9a1409de7ef7d093c2ac66a888b134633 > . > Yes, let me update the KIP to include this change. We will need a new timestamp corresponding to Earliest-Pending-Upload-Offset. > 2.2 Since this changes the inter broker request protocol, it would be > useful to have a section on upgrade (e.g. new IBP/metadata.version). > > Make sense. I will update the KIP to capture this. > 3. "Instead of fetching Earliest-Pending-Upload-Offset, it could fetch the > last-tiered-offset from the leader, and make a separate leader call to > fetch leader epoch for the following offset." > Why do we need to make a separate call for the leader epoch? > ListOffsetsResponse include both the offset and the corresponding epoch. > > I understand there is some confusion here. 
Let me try to explain this. The follower needs to build the local data starting from the offset Earliest-Pending-Upload-Offset. Hence it needs the offset and the corresponding leader-epoch. There are two ways to do this: 1. We add support in ListOffsetRequest to be able to fetch this offset (and leader epoch) from the leader. 2. Or, fetch the tiered-offset (which is already supported). From this offset, we can get the Earliest-Pending-Upload-Offset. We can just add 1 to the tiered-offset. However, we still need the leader epoch for offset, since there is no guarantee that the leader epoch for Earliest-Pending-Upload-Offset will be the same as the leader epoch for tiered-offset. We may need another API call to the leader for this. I prefer the first approach. The only problem with the first approach is that it introduces one more offset. The second approach avoids this problem but is a little complicated. > 4. "Check if the follower replica is empty and if the feature to use > last-tiered-offset is enabled." > Why do we need to check if the follower replica is empty? > > We want to limit this new behavior only to new replicas. Replicas that become out of ISR are excluded from this behavior change. Those will continue with the existing behavior. > 5. It can be confirmed by checking if the leader's Log-Start-Offset is the > same as the Leader's Local-Log-Start-Offset. > How does the follower know Local-Log-Start-Offset? > Missed this detail. The follower will need to call the leader API to fetch the EarliestLocal offset for this. > Jun > > On Sat, Mar 30, 2024 at 5:51 AM Abhijeet Kumar > > wrote: > > > Hi Christo, > > > > Thanks for reviewing the KIP. > > > > The follower needs the earliest-pending-upload-offset (and the > > corresponding leader epoch) from the leader. > > This is the first offset the follower will have locally. > > > > Regards, > > Abhijeet. > > > > > > > > On Fri, Mar 29, 2024 at 1:14 PM Christo Lolov > > wrote: > > > Heya!
> > > > > > First of all, thank you very much for the proposal, you have explained > > the > > > problem you want solved very well - I think a faster bootstrap of an > > empty > > > replica is definitely an improvement! > > > > > > For my understanding, which concrete offset do you want the leader to > > give > > > back to a follower - earliest-pending-upload-offset or the > > > latest-tiered-offset? If it is the second, then I believe KIP-1005 > ought > > to > > > already be exposing that offset as part of the ListOffsets API, no? > > > > > > Best, > > > Christo > > > > > > On Wed, 27 Mar 2024 at 18:23, Abhijeet Kumar < > abhijeet.cse@gmail.com > > > > > > wrote: > > > > Hi All, > > > > > > > > I have created KIP-1023 to introduce follower fetch from tiered > offset. > > > > This feature will be helpful in significantly reducing Kafka > > > > rebalance/rebuild times when the cluster is enabled with tiered > > storage. > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Follower+fetch+from+tiered+offset
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hello Abhijeet and Jun, I have been mulling this KIP over a bit more in recent days! re: Jun I wasn't aware we apply 2.1 and 2.2 for reserving new timestamps - in retrospect it should have been fairly obvious. I would need to go and update KIP-1005 myself then, thank you for giving the useful reference! 4. I think Abhijeet wants to rebuild state from latest-tiered-offset and fetch from latest-tiered-offset + 1 only for new replicas (or replicas which experienced a disk failure) to decrease the time a partition spends in under-replicated state. In other words, a follower which has just fallen out of ISR, but has local data will continue using today's Tiered Storage replication protocol (i.e. fetching from earliest-local). I further believe he has taken this approach so that local state of replicas which have just fallen out of ISR isn't forcefully wiped thus leading to situation 1. Abhijeet, have I understood (and summarised) what you are proposing correctly? 5. I think in today's Tiered Storage we know the leader's log-start-offset from the FetchResponse and we can learn its local-log-start-offset from the ListOffsets by asking for earliest-local timestamp (-4). But granted, this ought to be added as an additional API call in the KIP. re: Abhijeet 101. I am still a bit confused as to why you want to include a new offset (i.e. pending-upload-offset) when you yourself mention that you could use an already existing offset (i.e. last-tiered-offset + 1). In essence, you end your Motivation with "In this KIP, we will focus only on the follower fetch protocol using the *last-tiered-offset*" and then in the following sections you talk about pending-upload-offset. I understand this might be classified as an implementation detail, but if you introduce a new offset (i.e. pending-upload-offset) you have to make a change to the ListOffsets API (i.e. introduce -6) and thus document it in this KIP as such.
However, the last-tiered-offset ought to already be exposed as part of KIP-1005 (under implementation). Am I misunderstanding something here? Best, Christo On Thu, 4 Apr 2024 at 19:37, Jun Rao wrote: > Hi, Abhijeet, > > Thanks for the KIP. Left a few comments. > > 1. "A drawback of using the last-tiered-offset is that this new follower > would possess only a limited number of locally stored segments. Should it > ascend to the role of leader, there is a risk of needing to fetch these > segments from the remote storage, potentially impacting broker > performance." > Since we support consumers fetching from followers, this is a potential > issue on the follower side too. In theory, it's possible for a segment to > be tiered immediately after rolling. In that case, there could be very > little data after last-tiered-offset. It would be useful to think through > how to address this issue. > > 2. ListOffsetsRequest: > 2.1 Typically, we need to bump up the version of the request if we add a > new value for timestamp. See > > https://github.com/apache/kafka/pull/10760/files#diff-fac7080d67da905a80126d58fc1745c9a1409de7ef7d093c2ac66a888b134633 > . > 2.2 Since this changes the inter broker request protocol, it would be > useful to have a section on upgrade (e.g. new IBP/metadata.version). > > 3. "Instead of fetching Earliest-Pending-Upload-Offset, it could fetch the > last-tiered-offset from the leader, and make a separate leader call to > fetch leader epoch for the following offset." > Why do we need to make a separate call for the leader epoch? > ListOffsetsResponse include both the offset and the corresponding epoch. > > 4. "Check if the follower replica is empty and if the feature to use > last-tiered-offset is enabled." > Why do we need to check if the follower replica is empty? > > 5. It can be confirmed by checking if the leader's Log-Start-Offset is the > same as the Leader's Local-Log-Start-Offset. > How does the follower know Local-Log-Start-Offset? 
> > Jun > > On Sat, Mar 30, 2024 at 5:51 AM Abhijeet Kumar > > wrote: > > > Hi Christo, > > > > Thanks for reviewing the KIP. > > > > The follower needs the earliest-pending-upload-offset (and the > > corresponding leader epoch) from the leader. > > This is the first offset the follower will have locally. > > > > Regards, > > Abhijeet. > > > > > > > > On Fri, Mar 29, 2024 at 1:14 PM Christo Lolov > > wrote: > > > Heya! > > > > > > First of all, thank you very much for the proposal, you have explained > > the > > > problem you want solved very well - I think a faster bootstrap of an > > empty > > > replica is definitely an improvement! > > > > > > For my understanding, which concrete offset do you want the leader to > > give > > > back to a follower - earliest-pending-upload-offset or the > > > latest-tiered-offset? If it is the second, then I believe KIP-1005 > ought > > to > > > already be exposing that offset as part of the ListOffsets API, no? > > > > > > Best, > > > Christo > > > > > > On Wed, 27 Mar 2024 at 18:23, Abhijeet Kumar < > abhijeet.cse@gmail.com
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi, Abhijeet, Thanks for the KIP. Left a few comments. 1. "A drawback of using the last-tiered-offset is that this new follower would possess only a limited number of locally stored segments. Should it ascend to the role of leader, there is a risk of needing to fetch these segments from the remote storage, potentially impacting broker performance." Since we support consumers fetching from followers, this is a potential issue on the follower side too. In theory, it's possible for a segment to be tiered immediately after rolling. In that case, there could be very little data after last-tiered-offset. It would be useful to think through how to address this issue. 2. ListOffsetsRequest: 2.1 Typically, we need to bump up the version of the request if we add a new value for timestamp. See https://github.com/apache/kafka/pull/10760/files#diff-fac7080d67da905a80126d58fc1745c9a1409de7ef7d093c2ac66a888b134633 . 2.2 Since this changes the inter broker request protocol, it would be useful to have a section on upgrade (e.g. new IBP/metadata.version). 3. "Instead of fetching Earliest-Pending-Upload-Offset, it could fetch the last-tiered-offset from the leader, and make a separate leader call to fetch leader epoch for the following offset." Why do we need to make a separate call for the leader epoch? ListOffsetsResponse includes both the offset and the corresponding epoch. 4. "Check if the follower replica is empty and if the feature to use last-tiered-offset is enabled." Why do we need to check if the follower replica is empty? 5. It can be confirmed by checking if the leader's Log-Start-Offset is the same as the Leader's Local-Log-Start-Offset. How does the follower know Local-Log-Start-Offset? Jun On Sat, Mar 30, 2024 at 5:51 AM Abhijeet Kumar wrote: > Hi Christo, > > Thanks for reviewing the KIP. > > The follower needs the earliest-pending-upload-offset (and the > corresponding leader epoch) from the leader. > This is the first offset the follower will have locally.
> > Regards, > Abhijeet. > > > > On Fri, Mar 29, 2024 at 1:14 PM Christo Lolov > wrote: > > > Heya! > > > > First of all, thank you very much for the proposal, you have explained > the > > problem you want solved very well - I think a faster bootstrap of an > empty > > replica is definitely an improvement! > > > > For my understanding, which concrete offset do you want the leader to > give > > back to a follower - earliest-pending-upload-offset or the > > latest-tiered-offset? If it is the second, then I believe KIP-1005 ought > to > > already be exposing that offset as part of the ListOffsets API, no? > > > > Best, > > Christo > > > > On Wed, 27 Mar 2024 at 18:23, Abhijeet Kumar > > > wrote: > > > > > Hi All, > > > > > > I have created KIP-1023 to introduce follower fetch from tiered offset. > > > This feature will be helpful in significantly reducing Kafka > > > rebalance/rebuild times when the cluster is enabled with tiered > storage. > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Follower+fetch+from+tiered+offset > > > > > > Feedback and suggestions are welcome. > > > > > > Regards, > > > Abhijeet. > > > > > >
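Jun's points 4 and 5 above amount to a gating check on the follower side. As a minimal sketch of one plausible reading of those checks (Python, with illustrative names only - not Kafka's actual Java API): an empty replica takes the new path only when the feature is enabled and the leader actually has tiered data, which can be inferred from the leader's log-start-offset being behind its local-log-start-offset.

```python
def should_bootstrap_from_tiered_offset(follower_log_end_offset,
                                        feature_enabled,
                                        leader_log_start_offset,
                                        leader_local_log_start_offset):
    """Gate for the new replication path, per the checks discussed above.

    Only a brand-new (empty) replica takes the new path, and only when the
    leader actually has tiered data -- inferred, as in point 5, from the
    leader's log-start-offset being strictly behind its
    local-log-start-offset. Replicas that merely fell out of ISR have local
    data and keep today's behavior.
    """
    replica_is_empty = follower_log_end_offset == 0
    leader_has_tiered_data = leader_log_start_offset < leader_local_log_start_offset
    return replica_is_empty and feature_enabled and leader_has_tiered_data
```

If log-start-offset equals local-log-start-offset, nothing has been tiered yet, so there is no point starting the follower from a tiered offset.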
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi Christo, Thanks for reviewing the KIP. The follower needs the earliest-pending-upload-offset (and the corresponding leader epoch) from the leader. This is the first offset the follower will have locally. Regards, Abhijeet. On Fri, Mar 29, 2024 at 1:14 PM Christo Lolov wrote: > Heya! > > First of all, thank you very much for the proposal, you have explained the > problem you want solved very well - I think a faster bootstrap of an empty > replica is definitely an improvement! > > For my understanding, which concrete offset do you want the leader to give > back to a follower - earliest-pending-upload-offset or the > latest-tiered-offset? If it is the second, then I believe KIP-1005 ought to > already be exposing that offset as part of the ListOffsets API, no? > > Best, > Christo > > On Wed, 27 Mar 2024 at 18:23, Abhijeet Kumar > wrote: > > > Hi All, > > > > I have created KIP-1023 to introduce follower fetch from tiered offset. > > This feature will be helpful in significantly reducing Kafka > > rebalance/rebuild times when the cluster is enabled with tiered storage. > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Follower+fetch+from+tiered+offset > > > > Feedback and suggestions are welcome. > > > > Regards, > > Abhijeet. > > >
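The point that earliest-pending-upload-offset is simply last-tiered-offset + 1, while its leader epoch may still differ, can be illustrated with a toy leader-epoch lookup (Python; the chain structure and names are illustrative, not Kafka's actual leader-epoch cache). When the tiered offset happens to be the last offset of an epoch, the very next offset already belongs to a newer epoch:

```python
def epoch_for_offset(epoch_chain, offset):
    """Return the leader epoch in effect at `offset`.

    `epoch_chain` is a sorted list of (start_offset, epoch) pairs, each
    epoch covering offsets from its start up to the next entry's start.
    """
    result = None
    for start_offset, epoch in epoch_chain:
        if start_offset <= offset:
            result = epoch
        else:
            break
    return result

# Epoch 1 begins at offset 100 and epoch 2 at offset 101, so offset 100
# (the last tiered offset here) and offset 101 (the earliest pending
# upload offset) fall in different epochs.
chain = [(0, 0), (100, 1), (101, 2)]
last_tiered_offset = 100
earliest_pending_upload_offset = last_tiered_offset + 1
```

Here `epoch_for_offset(chain, 100)` and `epoch_for_offset(chain, 101)` disagree, which is why deriving the offset alone is not enough under the second approach, and why a single ListOffsets response carrying both offset and epoch (the first approach) is simpler.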
Re: [DISCUSS] KIP-1023: Follower fetch from tiered offset
Heya! First of all, thank you very much for the proposal, you have explained the problem you want solved very well - I think a faster bootstrap of an empty replica is definitely an improvement! For my understanding, which concrete offset do you want the leader to give back to a follower - earliest-pending-upload-offset or the latest-tiered-offset? If it is the second, then I believe KIP-1005 ought to already be exposing that offset as part of the ListOffsets API, no? Best, Christo On Wed, 27 Mar 2024 at 18:23, Abhijeet Kumar wrote: > Hi All, > > I have created KIP-1023 to introduce follower fetch from tiered offset. > This feature will be helpful in significantly reducing Kafka > rebalance/rebuild times when the cluster is enabled with tiered storage. > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Follower+fetch+from+tiered+offset > > Feedback and suggestions are welcome. > > Regards, > Abhijeet. >
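For reference, the ListOffsets sentinel timestamps being discussed can be sketched as follows (Python). The values -1 through -4 match what is mentioned in this thread and in Kafka today; -5 for latest-tiered is an assumption based on KIP-1005 (still under implementation), and -6 is only the value this KIP proposes, not a released one:

```python
# Sentinel "timestamp" values for the ListOffsets API. -1..-4 are taken
# from this thread / current Kafka; -5 (latest-tiered, KIP-1005) is an
# assumed value, and -6 is merely the proposal under discussion.
LATEST_TIMESTAMP = -1
EARLIEST_TIMESTAMP = -2
MAX_TIMESTAMP = -3
EARLIEST_LOCAL_TIMESTAMP = -4
LATEST_TIERED_TIMESTAMP = -5
EARLIEST_PENDING_UPLOAD_TIMESTAMP = -6

def resolve_offset(sentinel, log_state):
    """Toy resolution of a sentinel against a dict describing a tiered log."""
    return {
        EARLIEST_TIMESTAMP: log_state["log_start_offset"],
        EARLIEST_LOCAL_TIMESTAMP: log_state["local_log_start_offset"],
        LATEST_TIERED_TIMESTAMP: log_state["last_tiered_offset"],
        # The offset this KIP wants a new, empty follower to start from:
        EARLIEST_PENDING_UPLOAD_TIMESTAMP: log_state["last_tiered_offset"] + 1,
        LATEST_TIMESTAMP: log_state["log_end_offset"],
    }[sentinel]

log_state = {"log_start_offset": 0, "local_log_start_offset": 120,
             "last_tiered_offset": 150, "log_end_offset": 200}
```

If the last-tiered-offset sentinel is indeed exposed by KIP-1005, this also shows why the KIP could in principle reuse it and add one, at the cost of the extra epoch lookup discussed earlier in the thread.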
[DISCUSS] KIP-1023: Follower fetch from tiered offset
Hi All, I have created KIP-1023 to introduce follower fetch from tiered offset. This feature will be helpful in significantly reducing Kafka rebalance/rebuild times when the cluster is enabled with tiered storage. https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Follower+fetch+from+tiered+offset Feedback and suggestions are welcome. Regards, Abhijeet.