Hi All:

I updated the KIP content according to Kamal and Haiying's discussion:

1. Explicitly emphasized that this is an optional, topic-level feature intended for users who prioritize cost.
2. Added the cost-saving calculation example (a rough sketch of the arithmetic is also included below).
3. Added details about the operational drawback of this feature: extra disk capacity may be needed during a prolonged remote storage outage.
4. Added the scenarios where enabling the feature may not be very suitable or beneficial, such as topics whose remote:local retention ratio is very large.
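For reference, here is a small, self-contained Python sketch of the kind of calculation behind that example. The throughput figure and the ~0.02 USD per GiB-month object-storage price are illustrative assumptions rather than numbers taken from the KIP or the PR, and the helper name remote_savings exists only for this sketch:

    def remote_savings(write_gib_per_day, local_retention_days,
                       total_retention_days, price_per_gib_month=0.02):
        """Estimate the steady-state remote footprint with eager vs. delayed upload.

        Eager upload keeps roughly total_retention_days worth of data in the
        remote tier; delayed upload skips the window still covered by local
        retention, leaving roughly total_retention_days - local_retention_days.
        """
        eager_gib = write_gib_per_day * total_retention_days
        delayed_gib = write_gib_per_day * max(total_retention_days - local_retention_days, 0.0)
        saved_gib = eager_gib - delayed_gib
        return {
            "eager_remote_gib": eager_gib,
            "delayed_remote_gib": delayed_gib,
            "saved_fraction": saved_gib / eager_gib if eager_gib else 0.0,
            "saved_usd_per_month": saved_gib * price_per_gib_month,
        }

    # 1 day local + 2 days total retention -> remote footprint shrinks by ~50%
    print(remote_savings(1000, local_retention_days=1, total_retention_days=2))
    # 1 day local + 7 days total retention -> only ~1/7 (~14%), the "large remote:local ratio" case
    print(remote_savings(1000, local_retention_days=1, total_retention_days=7))

The saved fraction is simply local retention divided by total retention, which is why the benefit shrinks quickly as the remote:local ratio grows.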
Thanks again for joining the discussion.

Regards
Jian

On Tue, Dec 2, 2025 at 8:27 PM jian fu <[email protected]> wrote:

> Hi Kamal:
>
> I think I understand what you mean now. I've updated the picture in the
> link (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
> Could you help double-check whether we've reached the same understanding?
> In short, the drawback of this KIP is that, during a long remote storage
> outage, it will occupy more disk; the maximum extra usage is the redundant
> part we are saving. After the outage is recovered, usage comes back to
> where it started.
> Please correct me if my understanding is wrong! Thanks again.
>
> Regards
> Jian
>
> On Tue, Dec 2, 2025 at 7:29 PM Kamal Chandraprakash
> <[email protected]> wrote:
>
>> The already uploaded segments are eligible for deletion from the broker.
>> So, when remote storage is down, those segments can be deleted as per
>> the local retention settings and new segments can occupy that space.
>> This provides more time for the Admin to act when remote storage is down
>> for a longer time.
>>
>> This is from a reliability perspective.
>>
>> On Tue, Dec 2, 2025 at 4:47 PM jian fu <[email protected]> wrote:
>>
>>> Hi Kamal and Haiying Cai:
>>>
>>> You may have noticed that my Kafka clusters are set to 1 day local +
>>> 3-7 days remote, whereas Haiying Cai's configuration is 3 hours local +
>>> 3 days remote.
>>>
>>> Let me explain more about my configuration. I try to avoid the latency
>>> of lagging consumers reading from remote storage. Some applications may
>>> run into unexpected issues, and we need to give them enough time to
>>> recover; during that period we don't want consumers reading from the
>>> remote tier and hurting the whole Kafka cluster. So one day of local
>>> retention is our expectation.
>>>
>>> I saw one statement in Haiying Cai's KIP-1248:
>>> "Currently, when a new consumer or a fallen-off consumer requires
>>> fetching messages from a while ago, and those messages are no longer
>>> present in the Kafka broker's local storage, the broker must download
>>> the message from the remote tiered storage and subsequently transfer
>>> the data back to the consumer."
>>> Extending the local retention time is how we try to avoid that issue
>>> (here we don't consider the case where a new consumer consumes from
>>> earliest; it does not happen often in our cases).
>>>
>>> So, based on my configuration, there is one day's worth of duplicated
>>> segments wasted in remote storage. We don't use them for real-time
>>> analytics or care about fast reboots or anything else, so I proposed
>>> this KIP as a topic-level optional feature to help us reduce waste and
>>> save money.
>>>
>>> Regards
>>> Jian
>>>
>>> On Tue, Dec 2, 2025 at 6:42 PM jian fu <[email protected]> wrote:
>>>
>>>> Hi Kamal:
>>>>
>>>> Thanks for joining this discussion. Let me try to clarify my
>>>> understanding of your good questions:
>>>>
>>>> 1. Kamal: Do you also have to update the RemoteCopy lag segments and
>>>> bytes metric?
>>>> Jian: The code just delays the upload time for local segments, so it
>>>> seems there is no need to change any lag segments or metrics. Right?
>>>>
>>>> 2. Kamal: As Haiying mentioned, the segments get eventually uploaded
>>>> to remote so not sure about the benefit of this proposal. And, remote
>>>> storage cost is considered as low when compared to broker local-disk.
>>>> Jian: The cost benefit is about the total size occupied. Take AWS S3
>>>> as an example: the tiered price is about 0.02 USD per GB (you can
>>>> refer to https://calculator.aws/#/createCalculator/S3). It is cheaper
>>>> than local disk. As I mentioned, the money saved depends on the ratio
>>>> of local vs. remote retention time. If you set the remote retention to
>>>> a long time, the benefit is small; it is more about avoiding waste
>>>> than saving cost. That is why I made it a topic-level optional
>>>> configuration instead of a default feature.
>>>>
>>>> 3. Kamal: It provides some cushion during third-party object storage
>>>> downtime.
>>>> Jian: I drew a picture to try to understand the logic
>>>> (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
>>>> You can help check whether my understanding is right. It seemed to me
>>>> that there is no difference between them, so for this question maybe
>>>> we need to discuss more. The only difference may be that we
>>>> temporarily use a little more local disk due to the delayed upload to
>>>> remote. In the original proposal I wanted to upload N-1 segments, but
>>>> it seems the value of that is not much.
>>>>
>>>> BTW, I want to clarify one basic rule: this feature doesn't change the
>>>> default behavior, and the amount saved is not a very big value in all
>>>> cases. It is suitable for the subset of topics that have a low
>>>> remote/local retention ratio, such as 7 days/1 day or 3 days/1 day.
>>>>
>>>> Finally, thanks again for your time and your comments. All the
>>>> questions are valid and good for us to think more about.
>>>>
>>>> Regards
>>>> Jian
>>>>
>>>> On Tue, Dec 2, 2025 at 5:41 PM Kamal Chandraprakash
>>>> <[email protected]> wrote:
>>>>
>>>>> 1. Do you also have to update the RemoteCopy lag segments and bytes
>>>>> metric?
>>>>> 2. As Haiying mentioned, the segments get eventually uploaded to
>>>>> remote so not sure about the benefit of this proposal. And, remote
>>>>> storage cost is considered as low when compared to broker local-disk.
>>>>> It provides some cushion during third-party object storage downtime.
>>>>>
>>>>> On Tue, Dec 2, 2025 at 2:45 PM Kamal Chandraprakash
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Jian,
>>>>>>
>>>>>> Thanks for the KIP!
>>>>>>
>>>>>> When remote storage is unavailable for a few hours, then with lazy
>>>>>> upload there is a risk of the broker disk getting full soon.
>>>>>> The Admin has to configure the local retention configs properly.
>>>>>> With eager upload, the disk utilization won't grow until the local
>>>>>> retention time (the expectation is that all the passive segments are
>>>>>> uploaded), and it provides some time for the Admin to take any
>>>>>> action based on the situation.
>>>>>>
>>>>>> --
>>>>>> Kamal
>>>>>>
>>>>>> On Tue, Dec 2, 2025 at 10:28 AM Haiying Cai via dev
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Jian,
>>>>>>>
>>>>>>> Understood, this is an optional feature and the cost saving depends
>>>>>>> on the ratio between local.retention.ms and total retention.ms.
>>>>>>>
>>>>>>> In our setup, we have local.retention set to 3 hours and total
>>>>>>> retention set to 3 days, so the saving is not going to be
>>>>>>> significant.
>>>>>>>
>>>>>>> On 2025/12/01 05:33:11 jian fu wrote:
>>>>>>>> Hi Haiying Cai,
>>>>>>>>
>>>>>>>> Thanks for joining the discussion for this KIP. All of your
>>>>>>>> concerns are valid, and that is exactly why I introduced a
>>>>>>>> topic-level configuration to make this feature optional. This
>>>>>>>> means that, by default, the behavior remains unchanged. Only users
>>>>>>>> who are not pursuing faster broker boot time or other
>>>>>>>> optimizations, and who care more about cost, would enable this
>>>>>>>> option on some topics to save resources.
>>>>>>>>
>>>>>>>> Regarding the cost itself: the actual savings depend on the ratio
>>>>>>>> between local retention and remote retention. In the KIP/PR, I
>>>>>>>> provided a test example: if we configure 1 day of local retention
>>>>>>>> and 2 days of remote retention, we can save about 50%.
>>>>>>>> Realistically, I don't think anyone would boldly set local
>>>>>>>> retention to a very small value (such as minutes) because of the
>>>>>>>> latency concerns associated with remote storage. So in short, the
>>>>>>>> feature will help reduce cost, and the amount saved simply depends
>>>>>>>> on the ratio. Taking my company's usage as a real example, we
>>>>>>>> configure most topics with 1 day of local retention and 3-7 days
>>>>>>>> of remote storage (3 days for topics with log/metric usage, 7 days
>>>>>>>> for topics with normal business usage), and we don't care about
>>>>>>>> boot speed or anything else, so this KIP allows us to save 1/7 to
>>>>>>>> 1/3 of the total disk usage for remote storage.
>>>>>>>>
>>>>>>>> Anyway, this is just a topic-level optional feature which doesn't
>>>>>>>> reject the benefits of the current design. Thanks again for the
>>>>>>>> discussion. I can update the KIP to better describe scenarios
>>>>>>>> where this optional feature is not suitable; currently, I only
>>>>>>>> listed real-time analytics as the negative example.
>>>>>>>>
>>>>>>>> Further discussion is welcome to help make this KIP more complete.
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Jian
>>>>>>>>
>>>>>>>> On Mon, Dec 1, 2025 at 12:40 PM Haiying Cai via dev
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Jian,
>>>>>>>>>
>>>>>>>>> Thanks for the contribution. But I feel that uploading the local
>>>>>>>>> segment file to remote storage ASAP is advantageous in several
>>>>>>>>> scenarios:
>>>>>>>>>
>>>>>>>>> 1. It enables fast bootstrapping of a new broker. A new broker
>>>>>>>>> doesn't have to replicate all the data from the leader broker; it
>>>>>>>>> only needs to replicate the data from the tail of the remote log
>>>>>>>>> segments to the current end of the topic (LSO), since all the
>>>>>>>>> other data is in remote tiered storage and can be downloaded
>>>>>>>>> lazily later. This is what KIP-1023 is trying to solve.
>>>>>>>>> 2. Although nobody has proposed a KIP yet to allow a consumer
>>>>>>>>> client to read from remote tiered storage directly, this would
>>>>>>>>> help a fallen-behind consumer do catch-up reads or perform a
>>>>>>>>> backfill. That path allows the consumer backfill to finish
>>>>>>>>> without polluting the broker's page cache. The earlier the data
>>>>>>>>> is in remote tiered storage, the more advantageous it is for the
>>>>>>>>> client.
>>>>>>>>>
>>>>>>>>> I think in your proposal you are delaying uploading the segment,
>>>>>>>>> but the file will still be uploaded at a later time. I guess this
>>>>>>>>> can save a few hours of storage cost for that file in remote
>>>>>>>>> storage; I am not sure whether that is a significant cost saving
>>>>>>>>> (if the file needs to stay in remote tiered storage for several
>>>>>>>>> days or weeks due to the retention policy).
>>>>>>>>>
>>>>>>>>> On 2025/11/19 13:29:11 jian fu wrote:
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I'd like to start a discussion on KIP-1241; the goal is to
>>>>>>>>>> reduce remote storage usage.
>>>>>>>>>> KIP:
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1241%3A+Reduce+tiered+storage+redundancy+with+delayed+upload
>>>>>>>>>> Draft PR: https://github.com/apache/kafka/pull/20913
>>>>>>>>>>
>>>>>>>>>> Problem: Currently, Kafka's tiered storage implementation
>>>>>>>>>> uploads all non-active local log segments to remote storage
>>>>>>>>>> immediately, even when they are still within the local retention
>>>>>>>>>> period. This results in redundant storage of the same data in
>>>>>>>>>> both the local and remote tiers.
>>>>>>>>>>
>>>>>>>>>> When there is no requirement for real-time analytics or
>>>>>>>>>> immediate consumption based on remote storage, this has the
>>>>>>>>>> following drawbacks:
>>>>>>>>>>
>>>>>>>>>> 1. It wastes storage capacity and cost: the same data is stored
>>>>>>>>>> twice during the local retention window.
>>>>>>>>>> 2. It provides no immediate benefit: during the local retention
>>>>>>>>>> period, reads prioritize local data, making the remote copy
>>>>>>>>>> unnecessary.
>>>>>>>>>>
>>>>>>>>>> So this KIP proposes to reduce tiered storage redundancy with
>>>>>>>>>> delayed upload. You can check the test result example here:
>>>>>>>>>> https://github.com/apache/kafka/pull/20913#issuecomment-3547156286
>>>>>>>>>>
>>>>>>>>>> Looking forward to your feedback!
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Jian
