Hi All:

I updated the KIP content according to Kamal and Haiying's discussion:

1. Explicitly emphasized that this is an optional, topic-level feature intended for users who prioritize cost.
2. Added the cost-saving calculation example (a rough sketch of the arithmetic is also included below).
3. Added details about the operational drawback of this feature: extra disk capacity may be needed during a prolonged remote storage outage.
4. Added the scenarios where enabling the feature may not be very suitable or beneficial, such as topics whose remote:local retention ratio is very large.
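For reference, here is a small, self-contained Python sketch of the kind of calculation behind that example. The throughput figure and the ~0.02 USD per GiB-month object-storage price are illustrative assumptions rather than numbers taken from the KIP or the PR, and the helper name remote_savings exists only for this sketch:

    def remote_savings(write_gib_per_day, local_retention_days,
                       total_retention_days, price_per_gib_month=0.02):
        """Estimate the steady-state remote footprint with eager vs. delayed upload.

        Eager upload keeps roughly total_retention_days worth of data in the
        remote tier; delayed upload skips the window still covered by local
        retention, leaving roughly total_retention_days - local_retention_days.
        """
        eager_gib = write_gib_per_day * total_retention_days
        delayed_gib = write_gib_per_day * max(total_retention_days - local_retention_days, 0.0)
        saved_gib = eager_gib - delayed_gib
        return {
            "eager_remote_gib": eager_gib,
            "delayed_remote_gib": delayed_gib,
            "saved_fraction": saved_gib / eager_gib if eager_gib else 0.0,
            "saved_usd_per_month": saved_gib * price_per_gib_month,
        }

    # 1 day local + 2 days total retention -> remote footprint shrinks by ~50%
    print(remote_savings(1000, local_retention_days=1, total_retention_days=2))
    # 1 day local + 7 days total retention -> only ~1/7 (~14%), the "large remote:local ratio" case
    print(remote_savings(1000, local_retention_days=1, total_retention_days=7))

The saved fraction is simply local retention divided by total retention, which is why the benefit shrinks quickly as the remote:local ratio grows.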
Thanks again for joining the discussion.

Regards
Jian

On Tue, Dec 2, 2025 at 8:27 PM jian fu <[email protected]> wrote:

> Hi Kamal:
>
> I think I understand what you mean now. I've updated the picture in the
> link (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
> Could you help double-check whether we've reached the same understanding?
> In short, the drawback of this KIP is that, during a long remote storage
> outage, it will occupy more disk; the maximum extra usage is the redundant
> part we are saving. After the outage is recovered, usage comes back to
> where it started.
> Please correct me if my understanding is wrong! Thanks again.
>
> Regards
> Jian
>
> On Tue, Dec 2, 2025 at 7:29 PM Kamal Chandraprakash
> <[email protected]> wrote:
>
>> The already uploaded segments are eligible for deletion from the broker.
>> So, when remote storage is down, those segments can be deleted as per
>> the local retention settings and new segments can occupy that space.
>> This provides more time for the Admin to act when remote storage is down
>> for a longer time.
>>
>> This is from a reliability perspective.
>>
>> On Tue, Dec 2, 2025 at 4:47 PM jian fu <[email protected]> wrote:
>>
>>> Hi Kamal and Haiying Cai:
>>>
>>> You may have noticed that my Kafka clusters are set to 1 day local +
>>> 3-7 days remote, whereas Haiying Cai's configuration is 3 hours local +
>>> 3 days remote.
>>>
>>> Let me explain more about my configuration. I try to avoid the latency
>>> of lagging consumers reading from remote storage. Some applications may
>>> run into unexpected issues, and we need to give them enough time to
>>> recover; during that period we don't want consumers reading from the
>>> remote tier and hurting the whole Kafka cluster. So one day of local
>>> retention is our expectation.
>>>
>>> I saw one statement in Haiying Cai's KIP-1248:
>>> "Currently, when a new consumer or a fallen-off consumer requires
>>> fetching messages from a while ago, and those messages are no longer
>>> present in the Kafka broker's local storage, the broker must download
>>> the message from the remote tiered storage and subsequently transfer
>>> the data back to the consumer."
>>> Extending the local retention time is how we try to avoid that issue
>>> (here we don't consider the case where a new consumer consumes from
>>> earliest; it does not happen often in our cases).
>>>
>>> So, based on my configuration, there is one day's worth of duplicated
>>> segments wasted in remote storage. We don't use them for real-time
>>> analytics or care about fast reboots or anything else, so I proposed
>>> this KIP as a topic-level optional feature to help us reduce waste and
>>> save money.
>>>
>>> Regards
>>> Jian
>>>
>>> On Tue, Dec 2, 2025 at 6:42 PM jian fu <[email protected]> wrote:
>>>
>>>> Hi Kamal:
>>>>
>>>> Thanks for joining this discussion. Let me try to clarify my
>>>> understanding of your good questions:
>>>>
>>>> 1. Kamal: Do you also have to update the RemoteCopy lag segments and
>>>> bytes metric?
>>>> Jian: The code just delays the upload time for local segments, so it
>>>> seems there is no need to change any lag segments or metrics. Right?
>>>>
>>>> 2. Kamal: As Haiying mentioned, the segments get eventually uploaded
>>>> to remote so not sure about the benefit of this proposal. And, remote
>>>> storage cost is considered as low when compared to broker local-disk.
>>>> Jian: The cost benefit is about the total size occupied. Take AWS S3
>>>> as an example: the tiered price is about 0.02 USD per GB (you can
>>>> refer to https://calculator.aws/#/createCalculator/S3). It is cheaper
>>>> than local disk. As I mentioned, the money saved depends on the ratio
>>>> of local vs. remote retention time. If you set the remote retention to
>>>> a long time, the benefit is small; it is more about avoiding waste
>>>> than saving cost. That is why I made it a topic-level optional
>>>> configuration instead of a default feature.
>>>>
>>>> 3. Kamal: It provides some cushion during third-party object storage
>>>> downtime.
>>>> Jian: I drew a picture to try to understand the logic
>>>> (https://github.com/apache/kafka/pull/20913#issuecomment-3601274230).
>>>> You can help check whether my understanding is right. It seemed to me
>>>> that there is no difference between them, so for this question maybe
>>>> we need to discuss more. The only difference may be that we
>>>> temporarily use a little more local disk due to the delayed upload to
>>>> remote. In the original proposal I wanted to upload N-1 segments, but
>>>> it seems the value of that is not much.
>>>>
>>>> BTW, I want to clarify one basic rule: this feature doesn't change the
>>>> default behavior, and the amount saved is not a very big value in all
>>>> cases. It is suitable for the subset of topics that have a low
>>>> remote/local retention ratio, such as 7 days/1 day or 3 days/1 day.
>>>>
>>>> Finally, thanks again for your time and your comments. All the
>>>> questions are valid and good for us to think more about.
>>>>
>>>> Regards
>>>> Jian
>>>>
>>>> On Tue, Dec 2, 2025 at 5:41 PM Kamal Chandraprakash
>>>> <[email protected]> wrote:
>>>>
>>>>> 1. Do you also have to update the RemoteCopy lag segments and bytes
>>>>> metric?
>>>>> 2. As Haiying mentioned, the segments get eventually uploaded to
>>>>> remote so not sure about the benefit of this proposal. And, remote
>>>>> storage cost is considered as low when compared to broker local-disk.
>>>>> It provides some cushion during third-party object storage downtime.
>>>>>
>>>>> On Tue, Dec 2, 2025 at 2:45 PM Kamal Chandraprakash
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Jian,
>>>>>>
>>>>>> Thanks for the KIP!
>>>>>>
>>>>>> When remote storage is unavailable for a few hours, then with lazy
>>>>>> upload there is a risk of the broker disk getting full soon.
>>>>>> The Admin has to configure the local retention configs properly.
>>>>>> With eager upload, the disk utilization won't grow until the local
>>>>>> retention time (the expectation is that all the passive segments are
>>>>>> uploaded), and it provides some time for the Admin to take any
>>>>>> action based on the situation.
>>>>>>
>>>>>> --
>>>>>> Kamal
>>>>>>
>>>>>> On Tue, Dec 2, 2025 at 10:28 AM Haiying Cai via dev
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Jian,
>>>>>>>
>>>>>>> Understood, this is an optional feature and the cost saving depends
>>>>>>> on the ratio between local.retention.ms and total retention.ms.
>>>>>>>
>>>>>>> In our setup, we have local.retention set to 3 hours and total
>>>>>>> retention set to 3 days, so the saving is not going to be
>>>>>>> significant.
>>>>>>>
>>>>>>> On 2025/12/01 05:33:11 jian fu wrote:
>>>>>>>> Hi Haiying Cai,
>>>>>>>>
>>>>>>>> Thanks for joining the discussion for this KIP. All of your
>>>>>>>> concerns are valid, and that is exactly why I introduced a
>>>>>>>> topic-level configuration to make this feature optional. This
>>>>>>>> means that, by default, the behavior remains unchanged. Only users
>>>>>>>> who are not pursuing faster broker boot time or other
>>>>>>>> optimizations, and who care more about cost, would enable this
>>>>>>>> option on some topics to save resources.
>>>>>>>>
>>>>>>>> Regarding the cost itself: the actual savings depend on the ratio
>>>>>>>> between local retention and remote retention. In the KIP/PR, I
>>>>>>>> provided a test example: if we configure 1 day of local retention
>>>>>>>> and 2 days of remote retention, we can save about 50%.
>>>>>>>> Realistically, I don't think anyone would boldly set local
>>>>>>>> retention to a very small value (such as minutes) because of the
>>>>>>>> latency concerns associated with remote storage. So in short, the
>>>>>>>> feature will help reduce cost, and the amount saved simply depends
>>>>>>>> on the ratio. Taking my company's usage as a real example, we
>>>>>>>> configure most topics with 1 day of local retention and 3-7 days
>>>>>>>> of remote storage (3 days for topics with log/metric usage, 7 days
>>>>>>>> for topics with normal business usage), and we don't care about
>>>>>>>> boot speed or anything else, so this KIP allows us to save 1/7 to
>>>>>>>> 1/3 of the total disk usage for remote storage.
>>>>>>>>
>>>>>>>> Anyway, this is just a topic-level optional feature which doesn't
>>>>>>>> reject the benefits of the current design. Thanks again for the
>>>>>>>> discussion. I can update the KIP to better describe scenarios
>>>>>>>> where this optional feature is not suitable; currently, I only
>>>>>>>> listed real-time analytics as the negative example.
>>>>>>>>
>>>>>>>> Further discussion is welcome to help make this KIP more complete.
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Jian
>>>>>>>>
>>>>>>>> On Mon, Dec 1, 2025 at 12:40 PM Haiying Cai via dev
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Jian,
>>>>>>>>>
>>>>>>>>> Thanks for the contribution. But I feel that uploading the local
>>>>>>>>> segment file to remote storage ASAP is advantageous in several
>>>>>>>>> scenarios:
>>>>>>>>>
>>>>>>>>> 1. It enables fast bootstrapping of a new broker. A new broker
>>>>>>>>> doesn't have to replicate all the data from the leader broker; it
>>>>>>>>> only needs to replicate the data from the tail of the remote log
>>>>>>>>> segments to the current end of the topic (LSO), since all the
>>>>>>>>> other data is in remote tiered storage and can be downloaded
>>>>>>>>> lazily later. This is what KIP-1023 is trying to solve.
>>>>>>>>> 2. Although nobody has proposed a KIP yet to allow a consumer
>>>>>>>>> client to read from remote tiered storage directly, this would
>>>>>>>>> help a fallen-behind consumer do catch-up reads or perform a
>>>>>>>>> backfill. That path allows the consumer backfill to finish
>>>>>>>>> without polluting the broker's page cache. The earlier the data
>>>>>>>>> is in remote tiered storage, the more advantageous it is for the
>>>>>>>>> client.
>>>>>>>>>
>>>>>>>>> I think in your proposal you are delaying uploading the segment,
>>>>>>>>> but the file will still be uploaded at a later time. I guess this
>>>>>>>>> can save a few hours of storage cost for that file in remote
>>>>>>>>> storage; I am not sure whether that is a significant cost saving
>>>>>>>>> (if the file needs to stay in remote tiered storage for several
>>>>>>>>> days or weeks due to the retention policy).
>>>>>>>>>
>>>>>>>>> On 2025/11/19 13:29:11 jian fu wrote:
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> I'd like to start a discussion on KIP-1241; the goal is to
>>>>>>>>>> reduce remote storage usage.
>>>>>>>>>> KIP:
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1241%3A+Reduce+tiered+storage+redundancy+with+delayed+upload
>>>>>>>>>> Draft PR: https://github.com/apache/kafka/pull/20913
>>>>>>>>>>
>>>>>>>>>> Problem: Currently, Kafka's tiered storage implementation
>>>>>>>>>> uploads all non-active local log segments to remote storage
>>>>>>>>>> immediately, even when they are still within the local retention
>>>>>>>>>> period. This results in redundant storage of the same data in
>>>>>>>>>> both the local and remote tiers.
>>>>>>>>>>
>>>>>>>>>> When there is no requirement for real-time analytics or
>>>>>>>>>> immediate consumption based on remote storage, this has the
>>>>>>>>>> following drawbacks:
>>>>>>>>>>
>>>>>>>>>> 1. It wastes storage capacity and cost: the same data is stored
>>>>>>>>>> twice during the local retention window.
>>>>>>>>>> 2. It provides no immediate benefit: during the local retention
>>>>>>>>>> period, reads prioritize local data, making the remote copy
>>>>>>>>>> unnecessary.
>>>>>>>>>>
>>>>>>>>>> So this KIP proposes to reduce tiered storage redundancy with
>>>>>>>>>> delayed upload. You can check the test result example here:
>>>>>>>>>> https://github.com/apache/kafka/pull/20913#issuecomment-3547156286
>>>>>>>>>>
>>>>>>>>>> Looking forward to your feedback!
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Jian
