Hi Greg,

I personally quite like the idea of delegating the tiny-object merging task to tiered storage. Sadly, there are some drawbacks, as Jun pointed out. I agree that if we use the aggressive object-tiering solution, it might de-prioritize or delay progress on the classic tiered storage topics.
> We can build the merging step to optimize WAL segments for more predictable rebuild times. But could we still perform a final move to Tiered Storage after each partition reaches the configured roll times?

I understand you may have use cases in mind for the future, but it doesn't make sense to merge 500 tiny objects into one big WAL segment only to get rid of it and upload another copy of the log segment onto remote storage via tiered storage. Maybe you can consider directly appending new metadata into RLMM to point to the location of the merged WAL segments?

Thank you,
Luke

On Fri, May 15, 2026 at 5:11 AM Greg Harris via dev <[email protected]> wrote: > Jun & Satish, > > We can build the merging step to optimize WAL segments for more predictable > rebuild times. But could we still perform a final move to Tiered Storage > after each partition reaches the configured roll times? We could expect the > same load/sizing expectations as classic topics (e.g. >1gb segments). > > We are interested in unifying with Tiered Storage for many reasons, but > also so that topics which have diskless mode dynamically enabled/disabled > can eventually converge to a predictable state. > > Thanks, > Greg > > On Wed, May 13, 2026, 3:56 AM Satish Duggana <[email protected]> > wrote: > > > RLMM was not designed for aggressive copying of the latest data to > > tiered storage by having small segment rollouts. > > > > +1 to Jun on leaving the existing RLMM for classic topics with tiered > > storage and having an efficient metadata management system required > > for diskless topics. > > > > > > On Tue, 12 May 2026 at 23:59, Jun Rao via dev <[email protected]> > > wrote: > > > > > > Hi, Victor, > > > > > > Thanks for the reply. > > > > > > JR1. (A) and (B) Yes, your summary matches my thinking. > > > (C) "Generally I think that (i) (ii) (iii) and (iv) may be addressed > with > > > an aggressive tiered storage consolidation (the first approach)". 
> > > Hmm, I am confused by the above statement. By "the first approach", do > > you > > > mean aggressive tiering with faster segment rolling through the > existing > > > RLMM? I don't think the existing RLMM is designed to solve these issues > > due > > > to inefficiencies in cost, metadata propagation and metadata storage as > > we > > > previously discussed. > > > > > > JR11. I was thinking we leave the existing RLMM as is and continue to > use > > > it for classic topics. We design a new, more efficient metadata > > management > > > component independent of RLMM. This new component will be the only > > metadata > > > component that diskless topics depend on. > > > > > > Jun > > > > > > On Tue, May 12, 2026 at 8:43 AM Viktor Somogyi-Vass <[email protected] > > > > > wrote: > > > > > > > Hi Jun, > > > > > > > > JR1 > > > > (1)-(2)-(3) I'd address these together and let me explain our current > > idea > > > > to solve the tiny object problem because I'm not sure if we're 100% > > talking > > > > about the same thing. I have two approaches in mind for TS > > consolidation > > > > ((A) and (B)) and I'm not sure if we're both assuming the same idea, > so > > > > let's clarify this. > > > > > > > > (A) > > > > This is our current assumption. This uses local disks (create classic > > > > local logs with UnifiedLog) to consolidate logs into the classic log > > format > > > > and use RSM and RLMM to store them in tiered storage. This way we're > > not > > > > limited by the need to have short rollovers. Local logs become a form > > of > > > > staging environment to serve reads and accumulate records for tiered > > > > storage. This means that: > > > > (a) Once a message is consolidated into the classic log format, we > can > > > > use it for serving lagging consumers. Diskless reads should really be > > used > > > > for the head of the log and after a few seconds logs should be > > consolidated. 
> > > > (b) The real cost is much closer to that 87.5% (and in fact my > google > > > sheet I shared also assumes this model) because we have more freedom in > > > > choosing the retention parameters of the classic log. > > > > (c) Metadata is smaller as we only need to keep diskless segments > > until > > > > the tiered offset surpasses the individual batches' offset. > > > > (d) RLMM metadata is also somewhat manageable due to the larger > > segment > > > > sizes but it's still possible to run into the metadata explosion > > problem. > > > > (e) It needs to rebuild this local log on reassignment to serve > > lagging > > > > consumers effectively, so reassignment is a bit more messy. > > > > (f) It's not optimal when partitions have a single replica: on > > failure we > > > > can only fall back to diskless mode until the partition is reassigned > > to a > > > > functioning broker. > > > > > > > > (B) > > > > Compared to the above, there is an alternative approach, which is > to > > > > consolidate when diskless segments expire (after 15 minutes for > > instance). > > > > In that case your points seem to fit better as: > > > > (a) we can only use the classic, consolidated logs to serve lagging > > > > consumers after they have been tiered > > > > (b) to be more efficient with lagging consumers we have to stick to > a > > > > short rollover > > > > (c) it's more costly due to the short rollovers > > > > (d) the RLMM bottleneck still exists due to the short rollovers > > > > (e) it's not a given that we'd use local disks for transforming logs, > > as we > > > > can do it in memory too (which can be inefficient and more expensive), > > but > > > > perhaps the “chunked transfer encoding” that S3 supports, or similar > with > > > > other providers, is a cost-effective way. If we know the final size in > > advance, > > > > we can upload data in chunks and still get billed for 1 put. 
> > > > (f) reassignment or failover is cleaner and faster as > > > there isn't a need to rebuild local caches. > > > > > > > > (C) > > > > Apart from the first 2 approaches there is a 3rd, which is WAL > > merging. To > > > > understand your points, let me summarize what I could gather so far > as > > > > the reasons for WAL merging (and please correct me if I missed > something): > > > > (i) protecting consumer lag: small WAL files create inefficient > > objects > > > > for lagging consumers, so larger objects should be more efficient > > > > (ii) avoiding the RLMM replay bottleneck: managing small segments > with > > > > RLMM is very inefficient (100s of GB metadata) > > > > (iii) reducing batch metadata overhead: merging WAL files may reduce > > the > > > > metadata we need to store, but it depends on the merge algorithm and > > how we > > > > can compact batch data > > > > (iv) cost effectiveness: retrieving merged WAL files reduces the > > number > > > > of get requests to object storage > > > > (v) architectural redundancy with RLMM: ideally we wouldn't need 2 > > > > solutions to 2 somewhat similar problems (tiered storage and > diskless) > > > > > > > > Generally I think that (i) (ii) (iii) and (iv) may be addressed with > an > > > > aggressive tiered storage consolidation (the first approach), so the > > only > > > > remaining gap would be (v). I also agree that having 2 different > > solutions > > > > for metadata handling isn't ideal and perhaps there is a possibility > of > > > > improvement here. It should be possible to redesign RLMM to be more > > similar > > > > to the diskless coordinator or design a common solution. > > > > > > > > JR11 > > > > "If we support merging in the diskless coordinator, I wonder how > useful > > > > RLMM > > > > is. It seems simpler to manage all metadata from the object store in > a > > > > single place." > > > > > > > > Could you please clarify this a little bit? 
Do you think that we > should > > > > replace the RLMM with a solution that is more similar to the diskless > > > > coordinator or deprecate tiered storage altogether in favor of > > diskless? > > > > I'm not sure which option you're referring to: > > > > (1) Unify tiered storage and diskless under a single storage layer > > (and > > > > possibly deprecate tiered storage in favor of diskless with merging > WAL > > > > segments). > > > > (2) Create a smart coordinator instead of RLMM and possibly unify > > > > metadata coordination with diskless. > > > > (3) Keep tiered storage and diskless separate with their own > solutions > > > > for metadata (probably not optimal). > > > > > > > > Thanks, > > > > Viktor > > > > > > > > On Fri, May 1, 2026 at 9:08 PM Jun Rao via dev <[email protected] > > > > > > wrote: > > > > > > > >> Hi, Viktor and Greg, > > > >> > > > >> Thanks for the reply. > > > >> > > > >> JR1. > > > >> 1) Thanks for verifying the cost estimation. I noticed a bug in my > > earlier > > > >> calculation. I estimated the per broker network transfer rate at > > 2MB/sec. > > > >> It should be 4MB/sec. If I correct it, the estimated savings are > > similar > > > >> to > > > >> yours. > > > >> The cost for transferring 4MB through the network is 4 * 2 * 10^-5 = > > $8* > > > >> 10^-5 > > > >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings > > are > > > >> about 87.5%. > > > >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings > > are > > > >> 62.5%. > > > >> Savings are still significantly lower when using RLMM. > > > >> > > > >> "To me it seems that Greg's previous suggestion for a 15 min > > rollover > > > >> may be a bit too much. With 1 hour we can achieve better cost saving > > and > > > >> less coordinator metadata being stored." > > > >> This solves the cost issue, but it has other implications (see point > > 2) > > > >> below). 
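For readers who want to check the arithmetic, the corrected savings figures quoted above can be reproduced with a short script. The prices are the AWS numbers assumed in the thread ($0.02/GB network transfer, $0.005 per 1,000 S3 PUTs); the `savings` helper is purely illustrative, not part of any proposal:

```python
# Sanity check of the savings arithmetic quoted in the thread.
# Assumed AWS prices from the discussion:
#   network transfer: $0.02/GB  ~= $2e-5 per MB
#   S3 PUT:           $0.005 per 1000 requests = $0.5e-5 per request
NETWORK_COST_PER_MB = 2e-5
S3_PUT_COST = 0.5e-5

def savings(mb_transferred: float, num_puts: int) -> float:
    """Fractional saving from replacing replication traffic with S3 PUTs."""
    network = mb_transferred * NETWORK_COST_PER_MB  # e.g. 4MB -> $8e-5
    puts = num_puts * S3_PUT_COST
    return 1 - puts / network

print(f"{savings(4, 2):.1%}")  # -> 87.5%
print(f"{savings(4, 6):.1%}")  # -> 62.5%
```

Varying `num_puts` makes it easy to see how quickly extra per-interval PUTs erode the cost benefit.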
> > > >> > > > >> 2) "Yes, I think this is to be expected and a lot depends on the > > > >> implementation. Ideally segments or chunks should be cached to > > minimize > > > >> the > > > >> number of times segments are pulled from remote storage." > > > >> In a classic topic, when a consumer lags, its requests are served > > either > > > >> from the local cache or from large objects in the object store. With > > the > > > >> current design in a diskless topic, lagging consumer requests might > be > > > >> served from tiny 500-byte objects. This will significantly slow down > > the > > > >> consumer's catch-up, which is not expected user behavior. Ideally, > we > > > >> don't > > > >> want those tiny objects to last more than a few minutes, let alone > an > > > >> hour. > > > >> > > > >> 3) "I think if my calculations are correct (and we use a 60 minute > > > >> window), > > > >> then metadata generation should be slower, please see the google > > sheet I > > > >> linked above. I think given that traffic, the current topic based > RLMM > > > >> should be able to handle it." > > > >> Why is a 60 minute window used? RLMM metadata needs to be retained > > for the > > > >> longest retention time among all topics. This means that the > retention > > > >> window can be weeks instead of 1 hour. As a result, RLMM might > > need to > > > >> replay over 100GB of data during reassignment, which is not what it > is > > > >> designed for. > > > >> > > > >> JR10. "Your example of 100,000 1kb/s partitions is a borderline > case, > > > >> where > > > >> there are some configurations which are not viable due to scale or > > cost, > > > >> and some that are. 
It would be up to the operator to tune their > > cluster, > > > >> by > > > >> changing diskless.segment.ms, > > > >> dividing up the cluster, or switching to a more scalable RLMM > > > >> implementation." > > > >> A broker with 4MB/sec produce throughput can probably be considered > > high > > > >> throughput. Even with 4K partitions per broker, we could still > > achieve an > > > >> 87.5% cost saving as listed above, if we do the right > implementation. > > So, > > > >> ideally, it would be useful to support that as well. > > > >> > > > >> JR11. "We had a short conversation with Greg and we came to the > > conclusion > > > >> that because of the explosiveness of diskless metadata, it may be > > worth > > > >> revisiting the merging case as it can indeed buy us some more cost > > saving > > > >> for the added complexity. " > > > >> If we support merging in the diskless coordinator, I wonder how > useful > > > >> RLMM > > > >> is. It seems simpler to manage all metadata from the object store > in a > > > >> single place. > > > >> > > > >> Jun > > > >> > > > >> On Mon, Apr 27, 2026 at 4:17 PM Greg Harris <[email protected]> > > wrote: > > > >> > > > >> > Hi Jun, > > > >> > > > > >> > Thank you for scrutinizing the scalability of the current > > > >> > direct-to-tiered-storage strategy, and its metadata scalability. 
> > > >> > > > >> One of our implicit assumptions with this design was that users > are able > > > >> > to choose between the Diskless and Classic mechanisms, and that > in any > > > >> > situations where the Diskless design was deficient, the Classic > > topics > > > >> > could continue to be used. > > > >> > This was originally applied to low-latency use-cases, but now also > > > >> applies > > > >> > to low-throughput use-cases. When the throughput on a topic is > > low, > > > >> the > > > >> > benefit of using Diskless is also low, because it is proportional > > to the > > > >> > amount of data transferred, and it is more likely that the batch > > > >> overhead > > > >> > of the topics is significant. > > > >> > In other words, we've been treating cost-effective support for > > > >> arbitrarily > > > >> > low throughput topics as a non-goal. > > > >> > > > > >> > Your example of 100,000 1kb/s partitions is a borderline case, > where > > > >> there > > > >> > are some configurations which are not viable due to scale or cost, > > and > > > >> some > > > >> > that are. It would be up to the operator to tune their cluster, by > > > >> changing > > > >> > diskless.segment.ms, > > > >> > dividing up the cluster, or switching to a more scalable RLMM > > > >> > implementation. > > > >> > > > > >> > Do you think we should have cost-effective support for arbitrarily > > > >> > low-throughput partitions in Diskless? How much total demand is > > there in > > > >> > partitions where batches are >1kb but the partition throughput is > > > >> <1kb/s? 
> > > >> > > > > >> > Thanks, > > > >> > Greg > > > >> > > > > >> > On Fri, Apr 24, 2026 at 10:23 AM Viktor Somogyi-Vass < > > [email protected] > > > >> > > > > >> > wrote: > > > >> > > > > >> >> Hi Jun, > > > >> >> > > > >> >> Regarding JR1. > > > >> >> We had a short conversation with Greg and we came to the > conclusion > > > >> that > > > >> >> because of the explosiveness of diskless metadata, it may be > worth > > > >> >> revisiting the merging case as it can indeed buy us some more > cost > > > >> saving > > > >> >> for the added complexity. Also, it would support smaller topics > > and we > > > >> >> could somewhat manage the tiered storage consolidation costs. I > > think > > > >> that > > > >> >> we would still need to consolidate WAL segments into tiered > > storage. > > > >> >> Reasons are: to limit WAL metadata, to be able to dynamically > > > >> >> enable/disable diskless and to be compatible with existing and > > future > > > >> TS > > > >> >> improvements. > > > >> >> I'll try to refresh KIP-1165 and build it into the calculator > > above (if > > > >> >> it's possible at all :) ) and come back to you. > > > >> >> Regardless, I just wanted to give a short update in the meantime, > > > >> looking > > > >> >> forward to your answer. > > > >> >> > > > >> >> Best, > > > >> >> Viktor > > > >> >> > > > >> >> On Fri, Apr 24, 2026 at 3:46 PM Viktor Somogyi-Vass < > > > >> >> [email protected]> > > > >> >> wrote: > > > >> >> > > > >> >> > Hi Jun, > > > >> >> > > > > >> >> > Thanks for the quick reply. > > > >> >> > > > > >> >> > JR1. > > > >> >> > 1) Thanks for putting the numbers together. While your > > calculation > > > >> >> > seems to be correct in the sense that 6 PUTs would worsen the > > cost > > > >> >> saving > > > >> >> > benefits, I think that in a byte for byte comparison there is a > > > >> bigger > > > >> >> > difference. 
The reason is that the 4 tiered storage puts > transfer > > > >> much > > > >> >> more > > > >> >> > data compared to the small WAL segments, so in practice there > should > > > >> be > > > >> >> > fewer TS puts. > > > >> >> > I made a google sheet calculator for this which I'd like to share > > > >> with > > > >> >> > you: > > > >> >> > > > > >> >> > > > >> > > > https://docs.google.com/spreadsheets/d/127GOTWfFSN27B5ezif14GPj8KtrghjBqsXG9GG6NxhI/edit?gid=749470906#gid=749470906 > > > >> >> > Please copy the sheet to modify the values. > > > >> >> > About my findings: I was trying to create a similar cluster > model > > > >> that > > > >> >> has > > > >> >> > been discussed here previously to see how cost varies over > > different > > > >> >> > segment rollovers. To me it seems that Greg's previous > > suggestion > > > >> >> for a > > > >> >> > 15 min rollover may be a bit too much. With 1 hour we can > achieve > > > >> better > > > >> >> > cost saving and less coordinator metadata being stored. I have > > also > > > >> >> tried to > > > >> >> > account for the producer batch metadata generated by diskless > > > >> partitions > > > >> >> > but to me it seems like a lower number than Greg's original > > numbers. > > > >> >> > > > > >> >> > 2) "Note that local storage could be lost on reassigned > > partitions. > > > >> In > > > >> >> > that case, lagging reads can only be served from the object > > store." 
> > > >> >> > Yes, I think this is to be expected and a lot depends on the > > > >> >> > implementation. Ideally segments or chunks should be cached to > > > >> minimize > > > >> >> the > > > >> >> > number of times segments are pulled from remote storage. > > > >> >> > > > > >> >> > "The 2MB/sec I quoted is for a specific broker. Depending on > the > > > >> broker > > > >> >> > instance type, a broker may only be able to handle low 10s of > > MB/sec > > > >> of > > > >> >> > data. So, 2MB/sec overhead is significant." > > > >> >> > Yes, I have indeed misunderstood, however I have updated my > > > >> calculator > > > >> >> > sheet with metadata calculation. Overall, the number of tiered > > > >> storage > > > >> >> > segments created seems to be much lower than in your > calculations > > > >> given > > > >> >> the > > > >> >> > parameters of the cluster you specified earlier. Please take a > > look, > > > >> I'd > > > >> >> > like to really understand the thinking here because this is a > > crucial > > > >> >> point. > > > >> >> > > > > >> >> > 3) I think if my calculations are correct (and we use a 60 > minute > > > >> >> window), > > > >> >> > then metadata generation should be slower, please see the > google > > > >> sheet I > > > >> >> > linked above. I think given that traffic, the current topic > based > > > >> RLMM > > > >> >> > should be able to handle it. > > > >> >> > In the case where we would need to make the RLMM capable of > > handling > > > >> >> > similar traffic to the diskless coordinator, then you're right, > > we > > > >> >> probably > > > >> >> > should consider how we can improve it. I think there are > multiple > > > >> >> > possibilities as you mentioned, but ideally there should be a > > common > > > >> >> > implementation for metadata coordination that could handle > these > > > >> cases. > > > >> >> > > > > >> >> > JR7. 
> > > >> >> > Yes, your expectation is totally reasonable, we should expect > > the get > > > >> >> and > > > >> >> > put operations to be strongly consistent for the > read-after-write > > > >> >> > scenarios. And I think that since major cloud providers offer > strongly > > > >> >> > consistent object storage, it should be sufficient for a wide > > > >> >> user-group. > > > >> >> > So we could shrink the scope of the KIP a bit this way and > avoid > > > >> adding > > > >> >> > complexity that is needed mostly on the margin. > > > >> >> > I expect, though, that "list" can stay eventually consistent > > as the > > > >> >> KIP > > > >> >> > relies on it only for garbage collection, where it is fine if a > > few > > > >> >> segments > > > >> >> > can be collected only in the next iteration. > > > >> >> > > > > >> >> > JR3. > > > >> >> > Since Greg hasn't replied yet, I'll try to catch up with him > and > > > >> >> formulate > > > >> >> > an answer next week. > > > >> >> > > > > >> >> > Best, > > > >> >> > Viktor > > > >> >> > > > > >> >> > On Tue, Apr 21, 2026 at 8:16 PM Jun Rao via dev < > > > >> [email protected]> > > > >> >> > wrote: > > > >> >> > > > > >> >> >> Hi, Victor, > > > >> >> >> > > > >> >> >> Thanks for the reply. > > > >> >> >> > > > >> >> >> JR1. > > > >> >> >> 1) "So while it seems to be significant that we tripled the > > number > > > >> of > > > >> >> >> PUTs, cost-wise it doesn't seem to be significant." > > > >> >> >> Let's compare the savings achieved by replacing network > > replication > > > >> >> >> transfer with S3 puts in AWS. > > > >> >> >> network transfer cost: $0.02/GB = $2 * 10^-5/MB > > > >> >> >> S3 put cost: $0.005 per 1000 requests = $0.5 * 10^-5/request > > > >> >> >> > > > >> >> >> The KIP batches data up to 4MB. So, let's assume that we write > > 2MB > > > >> S3 > > > >> >> >> objects on average. 
> > > >> >> >> > > > >> >> >> The cost for transferring 2MB through the network is 2 * 2 * > > 10^-5 = > > > >> >> $4* > > > >> >> >> 10^-5 > > > >> >> >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The > > savings > > > >> >> are > > > >> >> >> about 75%. > > > >> >> >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The > > savings > > > >> >> are > > > >> >> >> 25%. As you can see, the savings are significantly lower. > > > >> >> >> > > > >> >> >> 2) "Therefore we could expect classic local segments to be > > present > > > >> >> which > > > >> >> >> could be used for catching up consumers." > > > >> >> >> Note that local storage could be lost on reassigned > partitions. > > In > > > >> that > > > >> >> >> case, lagging reads can only be served from the object store. > > > >> >> >> > > > >> >> >> "Regarding the amount of metadata: 2MB/sec is well below the > > 2GB/s > > > >> >> >> throughput that Greg calculated previously, so I think it > > should be > > > >> >> >> manageable for a cluster with that amount of throughput," > > > >> >> >> It seems that you didn't make the correct comparison. 2GB/s > that > > > >> Greg > > > >> >> >> mentioned is the throughput for the whole cluster. The > 2MB/sec I > > > >> >> quoted is > > > >> >> >> for a specific broker. Depending on the broker instance type, > a > > > >> broker > > > >> >> may > > > >> >> >> only be able to handle low 10s of MB/sec of data. So, 2MB/sec > > > >> overhead > > > >> >> is > > > >> >> >> significant. > > > >> >> >> > > > >> >> >> 3) "I'd separate it from the discussion of diskless core and > > > >> perhaps we > > > >> >> >> could address it in a separate KIP as it is mostly a redesign > > of the > > > >> >> >> RLMM." > > > >> >> >> Those problems don't exist in the existing usage of RLMM. They > > > >> manifest > > > >> >> >> because diskless tries to use RLMM in a way it wasn't designed > > for > > > >> >> (there > > > >> >> >> is at least a 20X increase in metadata). 
It would be useful to > > > >> consider > > > >> >> >> whether fixing those problems in RLMM or using a new approach > is > > > >> >> >> better. For example, KIP-1164 already introduces a > snapshotting > > > >> >> mechanism. > > > >> >> >> Adding another snapshotting mechanism to RLMM seems redundant. > > > >> >> >> > > > >> >> >> JR7. A typical object store supports 3 operations: puts, gets > > and > > > >> >> lists. > > > >> >> >> Which operations used by diskless can be eventually > consistent? > > I'd > > > >> >> expect > > > >> >> >> that get should always see the result of the latest put. > > > >> >> >> > > > >> >> >> Jun > > > >> >> >> > > > >> >> >> On Mon, Apr 20, 2026 at 8:14 AM Viktor Somogyi-Vass < > > > >> [email protected] > > > >> >> > > > > >> >> >> wrote: > > > >> >> >> > > > >> >> >> > Hi Jun, > > > >> >> >> > > > > >> >> >> > I'd like to add my thoughts too until Greg has time to > > respond. > > > >> >> >> > > > > >> >> >> > JR1. I also think there are shortcomings in the current > tiered > > > >> >> storage > > > >> >> >> > design, around the RLMM. > > > >> >> >> > 1) I think this is a correct observation, however if my > > > >> calculations > > > >> >> are > > > >> >> >> > correct, it actually comes down to a negligible amount of > > cost. 
> > > >> >> Taking > > > >> >> >> the > > > >> >> > AWS pricing sheet at > > > >> >> > > > > >> >> > > > >> > > > https://aws.amazon.com/s3/pricing/?nc2=h_pr_s3&trk=aebc39a1-139c-43bb-8354-211ac811b83a&sc_channel=ps > > > >> >> >> > it seems like the difference between 6 or 2 PUTs per second is > > > >> ~$52 > > > >> >> for > > > >> >> >> a > > > >> >> >> > month. The calculation follows > > > >> >> >> > as: 6*60*60*24*30*0.005/1000-2*60*60*24*30*0.005/1000=$51.84. > So > > > >> >> while > > > >> >> >> it > > > >> >> >> > seems to be significant that we tripled the number of PUTs, > > > >> >> cost-wise it > > > >> >> >> > doesn't seem to be significant. > > > >> >> >> > 2) Reflecting on your original problem: the tiered storage > > > >> >> consolidation > > > >> >> >> > process should be continuously running and transforming WAL > > > >> segments > > > >> >> >> into > > > >> >> >> > classic logs. Therefore we could expect classic local > > segments to > > > >> be > > > >> >> >> > present which could be used for catching up consumers. So > they > > > >> would > > > >> >> >> only > > > >> >> >> > switch to WAL reading when they're close to the end of the > > log. > > > >> Since > > > >> >> >> this > > > >> >> >> > offset space should be cached, the reads from there should > be > > > >> fast. 
> > > >> >> > Regarding the amount of metadata: 2MB/sec is well below the > > 2GB/s > > > >> >> > throughput that Greg calculated previously, so I think it > > should > > > >> be > > > >> >> > manageable for a cluster with that amount of throughput, > > although > > > >> I > > > >> >> agree > > > >> >> > with your comment that the current topic based tiered metadata > > > >> manager > > > >> >> > isn't optimal and we could develop a better solution. > > > >> >> > 3) Tied to the previous point, I agree that your comments are > > > >> absolutely > > > >> >> > valid, however, similarly, I'd separate this from the > > > >> discussion of > > > >> >> > diskless core and perhaps we could address it in a separate > > KIP as > > > >> it is > > > >> >> > mostly a redesign of the RLMM. > > > >> >> > > > > >> >> > JR2. Ack. We will raise a KIP in the near future. > > > >> >> > > > > >> >> > JR3. I'd leave answering this to Greg as I don't have too much > > > >> context > > > >> >> on > > > >> >> > this one. > > > >> >> > > > > >> >> > JR7. I think this could be similar to the tiered storage > > design, > > > >> so > > > >> >> any > > > >> >> > coordinator operation should be strongly consistent (since > > we're > > > >> >> using > > > >> >> > classic topics there). Therefore the WAL segment storage layer > > > >> could > > > >> >> be > > > >> >> > eventually consistent as we store its metadata in a strongly > > > >> >> consistent > > > >> >> > manner. I'm not sure though if this is the answer you were > > looking > > > >> >> for? > > > >> >> > > > > >> >> > Best, > > > >> >> > Viktor > > > >> >> > > > > >> >> > > > > >> >> > > > > >> >> > On Thu, Mar 26, 2026 at 11:43 PM Jun Rao via dev < > > > >> >> [email protected]> > > > >> >> > wrote: > > > >> >> > > > > >> >> >> Hi, Greg, > > > >> >> >> > > > >> >> >> Thanks for the reply. 
> > > >> >> >> >> > > > >> >> >> >> JR1. Rolling log segments every 15 minutes addresses the 3 > > > >> concerns > > > >> >> I > > > >> >> >> >> listed, but it introduces some new issues because it > doesn't > > > >> quite > > > >> >> fit > > > >> >> >> the > > > >> >> >> >> design of the current tiered storage. (a) The current > tiered > > > >> storage > > > >> >> >> >> design > > > >> >> >> >> stores a single partition per object. If we roll a log > > segment > > > >> >> every 15 > > > >> >> >> >> minutes, with 4K partitions per broker, this means an > > additional > > > >> 4 > > > >> >> S3 > > > >> >> >> puts > > > >> >> >> >> per second. The diskless design aims for 2 S3 puts per > > second. > > > >> So, > > > >> >> this > > > >> >> >> >> triples the S3 put cost and reduces the savings benefits. > (b) > > > >> With > > > >> >> Tier > > > >> >> >> >> storage, each broker essentially needs to read the tier > > metadata > > > >> >> from > > > >> >> >> all > > > >> >> >> >> tier metadata partitions if the number of user partitions > > exceeds > > > >> >> 50. > > > >> >> >> >> Assuming that we generate 100 bytes of tier metadata per > > > >> partition > > > >> >> >> every > > > >> >> >> >> 15 > > > >> >> >> >> minutes. Assuming that each broker has 4K partitions and a > > > >> cluster > > > >> >> of > > > >> >> >> 500 > > > >> >> >> >> brokers. Each broker needs to receive tier metadata at a > > rate of > > > >> >> 100 * > > > >> >> >> 4K > > > >> >> >> >> * > > > >> >> >> >> 500 / (15 * 60) = 200KB/Sec. For a broker hosting one of > the > > 50 > > > >> tier > > > >> >> >> >> metadata topic partitions, it needs to send out metadata at > > 100 * > > > >> >> 4K * > > > >> >> >> 500 > > > >> >> >> >> / 50 * 500 / (15 * 60) = 2MB/Sec. This increases > unnecessary > > > >> network > > > >> >> >> and > > > >> >> >> >> CPU overhead. (c) Tier storage doesn't support snapshots. 
A > > >> >> restarted > > > >> >> >> broker needs to replay the tier metadata log from the > beginning > > to > > > >> >> build > > > >> >> >> the tier metadata state. Suppose that the tier metadata log > is > > > >> kept > > > >> >> for 7 > > > >> >> >> days. The total amount of tier metadata that needs to be > > replayed is > > > >> >> 200KB > > > >> >> >> * 7 * 24 * 3600 = 120GB. > > > >> >> >> Does the merging optimization you mentioned address those new > > > >> >> concerns? If > > > >> >> >> so, could you describe how it works? > > > >> >> >> > > > >> >> >> JR2. It's fine to cover the default partition assignment > strategy > > > >> >> for > > > >> >> >> diskless topics in a separate KIP. However, since this is > > essential > > > >> >> for > > > >> >> >> achieving the cost saving goal, we need a solution before > > releasing > > > >> >> the > > > >> >> >> diskless KIP. > > > >> >> >> > > > >> >> >> JR3. Sounds good. Could you document how this works? > > > >> >> >> > > > >> >> >> JR7. Could you describe which parts of the operation can be > > > >> >> eventually > > > >> >> >> consistent? > > > >> >> >> > > > >> >> >> Jun > > > >> >> >> > > > >> >> >> On Thu, Mar 19, 2026 at 1:35 PM Greg Harris < > > > >> [email protected]> > > > >> >> >> wrote: > > > >> >> >> > > > >> >> >> > Hi Jun, > > > >> >> >> > > > > >> >> >> > Thanks for your comments! > > > >> >> >> > > > > >> >> >> > JR1: > > > >> >> >> > You are correct that the segment rolling configurations > are > > > >> >> currently > > > >> >> >> > critical to balance the scalability of Diskless and > Tiered > > > >> >> Storage, > > > >> >> >> as > > > >> >> >> > larger roll configurations benefit tiered storage, and > > smaller > > > >> >> roll > > > >> >> >> > configurations benefit Diskless. 
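Jun's tier-metadata overhead estimate in JR1 (b) and (c) above can be reproduced the same way. All constants below (100 bytes of metadata per partition per roll, 4K partitions per broker, 500 brokers, 50 metadata partitions, a 15-minute roll, 7-day retention) are the assumptions stated in the thread; the exact results land slightly above the rounded 200KB/Sec and 120GB figures quoted there:

```python
# Reproduces the tier-metadata overhead estimate from JR1 (b) and (c).
BYTES_PER_SEGMENT = 100          # tier metadata generated per partition per roll
PARTITIONS_PER_BROKER = 4_000
BROKERS = 500
ROLL_INTERVAL_S = 15 * 60        # 15-minute segment roll
METADATA_PARTITIONS = 50         # tier metadata topic partitions
RETENTION_S = 7 * 24 * 3600      # 7-day metadata retention

# Every broker consumes the tier metadata produced by the whole cluster.
receive_rate = BYTES_PER_SEGMENT * PARTITIONS_PER_BROKER * BROKERS / ROLL_INTERVAL_S
print(f"receive: {receive_rate / 1e3:.0f} KB/s")  # ~222 KB/s (rounded to 200KB/Sec above)

# A broker hosting one of the 50 metadata partitions fans its slice out to 500 brokers.
send_rate = receive_rate / METADATA_PARTITIONS * BROKERS
print(f"send: {send_rate / 1e6:.1f} MB/s")        # ~2.2 MB/s

# Replay on restart: the full retained metadata log.
replay = receive_rate * RETENTION_S
print(f"replay: {replay / 1e9:.0f} GB")           # ~134 GB (~120GB with the rounded rate)
```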
> > > > > To address your points specifically:
> > > > > (1) A Diskless topic which is cost-competitive with an equivalent
> > > > > Classic topic will have a metadata size <1% of the data size. A
> > > > > cluster storing 360GB of metadata will have >36TB of data under
> > > > > management, and a retention of 5hr implies a throughput of >2GB/s.
> > > > > This will require multiple Diskless coordinators, which can share
> > > > > the load of storing the Diskless metadata and serving Diskless
> > > > > requests.
> > > > > (2) Catching-up consumers are intended to be served from tiered
> > > > > storage and local segment caches. Brokers which are building their
> > > > > local segment caches will have to read many files, but will
> > > > > amortize those reads by receiving data for multiple partitions in
> > > > > a single read.
> > > > > (3) This is a fundamental downside of storing data from multiple
> > > > > topics in a single object, similar to classic segments. We can
> > > > > implement a configurable cluster-wide maximum roll time, which
> > > > > would set the slowest cadence at which Tiered Storage segments are
> > > > > rolled from Diskless segments. If an individual partition has more
> > > > > aggressive roll settings, it may be rolled earlier.
> > > > > This configuration would permit the cluster operator to
> > > > > approximately bound the number of diskless WAL segments, which
> > > > > bounds the total size of the WAL segments, disk cache, diskless
> > > > > coordinator state, and excessive retention window. For example, a
> > > > > diskless.segment.ms of 15 minutes would reduce the metadata
> > > > > storage to 18GB, WAL segments to 1.8TB, and permit short-retention
> > > > > data to be physically deleted as soon as ~15 minutes after being
> > > > > produced.
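The 18GB and 1.8TB figures above follow directly from JR1's 5-hour numbers; a quick check (hypothetical Python, reusing the thread's own inputs):

```python
# Sanity check: bounding WAL-segment lifetime at diskless.segment.ms =
# 15 minutes shrinks the diskless state to 1/20 of the 5-hour (300-minute)
# worst case from JR1.
metadata_gb_5h = 360       # coordinator metadata, 100K partitions, 5h roll
data_tb_5h = 36            # >36TB of data under management, from (1) above
fraction = 15 / (5 * 60)   # 15-minute roll vs. 5-hour roll = 1/20

print(metadata_gb_5h * fraction)  # 18.0 -> ~18GB of metadata
print(data_tb_5h * fraction)      # 1.8  -> ~1.8TB of WAL segments
```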
> > > > > Of course, this will reduce the size of the tiered storage
> > > > > segments for topics that have low throughput and where segment.ms >
> > > > > diskless.segment.ms, increasing overhead in the RLMM. We can
> > > > > perform merging/optimization of Tiered Storage segments to achieve
> > > > > the per-topic segment.ms.
> > > > > There were some reasons why we retracted the prior file-merging
> > > > > approach, and why merging in tiered storage appears better:
> > > > > * Rewriting files requires mutability for existing data, which
> > > > > adds complexity. Diskless batches or Remote Log Segments would
> > > > > need to be made mutable, and the remote log will be made mutable
> > > > > in KIP-1272 [1].
> > > > > * Because a WAL Segment can contain batches from multiple Diskless
> > > > > Coordinators, multiple coordinators must also be involved in the
> > > > > merging step. The Tiered Storage design has exclusive ownership of
> > > > > remote log segments within the RLMM.
> > > > > * Diskless file merging competes for resources with
> > > > > latency-sensitive producers and hot consumers. Tiered storage file
> > > > > merging competes for resources with lagging consumers, which are
> > > > > typically less latency sensitive.
> > > > > * Implementing merging in Tiered Storage allows this optimization
> > > > > to benefit both classic topics and diskless topics, covering both
> > > > > high and low throughput partitions.
> > > > > * Remote log segments may be optimized over much longer time
> > > > > windows, rather than performing optimization once in the first few
> > > > > hours of the life of a WAL segment and then freezing the
> > > > > arrangement of the data until it is deleted.
> > > > > * File merging will need to rely on heuristics, which should be
> > > > > configurable by the user. Multi-partition heuristics are more
> > > > > complicated to describe and reason about than single-partition
> > > > > heuristics.
> > > > > What do you think of this alternative?
> > > > >
> > > > > JR2:
> > > > > Yes, the current default partition assignment strategy will need
> > > > > some improvement. This problem with Diskless WAL segments is
> > > > > analogous to the Classic topics' dense inter-broker connection
> > > > > graph.
> > > > > The natural solution to this seems to be some sort of cellular
> > > > > design, where the replica placements tend to locate partitions in
> > > > > similar groups. Partitions in the same cell can generally share
> > > > > the same WAL Segments and the same Diskless Coordinator requests.
> > > > > This would also benefit Classic topics, which would need fewer
> > > > > connections and fetch requests.
> > > > > Such a feature is out of scope for this KIP; either we will
> > > > > publish a follow-up KIP, or let operators and community tooling
> > > > > address this.
> > > > >
> > > > > JR3:
> > > > > Yes, we will replace the ISR/ELR election logic for diskless
> > > > > topics, as they no longer rely on replicas for data integrity. We
> > > > > will fully model the state/lifecycle of the diskless replicas in
> > > > > KRaft, and choose how we display this to clients.
> > > > > For backwards compatibility, clients using older metadata requests
> > > > > should see diskless topics, but interpret them as classic topics.
> > > > > We could tell older clients that the leader is in the ISR, even if
> > > > > it just started building its cache.
> > > > > For clients using the latest metadata, they should see the true
> > > > > state of the diskless partition: which nodes can accept
> > > > > produce/fetch/sharefetch requests, which ranges of offsets are
> > > > > cached on-broker, etc. This could also be used to break apart the
> > > > > "leader" field into more granular fields, now that leadership has
> > > > > changed meaning.
> > > > >
> > > > > JR4:
> > > > > Yes, we can replace the empty fetch requests to the leader nodes
> > > > > with cache hint fields in the requests to the Diskless
> > > > > Coordinator, and rely on the coordinator to distribute cache hints
> > > > > to all replicas. This should be low-overhead, and eliminate the
> > > > > inter-broker communication for brokers which only host Diskless
> > > > > topics.
> > > > >
> > > > > JR5.1:
> > > > > You are correct, and this text was ambiguous, only specifying that
> > > > > the controller waits for the sync to be complete. This section is
> > > > > now updated to explicitly say that local segments are built from
> > > > > object storage.
> > > > >
> > > > > JR5.2:
> > > > > Extending the JR2 discussion, reassignment of diskless topics
> > > > > would generally happen within a cell, where the marginal cost of
> > > > > reading an additional partition is very low. When cells are
> > > > > re-balanced and a partition is migrated between cells, there is a
> > > > > brief time (until the next Tiered Storage segment roll) when the
> > > > > marginal cost is doubled. This should be infrequent and well
> > > > > amortized by other topics which aren't being re-balanced between
> > > > > cells.
> > > > >
> > > > > JR6.1:
> > > > > We plan to move data from Diskless to Tiered Storage. Once the
> > > > > data is in Tiered Storage, it can be compacted using the
> > > > > functionality described in KIP-1272 [1].
> > > > >
> > > > > JR6.2:
> > > > > We will add details for this soon.
> > > > >
> > > > > JR7:
> > > > > We specify the requirement of eventual consistency to allow
> > > > > Diskless Topics to be used with other object storage
> > > > > implementations which aren't the three major public clouds, such
> > > > > as self-managed software or weaker consistency caches.
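For context on JR7: one common way a client tolerates an eventually consistent object store is to retry a not-yet-visible read with backoff. This is a hypothetical sketch of that general pattern, not an API or mechanism from the KIP; `fetch` and its error convention are illustrative assumptions:

```python
# Hypothetical sketch: a reader that learns about a WAL object from the
# coordinator before the object is visible in an eventually consistent
# store can retry the GET with exponential backoff instead of failing.
import time

def get_with_backoff(fetch, key, attempts=5, base_delay=0.05):
    """fetch(key) returns bytes, or raises KeyError if not yet visible."""
    for attempt in range(attempts):
        try:
            return fetch(key)
        except KeyError:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise TimeoutError(f"object {key!r} not visible after {attempts} attempts")
```

On a strongly consistent store (as all three major public clouds now provide) the first attempt always succeeds and the backoff path is never taken.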
> > > > > Thanks,
> > > > > Greg
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage
> > > > >
> > > > > On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev <[email protected]> wrote:
> > > > >
> > > > > > Hi, Ivan,
> > > > > >
> > > > > > Thanks for the KIP. A few comments below.
> > > > > >
> > > > > > JR1. I am concerned about the usage of the current tiered
> > > > > > storage to control the number of small WAL files. Current tiered
> > > > > > storage only tiers the data when a segment rolls, which can take
> > > > > > hours. This causes three problems. (1) Much more metadata needs
> > > > > > to be stored and maintained, which increases the cost. Suppose
> > > > > > that each segment rolls every 5 hours, each partition generates
> > > > > > 2 WAL files per second, and each WAL file's metadata takes 100
> > > > > > bytes. Each partition will generate 5 * 3.6K * 2 * 100 = 3.6MB
> > > > > > of metadata. In a cluster with 100K partitions, this translates
> > > > > > to 360GB of metadata stored on the diskless coordinators. (2) A
> > > > > > catching-up consumer's performance degrades since it's forced to
> > > > > > read data from many small WAL files. (3) The data in WAL files
> > > > > > could be retained much longer than the retention time. Since the
> > > > > > small WAL files aren't completely deleted until all partitions'
> > > > > > data in them is obsolete, the deletion of the WAL files could be
> > > > > > delayed by hours or more. If a WAL file includes a partition
> > > > > > with a low retention time, the retention contract could be
> > > > > > violated significantly. The earlier design of the KIP included a
> > > > > > separate object merging process that combines small WAL files
> > > > > > much more aggressively than tiered storage, which seems to be a
> > > > > > much better choice.
> > > > > >
> > > > > > JR2. I don't think the current default partition assignment
> > > > > > strategy for classic topics works for diskless topics. The
> > > > > > current strategy tries to spread the replicas over as many
> > > > > > brokers as possible. For example, if a broker has 100
> > > > > > partitions, their replicas could be spread over 100 brokers. If
> > > > > > the broker generates a WAL file with 100 partitions, this WAL
> > > > > > file will be read 100 times, once by each broker. S3 read cost
> > > > > > is 1/12 of the cost of an S3 put. This assignment strategy will
> > > > > > increase the S3 cost by about 8X, which is prohibitive. We need
> > > > > > to design a cost-effective assignment strategy for diskless
> > > > > > topics.
> > > > > >
> > > > > > JR3. We need to think through the leader election logic with
> > > > > > diskless topics. The KIP tries to reuse the ISR logic for
> > > > > > classic topics, but it doesn't seem very natural.
> > > > > > JR3.1 In classic topics, the leader is always in the ISR. In
> > > > > > diskless topics, the KIP says that a leader could be out of
> > > > > > sync.
> > > > > > JR3.2 The existing leader election logic based on ISR/ELR mainly
> > > > > > tries to preserve previously acknowledged data. With diskless
> > > > > > topics, since the object store provides durability, this logic
> > > > > > seems no longer needed. The existing min.isr and unclean leader
> > > > > > election logic also don't apply.
> > > > > >
> > > > > > JR4. "Despite that there is no inter-broker replication,
> > > > > > replicas will still issue FetchRequest to leaders. Leaders will
> > > > > > respond with empty (no records) FetchResponse."
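The "about 8X" figure in JR2 above can be reproduced from the stated assumptions (each WAL object is written once, read back by 100 brokers, and a GET is priced at 1/12 of a PUT); a sketch:

```python
# Back-of-envelope check of JR2's "about 8X" S3 cost figure.
readers = 100            # brokers that each read the WAL object once
get_to_put_ratio = 1 / 12  # an S3 GET costs ~1/12 of a PUT

# Total request cost in PUT-equivalents: 1 PUT + 100 GETs.
total = 1 + readers * get_to_put_ratio
print(f"{total:.1f}x the PUT cost")  # 9.3x, i.e. roughly an 8X increase over the PUT alone
```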
> > > > > > This seems unnatural. Could we avoid issuing inter-broker fetch
> > > > > > requests for diskless topics?
> > > > > >
> > > > > > JR5. "The replica reassignment will follow the same flow as in
> > > > > > classic topics:"
> > > > > > JR5.1 Is this true? Since the inter-broker fetch response is
> > > > > > always empty, it doesn't seem the current reassignment flow
> > > > > > works for diskless topics. Also, since the source of the data is
> > > > > > the object store, it seems more natural for a replica to
> > > > > > backfill the data from the object store instead of from other
> > > > > > replicas. This will also incur lower costs.
> > > > > > JR5.2 How do we prevent reassignment on diskless topics from
> > > > > > causing the same cost issue described in JR2?
> > > > > >
> > > > > > JR6. "In other functional aspects, diskless topics are
> > > > > > indistinguishable from classic topics. This includes durability
> > > > > > guarantees, ordering guarantees, transactional and
> > > > > > non-transactional producer API, consumer API, consumer groups,
> > > > > > share groups, data retention (deletion & compact),"
> > > > > > JR6.1 Could you describe how compacted diskless topics are
> > > > > > supported?
> > > > > > JR6.2 Neither this KIP nor KIP-1164 describes the transactional
> > > > > > support in detail.
> > > > > >
> > > > > > JR7. "Object Storage: A shared, durable, concurrent, and
> > > > > > eventually consistent storage supporting arbitrary sized byte
> > > > > > values and a minimal set of atomic operations: put, delete,
> > > > > > list, and ranged get."
> > > > > > It seems that the object storage in all three major public
> > > > > > clouds is strongly consistent.
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko <[email protected]> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > The parent KIP-1150 was voted for and accepted. Let's now
> > > > > > > focus on the technical details presented in this KIP-1163 and
> > > > > > > also in KIP-1164: Diskless Coordinator [1].
> > > > > > >
> > > > > > > Best,
> > > > > > > Ivan
> > > > > > >
> > > > > > > [1]
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator
> > > > > > >
> > > > > > > On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:
> > > > > > > > Hi all!
> > > > > > > >
> > > > > > > > We want to start the discussion thread for KIP-1163:
> > > > > > > > Diskless Core [1], which is a sub-KIP for KIP-1150 [2].
> > > > > > > >
> > > > > > > > Let's use the main KIP-1150 discuss thread [3] for
> > > > > > > > high-level questions, motivation, and general direction of
> > > > > > > > the feature, and this thread for particular details of
> > > > > > > > implementation.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Ivan
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
> > > > > > > > [2]
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > > > > > > > [3]
> > > > > > > > https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d
