Jun & Satish,

We can build the merging step to optimize WAL segments for more predictable
rebuild times. But could we still perform a final move to Tiered Storage
once each partition reaches its configured roll time? We could then expect
the same load/sizing characteristics as classic topics (e.g. >1 GB segments).
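For reference, the replication-vs-puts savings and the segment sizing can be
sanity-checked with quick arithmetic (a sketch using the AWS list prices Jun
quoted downthread; the 300 KB/s partition throughput is a made-up
illustration, not a number from the KIP):

```python
# Prices from Jun's figures downthread (AWS list prices):
# cross-AZ replication transfer $0.02/GB, S3 PUT $0.005 per 1000 requests.
NETWORK_COST_PER_MB = 2e-5  # dollars per MB transferred
S3_PUT_COST = 0.5e-5        # dollars per PUT request

def replication_savings(mb_per_sec: float, puts_per_sec: int) -> float:
    """Fraction saved by replacing replication transfer with S3 puts."""
    network = mb_per_sec * NETWORK_COST_PER_MB
    puts = puts_per_sec * S3_PUT_COST
    return 1.0 - puts / network

print(replication_savings(4, 2))  # 0.875 -> the 87.5% figure in the thread
print(replication_savings(4, 6))  # 0.625 -> the 62.5% figure in the thread

# Segment sizing: a partition sustaining ~300 KB/s (hypothetical) rolled
# hourly already exceeds the 1 GB classic-segment expectation.
segment_gb = 300e3 * 3600 / 1e9
print(segment_gb)  # 1.08 GB
```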

We are interested in unifying with Tiered Storage for many reasons, not
least so that topics whose diskless mode is dynamically enabled or disabled
can eventually converge to a predictable state.
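The metadata-scale concern Jun raises downthread can be reproduced with the
same back-of-envelope numbers (all inputs are his hypothetical cluster, not
measurements: 100 bytes of tier metadata per partition per roll, 4K
partitions per broker, 500 brokers, 15-minute rolls, 7-day retention):

```python
# Inputs are Jun's hypothetical cluster from downthread, not measurements.
metadata_bytes = 100          # tier metadata per partition per segment roll
partitions_per_broker = 4_000
brokers = 500
roll_interval_s = 15 * 60

# Rate at which every broker must ingest tier metadata (rounded to 200KB/s
# in the thread).
inbound_bps = metadata_bytes * partitions_per_broker * brokers / roll_interval_s
print(inbound_bps)  # ~222222 bytes/sec

# Tier metadata to replay on restart with 7-day retention (quoted as ~120GB).
replay_gb = 200e3 * 7 * 24 * 3600 / 1e9
print(replay_gb)  # ~121 GB
```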

Thanks,
Greg

On Wed, May 13, 2026, 3:56 AM Satish Duggana <[email protected]>
wrote:

> RLMM was not designed for aggressive copying of the latest data to
> tiered storage by having small segment rollouts.
>
> +1 to Jun on leaving the existing RLMM for classic topics with tiered
> storage and having an efficient metadata management system required
> for diskless topics.
>
>
> On Tue, 12 May 2026 at 23:59, Jun Rao via dev <[email protected]>
> wrote:
> >
> > Hi, Viktor,
> >
> > Thanks for the reply.
> >
> > JR1. (A) and (B) Yes, your summary matches my thinking.
> > (C) "Generally I think that (i) (ii) (iii) and (iv) may be addressed with
> > an aggressive tiered storage consolidation (the first approach)".
> > Hmm, I am confused by the above statement. By "the first approach", do
> you
> > mean aggressive tiering with faster segment rolling through the existing
> > RLMM? I don't think the existing RLMM is designed to solve these issues
> due
> > to inefficiencies in cost, metadata propagation and metadata storage as
> we
> > previously discussed.
> >
> > JR11. I was thinking we leave the existing RLMM as is and continue to use
> > it for classic topics. We design a new, more efficient metadata
> management
> > component independent of RLMM. This new component will be the only
> metadata
> > component that diskless topics depend on.
> >
> > Jun
> >
> > On Tue, May 12, 2026 at 8:43 AM Viktor Somogyi-Vass <[email protected]>
> > wrote:
> >
> > > Hi Jun,
> > >
> > > JR1
> > > (1)-(2)-(3) I'd address these together. Let me explain our current idea
> > > for solving the tiny object problem, because I'm not sure we're 100%
> > > talking about the same thing. I have two approaches in mind for TS
> > > consolidation ((A) and (B)) and I'm not sure we're both assuming the
> > > same idea, so let's clarify this.
> > >
> > > (A)
> > > This is our current assumption. This uses local disks (create classic
> > > local logs with UnifiedLog) to consolidate logs into the classic log
> format
> > > and use RSM and RLMM to store them in tiered storage. This way we're
> not
> > > limited by the need to have short rollovers. Local logs become a form
> of
> > > staging environment to serve reads and accumulate records for tiered
> > > storage. This means that:
> > >  (a) Once a message is consolidated into the classic log format, we can
> > > use it for serving lagging consumers. Diskless reads should really be
> used
> > > for the head of the log and after a few seconds logs should be
> consolidated.
> > >  (b) The real cost is much closer to that 87.5% (and in fact my google
> > > sheet I shared also assumes this model) because we have more freedom in
> > > choosing the retention parameters of the classic log.
> > >  (c) Metadata is smaller as we only need to keep diskless segments
> until
> > > the tiered offset surpasses the individual batches' offset.
> > >  (d) RLMM metadata is also somewhat manageable due to the larger
> segment
> > > sizes but it's still possible to run into the metadata explosion
> problem.
> > >  (e) It needs to rebuild this local log on reassignment to serve
> lagging
> > > consumers effectively, so reassignment is a bit more messy.
> > >  (f) It's not optimal when partitions have a single replica: on
> failure we
> > > can only fall back to diskless mode until the partition is reassigned
> to a
> > > functioning broker.
> > >
> > > (B)
> > > Compared to the above there can be an alternative approach, which is to
> > > consolidate when diskless segments expire (after 15 minutes for
> instance).
> > > In that case your points seem to fit better as:
> > >  (a) we can only use the classic, consolidated logs to serve lagging
> > > consumers after they have been tiered
> > >  (b) to be more efficient with lagging consumers we have to stick to a
> > > short rollover
> > >  (c) it's more costly due to the short rollovers
> > >  (d) the RLMM bottleneck still exists due to the short rollovers
> > >  (e) it's not decided whether we use local disks for transforming logs
> > > as we can do it in memory too (which can be inefficient and more
> > > expensive), but perhaps the “chunked transfer encoding” that S3 supports
> > > (or similar features from other providers) is a cost-effective way. If
> > > we know the final size in advance, we can upload data in chunks and
> > > still get billed for 1 put.
> > >  (f) more efficient reassignment or failover is cleaner and faster as
> > > there isn't a need to rebuild local caches.
> > >
> > > (C)
> > > Apart from the first 2 approaches there is a 3rd, which is WAL
> merging. To
> > > understand your points, let me summarize that I could gather so far as
> > > reasons for WAL merging (and please correct me if I missed something):
> > >  (i) protecting consumer lag: small WAL files create inefficient
> objects
> > > for lagging consumers, so larger objects should be more efficient
> > >  (ii) avoiding the RLMM replay bottleneck: managing small segments with
> > > RLMM is very inefficient (100s of GB metadata)
> > >  (iii) reducing batch metadata overhead: merging WAL files may reduce
> the
> > > metadata we need to store, but it depends on the merge algorithm and
> how we
> > > can compact batch data
> > >  (iv) cost effectiveness: retrieving merged WAL files reduces the
> number
> > > of get requests to object storage
> > >  (v) architectural redundancy with RLMM: ideally we wouldn't need 2
> > > solutions to 2 somewhat similar problems (tiered storage and diskless)
> > >
> > > Generally I think that (i) (ii) (iii) and (iv) may be addressed with an
> > > aggressive tiered storage consolidation (the first approach), so the
> only
> > > remaining gap would be (v). I also agree that having 2 different
> solutions
> > > for metadata handling isn't ideal and perhaps there is a possibility of
> > > improvement here. It should be possible to redesign RLMM to be more
> similar
> > > to the diskless coordinator or design a common solution.
> > >
> > > JR11
> > > "If we support merging in the diskless coordinator, I wonder how useful
> > > RLMM
> > > is. It seems simpler to manage all metadata from the object store in a
> > > single place."
> > >
> > > Could you please clarify this a little bit? Do you think that we should
> > > replace the RLMM with a solution that is more similar to the diskless
> > > coordinator or deprecate tiered storage altogether in favor of
> diskless?
> > > I'm not sure which option you're referring to:
> > >  (1) Unify tiered storage and diskless under a single storage layer
> (and
> > > possibly deprecate tiered storage in favor of diskless with merging WAL
> > > segments).
> > >  (2) Create a smart coordinator instead of RLMM and possibly unify
> > > metadata coordination with diskless.
> > >  (3) Keep tiered storage and diskless separate with their own solutions
> > > for metadata (probably not optimal).
> > >
> > > Thanks,
> > > Viktor
> > >
> > > On Fri, May 1, 2026 at 9:08 PM Jun Rao via dev <[email protected]>
> > > wrote:
> > >
> > >> Hi, Viktor and Greg,
> > >>
> > >> Thanks for the reply.
> > >>
> > >> JR1.
> > >> 1) Thanks for verifying the cost estimation. I noticed a bug in my
> earlier
> > >> calculation. I estimated the per broker network transfer rate at
> 2MB/sec.
> > >> It should be 4MB/sec. If I correct it, the estimated savings are
> similar
> > >> to
> > >> yours.
> > >> The cost for transferring 4MB through the network is 4 * 2 * 10^-5 =
> $8*
> > >> 10^-5
> > >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings
> are
> > >> about 87.5%.
> > >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings
> are
> > >> 62.5%.
> > >> Savings are still significantly lower when using RLMM.
> > >>
> > >> "To me it seems like that Greg's previous suggestion for a 15 min
> rollover
> > >> may be a bit too much. With 1 hour we can achieve better cost saving
> and
> > >> less coordinator metadata being stored."
> > >> This solves the cost issue, but it has other implications (see point
> 2)
> > >> below).
> > >>
> > >> 2) "Yes, I think this is to be expected and a lot depends on the
> > >> implementation. Ideally segments or chunks should be cached to
> minimize
> > >> the
> > >> number of times segments pulled from remote storage."
> > >> In a classic topic, when a consumer lags, its requests are served
> either
> > >> from the local cache or from large objects in the object store. With
> the
> > >> current design in a diskless topic, lagging consumer requests might be
> > >> served from tiny 500-byte objects. This will significantly slow down
> the
> > >> consumer's catch-up, which is not expected user behavior. Ideally, we
> > >> don't
> > >> want those tiny objects to last more than a few minutes, let alone an
> > >> hour.
> > >>
> > >> 3) "I think if my calculations are correct (and we use a 60 minute
> > >> window),
> > >> then metadata generation should be slower, please see the google
> sheet I
> > >> linked above. I think given that traffic, the current topic based RLMM
> > >> should be able to handle it."
> > >> Why is a 60 minute window used? RLMM metadata needs to be retained
> for the
> > >> longest retention time among all topics. This means that the retention
> > >> window can be weeks instead of 1 hour. This means that RLMM might
> need to
> > >> replay over 100GB of data during reassignment, which is not what it is
> > >> designed for.
> > >>
> > >> JR10. "Your example of 100,000 1kb/s partitions is a borderline case,
> > >> where
> > >> there are some configurations which are not viable due to scale or
> cost,
> > >> and some that are. It would be up to the operator to tune their
> cluster,
> > >> by
> > >> changing diskless.segment.ms,
> > >> dividing up the cluster, or switching to a more scalable RLMM
> > >> implementation."
> > >> A broker with 4MB/sec produce throughput can probably be considered
> high
> > >> throughput. Even with 4K partitions per broker, we could still
> achieve an
> > >> 87.5% cost saving as listed above, if we do the right implementation.
> So,
> > >> ideally, it would be useful to support that as well.
> > >>
> > >> JR11. "We had a short conversation with Greg and we came to the
> conclusion
> > >> that because of the explosiveness of diskless metadata, it may be
> worth
> > >> revisiting the merging case as it can indeed buy us some more cost
> saving
> > >> for the added complexity. "
> > >> If we support merging in the diskless coordinator, I wonder how useful
> > >> RLMM
> > >> is. It seems simpler to manage all metadata from the object store in a
> > >> single place.
> > >>
> > >> Jun
> > >>
> > >> On Mon, Apr 27, 2026 at 4:17 PM Greg Harris <[email protected]>
> wrote:
> > >>
> > >> > Hi Jun,
> > >> >
> > >> > Thank you for scrutinizing the scalability of the current
> > >> > direct-to-tiered-storage strategy, and its metadata scalability.
> > >> >
> > >> > One of our implicit assumptions with this design was that users are
> able
> > >> > to choose between the Diskless and Classic mechanisms, and that any
> > >> > situations where the Diskless design was deficient, the Classic
> topics
> > >> > could continue to be used.
> > >> > This was originally applied to low-latency use-cases, but now also
> > >> applies
> > >> > to low-throughput use-cases too. When the throughput on a topic is
> low,
> > >> the
> > >> > benefit of using Diskless is also low, because it is proportional
> to the
> > >> > amount of data transferred, and it is more likely that the batch
> > >> overhead
> > >> > of the topics is significant.
> > >> > In other words, we've been treating cost-effective support for
> > >> arbitrarily
> > >> > low throughput topics as a non-goal.
> > >> >
> > >> > Your example of 100,000 1kb/s partitions is a borderline case, where
> > >> there
> > >> > are some configurations which are not viable due to scale or cost,
> and
> > >> some
> > >> > that are. It would be up to the operator to tune their cluster, by
> > >> changing
> > >> > diskless.segment.ms,
> > >> > dividing up the cluster, or switching to a more scalable RLMM
> > >> > implementation.
> > >> >
> > >> > Do you think we should have cost-effective support for arbitrarily
> > >> > low-throughput partitions in Diskless? How much total demand is
> there in
> > >> > partitions where batches are >1kb but the partition throughput is
> > >> <1kb/s?
> > >> >
> > >> > Thanks,
> > >> > Greg
> > >> >
> > >> > On Fri, Apr 24, 2026 at 10:23 AM Viktor Somogyi-Vass <
> [email protected]
> > >> >
> > >> > wrote:
> > >> >
> > >> >> Hi Jun,
> > >> >>
> > >> >> Regarding JR1.
> > >> >> We had a short conversation with Greg and we came to the conclusion
> > >> that
> > >> >> because of the explosiveness of diskless metadata, it may be worth
> > >> >> revisiting the merging case as it can indeed buy us some more cost
> > >> saving
> > >> >> for the added complexity. Also, it would support smaller topics
> and we
> > >> >> could somewhat manage the tiered storage consolidation costs. I
> think
> > >> that
> > >> >> we would still need to consolidate WAL segments into tiered
> storage.
> > >> >> Reasons are: to limit WAL metadata, to be able to dynamically
> > >> >> enable/disable diskless and to be compatible with existing and
> future
> > >> TS
> > >> >> improvements.
> > >> >> I'll try to refresh KIP-1165 and build it into the calculator
> above (if
> > >> >> it's possible at all :) ) and come back to you.
> > >> >> Regardless, I just wanted to give a short update in the meantime,
> > >> looking
> > >> >> forward to your answer.
> > >> >>
> > >> >> Best,
> > >> >> Viktor
> > >> >>
> > >> >> On Fri, Apr 24, 2026 at 3:46 PM Viktor Somogyi-Vass <
> > >> >> [email protected]>
> > >> >> wrote:
> > >> >>
> > >> >> > Hi Jun,
> > >> >> >
> > >> >> > Thanks for the quick reply.
> > >> >> >
> > >> >> > JR1.
> > >> >> > 1) Thanks for putting the numbers together. While your
> calculation
> > >> >> > seems to be correct in the sense that 6 PUTs would worsen the
> cost
> > >> >> saving
> > >> >> > benefits, I think that in a byte for byte comparison there is a
> > >> bigger
> > >> >> > difference. The reason is that the 4 tiered storage puts transfer
> > >> much
> > >> >> more
> > >> >> > data compared to the small WAL segments, so in practice there
> should
> > >> be
> > >> >> > fewer TS puts.
> > >> >> > I made a google sheet calculator for this which I'd like to share
> > >> with
> > >> >> > you:
> > >> >> >
> > >> >>
> > >>
https://docs.google.com/spreadsheets/d/127GOTWfFSN27B5ezif14GPj8KtrghjBqsXG9GG6NxhI/edit?gid=749470906#gid=749470906
> > >> >> > Please copy the sheet to modify the values.
> > >> >> > About my findings: I was trying to create a similar cluster model
> > >> that
> > >> >> has
> > >> >> > been discussed here previously to see how cost varies over
> different
> > >> >> > segment rollovers. To me it seems that Greg's previous
> suggestion
> > >> >> for a
> > >> >> > 15 min rollover may be a bit too much. With 1 hour we can achieve
> > >> better
> > >> >> > cost saving and less coordinator metadata being stored. I have
> also
> > >> >> tried to
> > >> >> > account for the producer batch metadata generated by diskless
> > >> partitions
> > >> >> > but to me it seems like a lower number than Greg's original
> numbers.
> > >> >> >
> > >> >> > 2) "Note that local storage could be lost on reassigned
> partitions.
> > >> In
> > >> >> > that case, lagging reads can only be served from the object
> store."
> > >> >> > Yes, I think this is to be expected and a lot depends on the
> > >> >> > implementation. Ideally segments or chunks should be cached to
> > >> minimize
> > >> >> the
> > >> >> > number of times segments pulled from remote storage.
> > >> >> >
> > >> >> > "The 2MB/sec I quoted is for a specific broker. Depending on the
> > >> broker
> > >> >> > instance type, a broker may only be able to handle low 10s of
> MB/sec
> > >> of
> > >> >> > data. So, 2MB/sec overhead is significant."
> > >> >> > Yes, I have indeed misunderstood, however I have updated my
> > >> calculator
> > >> >> > sheet with metadata calculation. Overall, the number of tiered
> > >> storage
> > >> >> > segments created seems to be much lower than in your calculations
> > >> given
> > >> >> the
> > >> >> > parameters of the cluster you specified earlier. Please take a
> look,
> > >> I'd
> > >> >> > like to really understand the thinking here because this is a
> crucial
> > >> >> point.
> > >> >> >
> > >> >> > 3) I think if my calculations are correct (and we use a 60 minute
> > >> >> window),
> > >> >> > then metadata generation should be slower, please see the google
> > >> sheet I
> > >> >> > linked above. I think given that traffic, the current topic based
> > >> RLMM
> > >> >> > should be able to handle it.
> > >> >> > In the case where we would need to make the RLMM capable of
> handling
> > >> a
> > >> >> > similar traffic as the diskless coordinator, then you're right,
> we
> > >> >> probably
> > >> >> > should consider how we can improve it. I think there are multiple
> > >> >> > possibilities as you mentioned, but ideally there should be a
> common
> > >> >> > implementation for metadata coordination that could handle these
> > >> cases.
> > >> >> >
> > >> >> > JR7.
> > >> >> > Yes, your expectation is totally reasonable, we should expect
> the get
> > >> >> and
> > >> >> > put operations to be strongly consistent for the read-after-write
> > >> >> > scenarios. And I think that since major cloud providers give
> strongly
> > >> >> > consistent object storages, it should be sufficient for a wide
> > >> >> user-group.
> > >> >> > So we could shrink the scope of the KIP a bit this way and avoid
> > >> adding
> > >> >> > complexity that is needed mostly on the margin.
> > >> >> > I can expect though that "list" can stay eventually consistent
> as the
> > >> >> KIP
> > >> >> > relies on it for only garbage collection where it is fine if a
> few
> > >> >> segments
> > >> >> > can be collected only in the next iteration.
> > >> >> >
> > >> >> > JR3.
> > >> >> > Since Greg hasn't replied yet, I'll try to catch up with him and
> > >> >> formulate
> > >> >> > an answer next week.
> > >> >> >
> > >> >> > Best,
> > >> >> > Viktor
> > >> >> >
> > >> >> > On Tue, Apr 21, 2026 at 8:16 PM Jun Rao via dev <
> > >> [email protected]>
> > >> >> > wrote:
> > >> >> >
> > >> >> >> Hi, Viktor,
> > >> >> >>
> > >> >> >> Thanks for the reply.
> > >> >> >>
> > >> >> >> JR1.
> > >> >> >> 1)  "So while it seems to be significant that we tripled the
> number
> > >> of
> > >> >> >> PUTs, cost-wise it doesn't seem to be significant."
> > >> >> >> Let's compare the savings achieved by replacing network
> replication
> > >> >> >> transfer with S3 puts in AWS.
> > >> >> >> network transfer cost: $0.02/GB = $2 * 10^-5/MB
> > >> >> >> S3 put cost: $0.005 per 1000 requests = $0.5 * 10^-5/request
> > >> >> >>
> > >> >> >> The KIP batches data up to 4MB. So, let's assume that we write
> 2MB
> > >> S3
> > >> >> >> objects on average.
> > >> >> >>
> > >> >> >> The cost for transferring 2MB through the network is 2 * 2 *
> 10^-5 =
> > >> >> $4*
> > >> >> >> 10^-5
> > >> >> >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The
> savings
> > >> >> are
> > >> >> >> about 75%.
> > >> >> >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The
> savings
> > >> >> are
> > >> >> >> 25%. As you can see, the savings are significantly lower.
> > >> >> >>
> > >> >> >> 2) "Therefore we could expect classic local segments to be
> present
> > >> >> which
> > >> >> >> could be used for catching up consumers."
> > >> >> >> Note that local storage could be lost on reassigned partitions.
> In
> > >> that
> > >> >> >> case, lagging reads can only be served from the object store.
> > >> >> >>
> > >> >> >> "Regarding the amount of metadata: 2MB/sec is well below the
> 2GB/s
> > >> >> >> throughput that Greg calculated previously, so I think it
> should be
> > >> >> >> manageable for a cluster with that amount of throughput,"
> > >> >> >> It seems that you didn't make the correct comparison. 2GB/s that
> > >> Greg
> > >> >> >> mentioned is the throughput for the whole cluster. The 2MB/sec I
> > >> >> quoted is
> > >> >> >> for a specific broker. Depending on the broker instance type, a
> > >> broker
> > >> >> may
> > >> >> >> only be able to handle low 10s of MB/sec of data. So, 2MB/sec
> > >> overhead
> > >> >> is
> > >> >> >> significant.
> > >> >> >>
> > >> >> >> 3) "I'd separate it from the discussion of diskless core and
> > >> perhaps we
> > >> >> >> could address it in a separate KIP as it is mostly a redesign
> of the
> > >> >> >> RLMM."
> > >> >> >> Those problems don't exist in the existing usage of RLMM. They
> > >> manifest
> > >> >> >> because diskless tries to use RLMM in a way it wasn't designed
> for
> > >> >> (there
> > >> >> >> is at least a 20X increase in metadata). It would be useful to
> > >> consider
> > >> >> >> whether fixing those problems in RLMM or using a new approach is
> > >> >> >> better. For example, KIP-1164 already introduces a snapshotting
> > >> >> mechanism.
> > >> >> >> Adding another snapshotting mechanism to RLMM seems redundant.
> > >> >> >>
> > >> >> >> JR7. A typical object store supports 3 operations: puts, gets
> and
> > >> >> lists.
> > >> >> >> Which operations used by diskless can be eventually consistent?
> I'd
> > >> >> expect
> > >> >> >> that get should always see the result of the latest put.
> > >> >> >>
> > >> >> >> Jun
> > >> >> >>
> > >> >> >> On Mon, Apr 20, 2026 at 8:14 AM Viktor Somogyi-Vass <
> > >> [email protected]
> > >> >> >
> > >> >> >> wrote:
> > >> >> >>
> > >> >> >> > Hi Jun,
> > >> >> >> >
> > >> >> >> > I'd like to add my thoughts too until Greg has time to
> respond.
> > >> >> >> >
> > >> >> >> > JR1. I also think there are shortcomings in the current tiered
> > >> >> storage
> > >> >> >> > design, around the RLMM.
> > >> >> >> > 1) I think this is a correct observation, however if my
> > >> calculations
> > >> >> are
> > >> >> >> > correct, it actually comes down to a negligible amount of
> cost.
> > >> >> Taking
> > >> >> >> the
> > >> >> >> > AWS pricing sheet at
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
https://aws.amazon.com/s3/pricing/?nc2=h_pr_s3&trk=aebc39a1-139c-43bb-8354-211ac811b83a&sc_channel=ps
> > >> >> >> > it seems like the difference between 6 or 2 PUTs per second is
> > >> ~$52
> > >> >> for
> > >> >> >> a
> > >> >> >> > month. The calculation follows
> > >> >> >> > as: 6*60*60*24*30*0.005/1000-2*60*60*24*30*0.005/1000=$51.84.
> So
> > >> >> while
> > >> >> >> it
> > >> >> >> > seems to be significant that we tripled the number of PUTs,
> > >> >> cost-wise it
> > >> >> >> > doesn't seem to be significant.
> > >> >> >> > 2) Reflecting to your original problem: the tiered storage
> > >> >> consolidation
> > >> >> >> > process should be continuously running and transforming WAL
> > >> segments
> > >> >> >> into
> > >> >> >> > classic logs. Therefore we could expect classic local
> segments to
> > >> be
> > >> >> >> > present which could be used for catching up consumers. So they
> > >> would
> > >> >> >> only
> > >> >> >> > switch to WAL reading when they're close to the end of the
> log.
> > >> Since
> > >> >> >> this
> > >> >> >> > offset space should be cached, the reads from there should be
> > >> fast.
> > >> >> >> > Regarding the amount of metadata: 2MB/sec is well below the
> 2GB/s
> > >> >> >> > throughput that Greg calculated previously, so I think it
> should
> > >> be
> > >> >> >> > manageable for a cluster with that amount of throughput,
> although
> > >> I
> > >> >> >> agree
> > >> >> >> > with your comment that the current topic based tiered metadata
> > >> >> manager
> > >> >> >> > isn't optimal and we could develop a better solution.
> > >> >> >> > 3) Tied to the previous point, I agree that your comments are
> > >> >> absolutely
> > >> >> >> > valid, however similarly to that, I'd separate it from the
> > >> >> discussion of
> > >> >> >> > diskless core and perhaps we could address it in a separate
> KIP as
> > >> >> it is
> > >> >> >> > mostly a redesign of the RLMM.
> > >> >> >> >
> > >> >> >> > JR2. Ack. We will raise a KIP in the near future.
> > >> >> >> >
> > >> >> >> > JR3. I'd leave answering this to Greg as I don't have too much
> > >> >> context
> > >> >> >> on
> > >> >> >> > this one.
> > >> >> >> >
> > >> >> >> > JR7. I think this could be similar to the tiered storage
> design,
> > >> so
> > >> >> any
> > >> >> >> > coordinator operation should be strongly consistent (since
> we're
> > >> >> using
> > >> >> >> > classic topics there). Therefore the WAL segment storage layer
> > >> could
> > >> >> be
> > >> >> >> > eventually consistent as we store its metadata in a strongly
> > >> >> consistent
> > >> >> >> > manner. I'm not sure though if this was the answer you're
> looking
> > >> >> for?
> > >> >> >> >
> > >> >> >> > Best,
> > >> >> >> > Viktor
> > >> >> >> >
> > >> >> >> >
> > >> >> >> >
> > >> >> >> > On Thu, Mar 26, 2026 at 11:43 PM Jun Rao via dev <
> > >> >> [email protected]>
> > >> >> >> > wrote:
> > >> >> >> >
> > >> >> >> >> Hi, Greg,
> > >> >> >> >>
> > >> >> >> >> Thanks for the reply.
> > >> >> >> >>
> > >> >> >> >> JR1. Rolling log segments every 15 minutes addresses the 3
> > >> concerns
> > >> >> I
> > >> >> >> >> listed, but it introduces some new issues because it doesn't
> > >> quite
> > >> >> fit
> > >> >> >> the
> > >> >> >> >> design of the current tiered storage. (a) The current tiered
> > >> storage
> > >> >> >> >> design
> > >> >> >> >> stores a single partition per object. If we roll a log
> segment
> > >> >> every 15
> > >> >> >> >> minutes, with 4K partitions per broker, this means an
> additional
> > >> 4
> > >> >> S3
> > >> >> >> puts
> > >> >> >> >> per second. The diskless design aims for 2 S3 puts per
> second.
> > >> So,
> > >> >> this
> > >> >> >> >> triples the S3 put cost and reduces the savings benefits. (b)
> > >> With
> > >> >> Tier
> > >> >> >> >> storage, each broker essentially needs to read the tier
> metadata
> > >> >> from
> > >> >> >> all
> > >> >> >> >> tier metadata partitions if the number of user partitions
> exceeds
> > >> >> 50.
> > >> >> >> >> Assuming that we generate 100 bytes of tier metadata per
> > >> partition
> > >> >> >> every
> > >> >> >> >> 15
> > >> >> >> >> minutes. Assuming that each broker has 4K partitions and a
> > >> cluster
> > >> >> of
> > >> >> >> 500
> > >> >> >> >> brokers. Each broker needs to receive tier metadata at a
> rate of
> > >> >> 100 *
> > >> >> >> 4K
> > >> >> >> >> *
> > >> >> >> >> 500 / (15 * 60) = 200KB/Sec. For a broker hosting one of the
> 50
> > >> tier
> > >> >> >> >> metadata topic partitions, it needs to send out metadata at
> 100 *
> > >> >> 4K *
> > >> >> >> 500
> > >> >> >> >> / 50 * 500 / (15 * 60) = 2MB/Sec. This increases unnecessary
> > >> network
> > >> >> >> and
> > >> >> >> >> CPU overhead. (c) Tier storage doesn't support snapshots. A
> > >> >> restarted
> > >> >> >> >> broker needs to replay the tier metadata log from the
> beginning
> > >> to
> > >> >> >> build
> > >> >> >> >> the tier metadata state. Suppose that the tier metadata log
> is
> > >> kept
> > >> >> >> for 7
> > >> >> >> >> days. The total amount of tier metadata that needs to be
> > >> replayed is
> > >> >> >> 200KB
> > >> >> >> >> * 7 * 24 * 3600 = 120GB.
> > >> >> >> >> Does the merging optimization you mentioned address those new
> > >> >> >> concerns? If
> > >> >> >> >> so, could you describe how it works?
> > >> >> >> >>
> > >> >> >> >> JR2. It's fine to cover the default partition assignment
> strategy
> > >> >> for
> > >> >> >> >> diskless topics in a separate KIP. However, since this is
> > >> essential
> > >> >> for
> > >> >> >> >> achieving the cost saving goal, we need a solution before
> > >> releasing
> > >> >> the
> > >> >> >> >> diskless KIP.
> > >> >> >> >>
> > >> >> >> >> JR3. Sounds good. Could you document how this work?
> > >> >> >> >>
> > >> >> >> >> JR7. Could you describe which parts of the operation can be
> > >> >> eventually
> > >> >> >> >> consistent?
> > >> >> >> >>
> > >> >> >> >> Jun
> > >> >> >> >>
> > >> >> >> >> On Thu, Mar 19, 2026 at 1:35 PM Greg Harris <
> > >> [email protected]>
> > >> >> >> wrote:
> > >> >> >> >>
> > >> >> >> >> > Hi Jun,
> > >> >> >> >> >
> > >> >> >> >> > Thanks for your comments!
> > >> >> >> >> >
> > >> >> >> >> > JR1:
> > >> >> >> >> > You are correct that the segment rolling configurations are
> > >> >> currently
> > >> >> >> >> > critical to balance the scalability of Diskless and Tiered
> > >> >> Storage,
> > >> >> >> as
> > >> >> >> >> > larger roll configurations benefit tiered storage, and
> smaller
> > >> >> roll
> > >> >> >> >> > configurations benefit Diskless.
> > >> >> >> >> >
> > >> >> >> >> > To address your points specifically:
> > >> >> >> >> > (1) A Diskless topic which is cost-competitive with an
> > >> equivalent
> > >> >> >> >> Classic
> > >> >> >> >> > topic will have a metadata size <1% of the data size. A
> cluster
> > >> >> >> storing
> > >> >> >> >> > 360GB of metadata will have >36TB of data under management
> and
> > >> a
> > >> >> >> >> retention
> > >> >> >> >> > of 5hr implies a throughput of >2GB/s. This will require
> > >> multiple
> > >> >> >> >> Diskless
> > >> >> >> >> > coordinators, which can share the load of storing the
> Diskless
> > >> >> >> metadata,
> > >> >> >> >> > and serving Diskless requests.
> > >> >> >> >> > (2) Catching up consumers are intended to be served from
> tiered
> > >> >> >> storage
> > >> >> >> >> > and local segment caches. Brokers which are building their
> > >> local
> > >> >> >> segment
> > >> >> >> >> > caches will have to read many files, but will amortize
> those
> > >> >> reads by
> > >> >> >> >> > receiving data for multiple partitions in a single read.
> > >> >> >> >> > (3) This is a fundamental downside of storing data from
> > >> multiple
> > >> >> >> topics
> > >> >> >> >> in
> > >> >> >> >> > a single object, similar to classic segments. We can
> implement
> > >> a
> > >> >> >> >> > configurable cluster-wide maximum roll time, which would
> set
> > >> the
> > >> >> >> slowest
> > >> >> >> >> > cadence at which Tiered Storage segments are rolled from
> > >> Diskless
> > >> >> >> >> segments.
> > >> >> >> >> > If an individual partition has more aggressive roll
> settings,
> > >> it
> > >> >> may
> > >> >> >> be
> > >> >> >> >> > rolled earlier.
> > >> >> >> >> > This configuration would permit the cluster operator to
> > >> >> approximately
> > >> >> >> >> > bound the number of diskless WAL segments, which bounds the
> > >> total
> > >> >> >> size
> > >> >> >> >> of
> > >> >> >> >> > the WAL segments, disk cache, diskless coordinator state,
> and
> > >> >> >> excessive
> > >> >> >> >> > retention window. For example, a diskless.segment.ms of 15
> > >> >> >> >> > minutes would reduce the metadata storage to 18GB, WAL
> > >> >> >> >> > segments to 1.8TB, and permit short-retention data to be
> > >> >> >> >> > physically deleted as soon as ~15 minutes after being
> > >> >> >> >> > produced.
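The 15-minute figures above follow from the earlier 5-hour baseline, assuming metadata and WAL footprint scale linearly with the roll interval (an editorial sketch, not part of the original mail):

```python
# Scaling the 5 h baseline (360 GB metadata, 36 TB of WAL data) down
# to a 15-minute diskless.segment.ms, assuming linear scaling with
# the roll interval.
baseline_roll_s = 5 * 3600
short_roll_s = 15 * 60

metadata_gb = 360 * short_roll_s / baseline_roll_s   # from 360 GB
wal_tb = 36 * short_roll_s / baseline_roll_s         # from 36 TB

print(f"metadata: {metadata_gb:.0f} GB, WAL segments: {wal_tb:.1f} TB")
# metadata: 18 GB, WAL segments: 1.8 TB
```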
> > >> >> >> >> > Of course, this will reduce the size of the tiered storage
> > >> >> >> >> > segments for topics that have low throughput and where
> > >> >> >> >> > segment.ms > diskless.segment.ms, increasing overhead in the
> > >> >> >> >> > RLMM. We can perform merging/optimization of Tiered Storage
> > >> >> >> >> > segments to achieve the per-topic segment.ms.
> > >> >> >> >> > There were some reasons why we retracted the prior
> file-merging
> > >> >> >> >> approach,
> > >> >> >> >> > and why merging in tiered storage appears better:
> > >> >> >> >> > * Rewriting files requires mutability for existing data,
> > >> >> >> >> > which adds complexity. Diskless batches or Remote Log
> > >> >> >> >> > Segments would need to be made mutable, and the remote log
> > >> >> >> >> > will be made mutable by KIP-1272 [1].
> > >> >> >> >> > * Because a WAL Segment can contain batches from multiple
> > >> Diskless
> > >> >> >> >> > Coordinators, multiple coordinators must also be involved
> in
> > >> the
> > >> >> >> merging
> > >> >> >> >> > step. The Tiered Storage design has exclusive ownership for
> > >> remote
> > >> >> >> log
> > >> >> >> >> > segments within the RLMM.
> > >> >> >> >> > * Diskless file merging competes for resources with
> > >> >> latency-sensitive
> > >> >> >> >> > producers and hot consumers. Tiered storage file merging
> > >> competes
> > >> >> for
> > >> >> >> >> > resources with lagging consumers, which are typically less
> > >> latency
> > >> >> >> >> > sensitive.
> > >> >> >> >> > * Implementing merging in Tiered Storage allows this
> > >> optimization
> > >> >> to
> > >> >> >> >> > benefit both classic topics and diskless topics, covering
> both
> > >> >> high
> > >> >> >> and
> > >> >> >> >> low
> > >> >> >> >> > throughput partitions.
> > >> >> >> >> > * Remote log segments may be optimized over much longer
> time
> > >> >> windows
> > >> >> >> >> > rather than performing optimization once in the first few
> > >> hours of
> > >> >> >> the
> > >> >> >> >> life
> > >> >> >> >> > of a WAL segment and then freezing the arrangement of the
> data
> > >> >> until
> > >> >> >> it
> > >> >> >> >> is
> > >> >> >> >> > deleted.
> > >> >> >> >> > * File merging will need to rely on heuristics, which
> should be
> > >> >> >> >> > configurable by the user. Multi-partition heuristics are
> more
> > >> >> >> >> complicated
> > >> >> >> >> > to describe and reason about than single-partition
> heuristics.
> > >> >> >> >> > What do you think of this alternative?
> > >> >> >> >> >
> > >> >> >> >> > JR2:
> > >> >> >> >> > Yes, the current default partition assignment strategy will
> > >> need
> > >> >> some
> > >> >> >> >> > improvement. This problem with Diskless WAL segments is
> > >> analogous
> > >> >> to
> > >> >> >> the
> > >> >> >> >> > Classic topics’ dense inter-broker connection graph.
> > >> >> >> >> > The natural solution to this seems to be some sort of
> cellular
> > >> >> >> design,
> > >> >> >> >> > where the replica placements tend to locate partitions in
> > >> similar
> > >> >> >> >> groups.
> > >> >> >> >> > Partitions in the same cell can generally share the same
> WAL
> > >> >> Segments
> > >> >> >> >> and
> > >> >> >> >> > the same Diskless Coordinator requests. This would also
> benefit
> > >> >> >> Classic
> > >> >> >> >> > topics, which would need fewer connections and fetch
> requests.
> > >> >> >> >> > Such a feature is out-of-scope of this KIP, and either we
> will
> > >> >> >> publish a
> > >> >> >> >> > follow-up KIP, or let operators and community tooling
> address
> > >> >> this.
> > >> >> >> >> >
> > >> >> >> >> > JR3:
> > >> >> >> >> > Yes we will replace the ISR/ELR election logic for diskless
> > >> >> topics,
> > >> >> >> as
> > >> >> >> >> > they no longer rely on replicas for data integrity. We will
> > >> fully
> > >> >> >> model
> > >> >> >> >> the
> > >> >> >> >> > state/lifecycle of the diskless replicas in KRaft, and
> choose
> > >> how
> > >> >> we
> > >> >> >> >> > display this to clients.
> > >> >> >> >> > For backwards compatibility, clients using older metadata
> > >> requests
> > >> >> >> >> should
> > >> >> >> >> > see diskless topics, but interpret them as classic topics.
> We
> > >> >> could
> > >> >> >> tell
> > >> >> >> >> > older clients that the leader is in the ISR, even if it
> just
> > >> >> started
> > >> >> >> >> > building its cache.
> > >> >> >> >> > For clients using the latest metadata, they should see the
> true
> > >> >> >> state of
> > >> >> >> >> > the diskless partition: which nodes can accept
> > >> >> >> produce/fetch/sharefetch
> > >> >> >> >> > requests, which ranges of offsets are cached on-broker,
> etc.
> > >> This
> > >> >> >> could
> > >> >> >> >> > also be used to break apart the “leader” field into more
> > >> granular
> > >> >> >> >> fields,
> > >> >> >> >> > now that leadership has changed meaning.
> > >> >> >> >> >
> > >> >> >> >> > JR4:
> > >> >> >> >> > Yes, we can replace the empty fetch requests to the leader
> > >> nodes
> > >> >> with
> > >> >> >> >> > cache hint fields in the requests to the Diskless
> Coordinator,
> > >> and
> > >> >> >> rely
> > >> >> >> >> on
> > >> >> >> >> > the coordinator to distribute cache hints to all replicas.
> This
> > >> >> >> should
> > >> >> >> >> be
> > >> >> >> >> > low-overhead, and eliminate the inter-broker communication
> for
> > >> >> >> brokers
> > >> >> >> >> > which only host Diskless topics.
> > >> >> >> >> >
> > >> >> >> >> > JR5.1:
> > >> >> >> >> > You are correct and this text was ambiguous, only
> specifying
> > >> that
> > >> >> the
> > >> >> >> >> > controller waits for the sync to be complete. This section
> is
> > >> now
> > >> >> >> >> updated
> > >> >> >> >> > to explicitly say that local segments are built from object
> > >> >> storage.
> > >> >> >> >> >
> > >> >> >> >> > JR5.2:
> > >> >> >> >> > Extending the JR2 discussion, reassignment of diskless
> topics
> > >> >> would
> > >> >> >> >> > generally happen within a cell, where the marginal cost of
> > >> >> reading an
> > >> >> >> >> > additional partition is very low. When cells are
> re-balanced
> > >> and a
> > >> >> >> >> > partition is migrated between cells, there is a brief time
> > >> (until
> > >> >> the
> > >> >> >> >> next
> > >> >> >> >> > Tiered Storage segment roll) when the marginal cost is
> doubled.
> > >> >> This
> > >> >> >> >> should
> > >> >> >> >> > be infrequent and well-amortized by other topics which
> aren’t
> > >> >> being
> > >> >> >> >> > re-balanced between cells.
> > >> >> >> >> >
> > >> >> >> >> > JR6.1:
> > >> >> >> >> > We plan to move data from Diskless to Tiered Storage. Once
> the
> > >> >> data
> > >> >> >> is
> > >> >> >> >> in
> > >> >> >> >> > Tiered Storage, it can be compacted using the functionality
> > >> >> >> described in
> > >> >> >> >> > KIP-1272 [1]
> > >> >> >> >> >
> > >> >> >> >> > JR6.2:
> > >> >> >> >> > We will add details for this soon.
> > >> >> >> >> >
> > >> >> >> >> > JR7:
> > >> >> >> >> > We specify the requirement of eventual consistency to allow
> > >> >> Diskless
> > >> >> >> >> > Topics to be used with other object storage implementations
> > >> which
> > >> >> >> aren’t
> > >> >> >> >> > the three major public clouds, such as self-managed
> software or
> > >> >> >> weaker
> > >> >> >> >> > consistency caches.
> > >> >> >> >> >
> > >> >> >> >> > Thanks,
> > >> >> >> >> > Greg
> > >> >> >> >> >
> > >> >> >> >> > [1]
> > >> >> >> >> >
> > >> >> >> >>
> > >> >> >>
> > >> >>
> > >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage
> > >> >> >> >> >
> > >> >> >> >> > On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev <
> > >> >> [email protected]
> > >> >> >> >
> > >> >> >> >> > wrote:
> > >> >> >> >> >
> > >> >> >> >> >> Hi, Ivan,
> > >> >> >> >> >>
> > >> >> >> >> >> Thanks for the KIP. A few comments below.
> > >> >> >> >> >>
> > >> >> >> >> >> JR1. I am concerned about the usage of the current tiered
> > >> >> storage to
> > >> >> >> >> >> control the number of small WAL files. Current tiered
> storage
> > >> >> only
> > >> >> >> >> tiers
> > >> >> >> >> >> the data when a segment rolls, which can take hours. This
> > >> causes
> > >> >> >> three
> > >> >> >> >> >> problems. (1) Much more metadata needs to be stored and
> > >> >> maintained,
> > >> >> >> >> which
> > >> >> >> >> >> increases the cost. Suppose that each segment rolls every
> 5
> > >> >> hours,
> > >> >> >> each
> > >> >> >> >> >> partition generates 2 WAL files per second and each WAL
> file's
> > >> >> >> metadata
> > >> >> >> >> >> takes 100 bytes. Each partition will generate 5 * 3.6K *
> 2 *
> > >> 100
> > >> >> =
> > >> >> >> >> 3.6MB
> > >> >> >> >> >> of
> > >> >> >> >> >> metadata. In a cluster with 100K partitions, this
> translates
> > >> to
> > >> >> >> 360GB
> > >> >> >> >> of
> > >> >> >> >> >> metadata stored on the diskless coordinators. (2) A
> > >> catching-up
> > >> >> >> >> consumer's
> > >> >> >> >> >> performance degrades since it's forced to read data from
> many
> > >> >> small
> > >> >> >> WAL
> > >> >> >> >> >> files. (3) The data in WAL files could be retained much
> longer
> > >> >> than
> > >> >> >> >> >> retention time. Since the small WAL files aren't completely
> > >> >> >> >> >> deleted until all partitions' data in them is obsolete, the
> > >> >> >> >> >> deletion of
> the
> > >> WAL
> > >> >> >> files
> > >> >> >> >> >> could be delayed by hours or more. If the WAL file
> includes a
> > >> >> >> partition
> > >> >> >> >> >> with a low retention time, the retention contract could be
> > >> >> violated
> > >> >> >> >> >> significantly. The earlier design of the KIP included a
> > >> separate
> > >> >> >> object
> > >> >> >> >> >> merging process that combines small WAL files much more
> > >> >> aggressively
> > >> >> >> >> than
> > >> >> >> >> >> tiered storage, which seems to be a much better choice.
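The metadata estimate in (1) above can be reproduced with a short sketch (editorial; all inputs are the figures quoted in the mail):

```python
# Reproducing the JR1 (1) estimate: with a 5 h segment roll, each
# partition accumulates WAL-file metadata until its segment rolls.
roll_window_s = 5 * 3600      # 5 hours until a segment rolls
wal_files_per_s = 2           # WAL files generated per second per partition
bytes_per_entry = 100         # metadata bytes per WAL file
partitions = 100_000          # cluster-wide partition count

per_partition = roll_window_s * wal_files_per_s * bytes_per_entry
total = per_partition * partitions

print(f"per partition: {per_partition / 1e6:.1f} MB")  # 3.6 MB
print(f"cluster total: {total / 1e9:.0f} GB")          # 360 GB
```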
> > >> >> >> >> >>
> > >> >> >> >> >> JR2. I don't think the current default partition
> assignment
> > >> >> strategy
> > >> >> >> >> for
> > >> >> >> >> >> classic topics works for diskless topics. Current strategy
> > >> tries
> > >> >> to
> > >> >> >> >> spread
> > >> >> >> >> >> the replicas to as many brokers as possible. For example,
> if a
> > >> >> >> broker
> > >> >> >> >> has
> > >> >> >> >> >> 100 partitions, their replicas could be spread over 100
> > >> brokers.
> > >> >> If
> > >> >> >> the
> > >> >> >> >> >> broker generates a WAL file with 100 partitions, this WAL
> file
> > >> >> will
> > >> >> >> be
> > >> >> >> >> >> read
> > >> >> >> >> >> 100 times, once by each broker. S3 read cost is 1/12 of
> the
> > >> cost
> > >> >> of
> > >> >> >> S3
> > >> >> >> >> >> put.
> > >> >> >> >> >> This assignment strategy will increase the S3 cost by
> about
> > >> 8X,
> > >> >> >> which
> > >> >> >> >> is
> > >> >> >> >> >> prohibitive. We need to design a cost effective assignment
> > >> >> strategy
> > >> >> >> for
> > >> >> >> >> >> diskless topics.
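The ~8X figure above follows from the quoted GET/PUT price ratio (an editorial sketch; the 1/12 ratio and the 100 readers are the numbers from the mail):

```python
# The JR2 cost estimate: one PUT per WAL file, then one GET per broker
# that holds a replica of some partition in the file. GET is priced at
# ~1/12 of a PUT, the ratio quoted in the mail.
put_cost = 1.0                 # normalize a PUT to cost 1
get_cost = put_cost / 12
readers = 100                  # replicas spread over 100 brokers

read_cost = readers * get_cost
total = put_cost + read_cost

print(f"reads alone cost ~{read_cost:.1f}x the PUT")   # ~8.3x
print(f"total vs PUT-only: ~{total:.1f}x")             # ~9.3x
```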
> > >> >> >> >> >>
> > >> >> >> >> >> JR3. We need to think through the leader election logic
> > >> >> >> >> >> with diskless topics.
> > >> >> >> >> >> The KIP tries to reuse the ISR logic for classic topics,
> > >> >> >> >> >> but it doesn't seem very natural.
> > >> >> >> >> >> JR3.1 In classic topics, the leader is always in the ISR.
> > >> >> >> >> >> In diskless topics, the KIP says that a leader could be
> > >> >> >> >> >> out of sync.
> > >> >> >> >> >> JR3.2 The existing leader election logic based on ISR/ELR
> > >> >> >> >> >> mainly tries to preserve previously acknowledged data.
> > >> >> >> >> >> With diskless topics, since the object store provides
> > >> >> >> >> >> durability, this logic seems no longer needed. The
> > >> >> >> >> >> existing min.isr and unclean leader election logic also
> > >> >> >> >> >> don't apply.
> > >> >> >> >> >>
> > >> >> >> >> >> JR4. "Despite that there is no inter-broker replication,
> > >> replicas
> > >> >> >> will
> > >> >> >> >> >> still issue FetchRequest to leaders. Leaders will respond
> with
> > >> >> empty
> > >> >> >> >> (no
> > >> >> >> >> >> records) FetchResponse."
> > >> >> >> >> >> This seems unnatural. Could we avoid issuing inter-broker
> > >> >> >> >> >> fetch requests for diskless topics?
> > >> >> >> >> >>
> > >> >> >> >> >> JR5. "The replica reassignment will follow the same flow
> as in
> > >> >> >> classic
> > >> >> >> >> >> topic:".
> > >> >> >> >> >> JR5.1 Is this true? Since the inter-broker fetch response
> > >> >> >> >> >> is always empty, it doesn't seem that the current
> > >> >> >> >> >> reassignment flow works for diskless topics. Also,
> > >> >> >> >> >> since the source of the data is object store, it seems
> more
> > >> >> natural
> > >> >> >> >> for a
> > >> >> >> >> >> replica to back fill the data from the object store,
> instead
> > >> of
> > >> >> >> other
> > >> >> >> >> >> replicas. This will also incur lower costs.
> > >> >> >> >> >> JR5.2 How do we prevent reassignment on diskless topics
> from
> > >> >> causing
> > >> >> >> >> the
> > >> >> >> >> >> same cost issue described in JR2?
> > >> >> >> >> >>
> > >> >> >> >> >> JR6." In other functional aspects, diskless topics are
> > >> >> >> >> indistinguishable
> > >> >> >> >> >> from classic topics. This includes durability guarantees,
> > >> >> ordering
> > >> >> >> >> >> guarantees, transactional and non-transactional producer
> API,
> > >> >> >> consumer
> > >> >> >> >> >> API,
> > >> >> >> >> >> consumer groups, share groups, data retention (deletion &
> > >> >> compact),"
> > >> >> >> >> >> JR6.1 Could you describe how compacted diskless topics are
> > >> >> >> >> >> supported?
> > >> >> >> >> >> JR6.2 Neither this KIP nor KIP-1164 describes the
> > >> transactional
> > >> >> >> >> support in
> > >> >> >> >> >> detail.
> > >> >> >> >> >>
> > >> >> >> >> >> JR7. "Object Storage: A shared, durable, concurrent, and
> > >> >> eventually
> > >> >> >> >> >> consistent storage supporting arbitrary sized byte values
> and
> > >> a
> > >> >> >> minimal
> > >> >> >> >> >> set
> > >> >> >> >> >> of atomic operations: put, delete, list, and ranged get."
> > >> >> >> >> >> It seems that the object storage in all three major public
> > >> >> >> >> >> clouds is strongly consistent.
> > >> >> >> >> >>
> > >> >> >> >> >> Jun
> > >> >> >> >> >>
> > >> >> >> >> >> On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko <
> [email protected]
> > >> >
> > >> >> >> wrote:
> > >> >> >> >> >>
> > >> >> >> >> >> > Hi all,
> > >> >> >> >> >> >
> > >> >> >> >> >> > The parent KIP-1150 was voted for and accepted. Let's
> now
> > >> >> focus on
> > >> >> >> >> the
> > >> >> >> >> >> > technical details presented in this KIP-1163 and also in
> > >> >> KIP-1164:
> > >> >> >> >> >> Diskless
> > >> >> >> >> >> > Coordinator  [1].
> > >> >> >> >> >> >
> > >> >> >> >> >> > Best,
> > >> >> >> >> >> > Ivan
> > >> >> >> >> >> >
> > >> >> >> >> >> > [1]
> > >> >> >> >> >> >
> > >> >> >> >> >>
> > >> >> >> >>
> > >> >> >>
> > >> >>
> > >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator
> > >> >> >> >> >> >
> > >> >> >> >> >> > On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:
> > >> >> >> >> >> > > Hi all!
> > >> >> >> >> >> > >
> > >> >> >> >> >> > > We want to start the discussion thread for KIP-1163:
> > >> Diskless
> > >> >> >> Core
> > >> >> >> >> >> [1],
> > >> >> >> >> >> > which is a sub-KIP for KIP-1150 [2].
> > >> >> >> >> >> > >
> > >> >> >> >> >> > > Let's use the main KIP-1150 discuss thread [3] for
> > >> high-level
> > >> >> >> >> >> questions,
> > >> >> >> >> >> > motivation, and general direction of the feature and
> this
> > >> >> thread
> > >> >> >> for
> > >> >> >> >> >> > particular details of implementation.
> > >> >> >> >> >> > >
> > >> >> >> >> >> > > Best,
> > >> >> >> >> >> > > Ivan
> > >> >> >> >> >> > >
> > >> >> >> >> >> > > [1]
> > >> >> >> >> >> >
> > >> >> >> >> >>
> > >> >> >> >>
> > >> >> >>
> > >> >>
> > >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
> > >> >> >> >> >> > > [2]
> > >> >> >> >> >> >
> > >> >> >> >> >>
> > >> >> >> >>
> > >> >> >>
> > >> >>
> > >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > >> >> >> >> >> > > [3]
> > >> >> >> >>
> https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d
> > >> >> >> >> >> >
> > >> >> >> >> >>
> > >> >> >> >> >
> > >> >> >> >>
> > >> >> >> >
> > >> >> >>
> > >> >> >
> > >> >>
> > >> >
> > >>
> > >
>
