Re: [DISCUSS] KIP-1163: Diskless Core

Greg Harris via dev Fri, 19 Jun 2026 11:01:47 -0700

Hi Jun,

> That seems more complex to me than managing all metadata in a single
> component.


Having multiple components here also has benefits:
1. One of the components is already built, and is sufficient for Classic
use cases.
2. Different components can have their API and performance optimized to
meet role-specific requirements.

We may find that a monolithic metadata component capable of serving both
hot Diskless and archival roles has to compromise on one to serve the
other, and ends up with a more complex API overall. In our testing, we have found
that lagging consumers add excessive query load and cache pressure to the
Diskless subsystem, while those traffic patterns are very well served by
Tiered Storage.

> It would be useful to think through how things like transactions work.

As I understand it, current Tiered Storage only copies data earlier than
the LSO, in order to simplify reasoning about transactions. We can maintain
that separation, and contain transactional logic to the Diskless
coordinator.
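
A rough sketch of that separation, purely for illustration: the `Segment`
record and the `tierable` helper below are hypothetical names, not Kafka
APIs, and the rule encoded is only my understanding stated above (rolled
segments that end below the LSO are eligible for tiering).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    base_offset: int
    end_offset: int       # offset of the last record in the segment (inclusive)
    active: bool = False  # the segment currently being appended to


def tierable(segments, last_stable_offset):
    """Return rolled segments that lie entirely below the LSO.

    Assumption: every transaction touching such a segment is already
    committed or aborted, so no open-transaction logic is needed once
    the segment has been tiered.
    """
    return [
        s for s in segments
        if not s.active and s.end_offset < last_stable_offset
    ]


segments = [Segment(0, 99), Segment(100, 199), Segment(200, 250, active=True)]
eligible = tierable(segments, last_stable_offset=150)
```

With an LSO of 150, only the first segment is eligible; the second still
overlaps offsets at or above the LSO and stays in the Diskless layer.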

I don't understand why the producer state and transaction index would be
duplicated here. If they're necessary for Classic Tiered topics, I would
expect them to be necessary for Diskless too.

> it would be useful to think through how to migrate all the data, not just
> the tiered portion of the data.

As Diskless and Classic data ages, it will eventually be eligible for
tiering. At that point, the storage will converge to the new storage type
as if no change had occurred.

> If you go down this path [of optimizing small segments in the RLMM]...

Yes, I agree that we would need to look into multi-partition segments and
cutting down on the metadata amplification.

Currently it looks like the design is moving away from storing small
segments in Tiered Storage because these optimizations would be too
invasive, and instead merging segments within the Diskless layer.

I will work to update the KIP with our latest understanding of the role
Tiered Storage will play in the Diskless design.

Thanks,
Greg



On Fri, May 15, 2026, 12:51 PM Jun Rao <[email protected]> wrote:

> Hi, Greg and Viktor,
>
> Thanks for the reply.
>
> "We can build the merging step to optimize WAL segments for more
> predictable rebuild times. But could we still perform a final move to
> Tiered Storage after each partition reaches the configured roll times? We
> could expect the same load/sizing expectations as classic topics (e.g. >1gb
> segments)."
>  In this model, the object metadata is managed in two places: the diskless
> coordinator and RLMM. That seems more complex to me than managing all
> metadata in a single component. It would be useful to think through how
> things like transactions work. I assume the diskless coordinator needs to
> store the producer states and aborted transactions. If that is the case,
> the producer state and transaction index uploaded as part of the tier
> segment seem redundant.
>
> "We are interested in unifying with Tiered Storage for many reasons, but
> also so that topics which have diskless mode dynamically enabled/disabled
> can eventually converge to a predictable state."
> If we want to support dynamically enabling/disabling diskless topic, it
> would be useful to think through how to migrate all the data, not just the
> tiered portion of the data.
>
> "(b) We can manage the RLMM weakness 2 ways:
>   (i) improve the RLMM with snapshotting so it handles smaller log files
> better
>   (ii) merge tiered storage segments with UploadPartCopy-like features or
> with concatenating "on the fly" without using any disk and minimal RAM
> (typically an UploadPartCopy has the same cost as a PUT). Index files need
> to be adjusted though."
> If you go down this path, I think you need to address at least 2
> additional issues: (1) ability to tier multiple partitions in a single
> object (for cost optimization); and (2) avoiding the blind propagation of
> all metadata to every broker.
>
> Jun
>
>
> On Fri, May 15, 2026 at 6:24 AM Viktor Somogyi-Vass <
> [email protected]> wrote:
>
>> Hi All,
>>
>> I would tie JR1 and JR11 together.
>>
>> From Jun:
>> By "the first approach", do you
>> mean aggressive tiering with faster segment rolling through the existing
>> RLMM? I don't think the existing RLMM is designed to solve these issues
>> due
>> to inefficiencies in cost, metadata propagation and metadata storage as we
>> previously discussed.
>>
>> From Satish:
>> RLMM was not designed for aggressive copying of the latest data to
>> tiered storage by having small segment rollouts.
>>
>> From Luke:
>> I personally quite like the idea of delegating the tiny objects merging
>> task to tiered storage.
>> Sadly, there are some drawbacks that Jun pointed out.
>> I agree that if we are using the aggressive tiering object solution, it
>> might de-prioritize or delay progress of the classic tiered storage
>> topics.
>>
>> Sorry, I realize now that "aggressive tiering" was a confusing phrase;
>> I meant solution (A) in my previous email. I was just saying that if we can
>> decouple RLMM from diskless by using classic local logs to cache segments
>> then we should be able to approximate the 87.5% cost saving target
>> relatively well and create a bridge between diskless and tiered logs. Not
>> saying this is the best solution because the RLMM bottleneck would still
>> exist, but it is an option and I think it would be a good basis for an
>> improvement that fixes these shortcomings.
>> My reasons are the following:
>> (a) Using the tiered storage framework has the advantage that existing
>> integrations would fit into the diskless framework, but also it would be
>> possible to switch between topic types. So a classic topic could be
>> reconfigured to have a diskless head and vice versa. This gives the project
>> great flexibility and compatibility with the existing features. Separating
>> diskless storage entirely without data being able to cross this border
>> would ultimately create a competing logging layer inside the project which
>> may not be beneficial in the long term.
>> (b) We can manage the RLMM weakness 2 ways:
>>   (i) improve the RLMM with snapshotting so it handles smaller log files
>> better
>>   (ii) merge tiered storage segments with UploadPartCopy-like features or
>> with concatenating "on the fly" without using any disk and minimal RAM
>> (typically an UploadPartCopy has the same cost as a PUT). Index files need
>> to be adjusted though.
>> (c) cost-wise it seems very similar to diskless merging while having the
>> advantages above.
>>
>> Compared to this, WAL merging, although it might be marginally cheaper,
>> creates a competing log layer with no easy crossing between it and classic
>> logs, and also won't be able to create optimal logs, as merged
>> segments would be mixed (if we just assume a concatenation merging
>> strategy).
>>
>> I wouldn't do both solutions though, I agree with Luke in that one of
>> them ideally would be enough to achieve the read optimization goal,
>> although I can see that if we go with WAL merging, then in the future the
>> need to cross these logging forms may appear which we may get relatively
>> cheaply by improving RLMM to be able to handle this traffic.
>>
>> Thanks,
>> Viktor
>>
>> On Fri, May 15, 2026 at 5:52 AM Luke Chen <[email protected]> wrote:
>>
>>> Hi Greg,
>>>
>>> I personally quite like the idea of delegating the tiny objects merging
>>> task to tiered storage.
>>> Sadly, there are some drawbacks that Jun pointed out.
>>> I agree that if we are using the aggressive tiering object solution, it
>>> might de-prioritize or delay progress of the classic tiered storage
>>> topics.
>>>
>>> > We can build the merging step to optimize WAL segments for more
>>> predictable
>>> rebuild times. But could we still perform a final move to Tiered Storage
>>> after each partition reaches the configured roll times?
>>>
>>> I think you have future use cases in mind.
>>> But it doesn't make sense to merge 500 tiny objects into one big WAL
>>> segment, then get rid of it and upload another copy of the log segment
>>> to remote storage via tiered storage. Maybe you can consider
>>> directly appending new metadata into RLMM to point to the location of the
>>> merged WAL segments?
>>>
>>>
>>> Thank you,
>>> Luke
>>>
>>> On Fri, May 15, 2026 at 5:11 AM Greg Harris via dev <
>>> [email protected]>
>>> wrote:
>>>
>>> > Jun & Satish,
>>> >
>>> > We can build the merging step to optimize WAL segments for more
>>> predictable
>>> > rebuild times. But could we still perform a final move to Tiered
>>> Storage
>>> > after each partition reaches the configured roll times? We could
>>> expect the
>>> > same load/sizing expectations as classic topics (e.g. >1gb segments).
>>> >
>>> > We are interested in unifying with Tiered Storage for many reasons, but
>>> > also so that topics which have diskless mode dynamically
>>> enabled/disabled
>>> > can eventually converge to a predictable state.
>>> >
>>> > Thanks,
>>> > Greg
>>> >
>>> > On Wed, May 13, 2026, 3:56 AM Satish Duggana <[email protected]
>>> >
>>> > wrote:
>>> >
>>> > > RLMM was not designed for aggressive copying of the latest data to
>>> > > tiered storage by having small segment rollouts.
>>> > >
>>> > > +1 to Jun on leaving the existing RLMM for classic topics with tiered
>>> > > storage and having an efficient metadata management system required
>>> > > for diskless topics.
>>> > >
>>> > >
>>> > > On Tue, 12 May 2026 at 23:59, Jun Rao via dev <[email protected]>
>>> > > wrote:
>>> > > >
>>> > > > Hi, Viktor,
>>> > > >
>>> > > > Thanks for the reply.
>>> > > >
>>> > > > JR1. (A) and (B) Yes, your summary matches my thinking.
>>> > > > (C) "Generally I think that (i) (ii) (iii) and (iv) may be
>>> addressed
>>> > with
>>> > > > an aggressive tiered storage consolidation (the first approach)".
>>> > > > Hmm, I am confused by the above statement. By "the first
>>> approach", do
>>> > > you
>>> > > > mean aggressive tiering with faster segment rolling through the
>>> > existing
>>> > > > RLMM? I don't think the existing RLMM is designed to solve these
>>> issues
>>> > > due
>>> > > > to inefficiencies in cost, metadata propagation and metadata
>>> storage as
>>> > > we
>>> > > > previously discussed.
>>> > > >
>>> > > > JR11. I was thinking we leave the existing RLMM as is and continue
>>> to
>>> > use
>>> > > > it for classic topics. We design a new, more efficient metadata
>>> > > management
>>> > > > component independent of RLMM. This new component will be the only
>>> > > metadata
>>> > > > component that diskless topics depend on.
>>> > > >
>>> > > > Jun
>>> > > >
>>> > > > On Tue, May 12, 2026 at 8:43 AM Viktor Somogyi-Vass <
>>> [email protected]
>>> > >
>>> > > > wrote:
>>> > > >
>>> > > > > Hi Jun,
>>> > > > >
>>> > > > > JR1
>>> > > > > (1)-(2)-(3) I'd address these together and let me explain our
>>> current
>>> > > idea
>>> > > > > to solve the tiny object problem because I'm not sure if we're
>>> 100%
>>> > > talking
>>> > > > > about the same thing. I have two approaches in mind for TS
>>> > > consolidation
>>> > > > > ((A) and (B)) and I'm not sure if we're both assuming the same
>>> idea,
>>> > so
>>> > > > > let's clarify this.
>>> > > > >
>>> > > > > (A)
>>> > > > > This is our current assumption. This uses local disks (create
>>> classic
>>> > > > > local logs with UnifiedLog) to consolidate logs into the classic
>>> log
>>> > > format
>>> > > > > and use RSM and RLMM to store them in tiered storage. This way
>>> we're
>>> > > not
>>> > > > > limited by the need to have short rollovers. Local logs become a
>>> form
>>> > > of
>>> > > > > staging environment to serve reads and accumulate records for
>>> tiered
>>> > > > > storage. This means that:
>>> > > > >  (a) Once a message is consolidated into the classic log format,
>>> we
>>> > can
>>> > > > > use it for serving lagging consumers. Diskless reads should
>>> really be
>>> > > used
>>> > > > > for the head of the log and after a few seconds logs should be
>>> > > consolidated.
>>> > > > >  (b) The real cost is much closer to that 87.5% (and in fact my
>>> > google
>>> > > > > sheet I shared also assumes this model) because we have more
>>> freedom
>>> > in
>>> > > > > choosing the retention parameters of the classic log.
>>> > > > >  (c) Metadata is smaller as we only need to keep diskless
>>> segments
>>> > > until
>>> > > > > the tiered offset surpasses the individual batches' offset.
>>> > > > >  (d) RLMM metadata is also somewhat manageable due to the larger
>>> > > segment
>>> > > > > sizes but it's still possible to run into the metadata explosion
>>> > > problem.
>>> > > > >  (e) It needs to rebuild this local log on reassignment to serve
>>> > > lagging
>>> > > > > consumers effectively, so reassignment is a bit more messy.
>>> > > > >  (f) It's not optimal when partitions have a single replica: on
>>> > > failure we
>>> > > > > can only fall back to diskless mode until the partition is
>>> reassigned
>>> > > to a
>>> > > > > functioning broker.
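
Point (c) above can be sketched as follows. This is only an illustration
under stated assumptions: `BatchRef` and `prune` are hypothetical names, and
`tiered_offset` is assumed to be the highest offset already copied to tiered
storage (inclusive).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchRef:
    object_key: str   # hypothetical key of the WAL object holding the batch
    base_offset: int
    last_offset: int  # inclusive


def prune(batches, tiered_offset):
    """Keep only batch metadata not yet fully covered by tiered storage.

    Assumption: a batch whose last offset is at or below the tiered
    offset can be served from tiered storage, so its diskless metadata
    is no longer needed.
    """
    return [b for b in batches if b.last_offset > tiered_offset]


batches = [
    BatchRef("wal-1", 0, 9),
    BatchRef("wal-2", 10, 19),
    BatchRef("wal-3", 20, 29),
]
remaining = prune(batches, tiered_offset=19)
```

Here only the `wal-3` batch remains; a batch that straddles the tiered
offset is kept until the tiered offset passes its last offset.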
>>> > > > >
>>> > > > > (B)
>>> > > > > Compared to the above there can be an alternative approach,
>>> which is
>>> > to
>>> > > > > consolidate when diskless segments expire (after 15 minutes for
>>> > > instance).
>>> > > > > In that case your points seem to fit better as:
>>> > > > >  (a) we can only use the classic, consolidated logs to serve
>>> lagging
>>> > > > > consumers after they have been tiered
>>> > > > >  (b) to be more efficient with lagging consumers we have to
>>> stick to
>>> > a
>>> > > > > short rollover
>>> > > > >  (c) it's more costly due to the short rollovers
>>> > > > >  (d) the RLMM bottleneck still exists due to the short rollovers
>>> > > > >  (e) it's not given whether we use local disks for transforming
>>> logs
>>> > > as we
>>> > > > > can do it in memory too (which can be ineffective and more
>>> expensive)
>>> > > but
>>> > > > > perhaps a “chunked transfer encoding” that S3 supports or similar
>>> > with
>>> > > > > other providers is a cost effective way. If we know the final size
>>> > > > > in advance, we can upload data in chunks and still get billed for 1 put.
>>> > > > >  (f) more efficient reassignment or failover is cleaner and
>>> faster as
>>> > > > > there isn't a need to rebuild local caches.
>>> > > > >
>>> > > > > (C)
>>> > > > > Apart from the first 2 approaches there is a 3rd, which is WAL
>>> > > merging. To
>>> > > > > understand your points, let me summarize what I could gather so
>>> far
>>> > as
>>> > > > > reasons for WAL merging (and please correct me if I missed
>>> > something):
>>> > > > >  (i) protecting consumer lag: small WAL files create inefficient
>>> > > objects
>>> > > > > for lagging consumers, so larger objects should be more efficient
>>> > > > >  (ii) avoiding the RLMM replay bottleneck: managing small
>>> segments
>>> > with
>>> > > > > RLMM is very inefficient (100s of GB metadata)
>>> > > > >  (iii) reducing batch metadata overhead: merging WAL files may
>>> reduce
>>> > > the
>>> > > > > metadata we need to store, but it depends on the merge algorithm
>>> and
>>> > > how we
>>> > > > > can compact batch data
>>> > > > >  (iv) cost effectiveness: retrieving merged WAL files reduces the
>>> > > number
>>> > > > > of get requests to object storage
>>> > > > >  (v) architectural redundancy with RLMM: ideally we wouldn't
>>> need 2
>>> > > > > solutions to 2 somewhat similar problems (tiered storage and
>>> > diskless)
>>> > > > >
>>> > > > > Generally I think that (i) (ii) (iii) and (iv) may be addressed
>>> with
>>> > an
>>> > > > > aggressive tiered storage consolidation (the first approach), so
>>> the
>>> > > only
>>> > > > > remaining gap would be (v). I also agree that having 2 different
>>> > > solutions
>>> > > > > for metadata handling isn't ideal and perhaps there is a
>>> possibility
>>> > of
>>> > > > > improvement here. It should be possible to redesign RLMM to be
>>> more
>>> > > similar
>>> > > > > to the diskless coordinator or design a common solution.
>>> > > > >
>>> > > > > JR11
>>> > > > > "If we support merging in the diskless coordinator, I wonder how
>>> > useful
>>> > > > > RLMM
>>> > > > > is. It seems simpler to manage all metadata from the object
>>> store in
>>> > a
>>> > > > > single place."
>>> > > > >
>>> > > > > Could you please clarify this a little bit? Do you think that we
>>> > should
>>> > > > > replace the RLMM with a solution that is more similar to the
>>> diskless
>>> > > > > coordinator or deprecate tiered storage altogether in favor of
>>> > > diskless?
>>> > > > > I'm not sure which option you're referring to:
>>> > > > >  (1) Unify tiered storage and diskless under a single storage
>>> layer
>>> > > (and
>>> > > > > possibly deprecate tiered storage in favor of diskless with
>>> merging
>>> > WAL
>>> > > > > segments).
>>> > > > >  (2) Create a smart coordinator instead of RLMM and possibly
>>> unify
>>> > > > > metadata coordination with diskless.
>>> > > > >  (3) Keep tiered storage and diskless separate with their own
>>> > solutions
>>> > > > > for metadata (probably not optimal).
>>> > > > >
>>> > > > > Thanks,
>>> > > > > Viktor
>>> > > > >
>>> > > > > On Fri, May 1, 2026 at 9:08 PM Jun Rao via dev <
>>> [email protected]
>>> > >
>>> > > > > wrote:
>>> > > > >
>>> > > > >> Hi, Viktor and Greg,
>>> > > > >>
>>> > > > >> Thanks for the reply.
>>> > > > >>
>>> > > > >> JR1.
>>> > > > >> 1) Thanks for verifying the cost estimation. I noticed a bug in
>>> my
>>> > > earlier
>>> > > > >> calculation. I estimated the per broker network transfer rate at
>>> > > 2MB/sec.
>>> > > > >> It should be 4MB/sec. If I correct it, the estimated savings are
>>> > > similar
>>> > > > >> to
>>> > > > >> yours.
>>> > > > >> The cost for transferring 4MB through the network is 4 * 2 *
>>> 10^-5 =
>>> > > $8*
>>> > > > >> 10^-5
>>> > > > >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The
>>> savings
>>> > > are
>>> > > > >> about 87.5%.
>>> > > > >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The
>>> savings
>>> > > are
>>> > > > >> 62.5%.
>>> > > > >> Savings are still significantly lower when using RLMM.
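
The arithmetic above can be reproduced with a few lines. This is a sketch
only: the two rates are the AWS figures quoted elsewhere in this thread
($0.02/GB network transfer, $0.005 per 1000 S3 puts), 1 GB is taken as
1000 MB as in that calculation, and the `savings` helper is a hypothetical
name.

```python
NETWORK_COST_PER_MB = 2e-5  # $0.02/GB, taking 1 GB = 1000 MB
S3_PUT_COST = 0.5e-5        # $0.005 per 1000 requests


def savings(mb_transferred, num_puts):
    """Fraction of the replication network cost saved by using S3 puts."""
    network_cost = mb_transferred * NETWORK_COST_PER_MB
    put_cost = num_puts * S3_PUT_COST
    return 1 - put_cost / network_cost


two_puts = savings(4, 2)  # $1e-5 of puts vs. $8e-5 of network transfer
six_puts = savings(4, 6)  # $3e-5 of puts vs. $8e-5 of network transfer
```

For 4MB this gives roughly 0.875 with 2 puts and 0.625 with 6 puts, matching
the 87.5% and 62.5% figures above.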
>>> > > > >>
>>> > > > >> "To me it seems like that Greg's previous suggestion for a 15
>>> min
>>> > > rollover
>>> > > > >> may be a bit too much. With 1 hour we can achieve better cost
>>> saving
>>> > > and
>>> > > > >> less coordinate metadata being stored."
>>> > > > >> This solves the cost issue, but it has other implications (see
>>> point
>>> > > 2)
>>> > > > >> below).
>>> > > > >>
>>> > > > >> 2) "Yes, I think this is to be expected and a lot depends on the
>>> > > > >> implementation. Ideally segments or chunks should be cached to
>>> > > minimize
>>> > > > >> the
>>> > > > >> number of times segments pulled from remote storage."
>>> > > > >> In a classic topic, when a consumer lags, its requests are
>>> served
>>> > > either
>>> > > > >> from the local cache or from large objects in the object store.
>>> With
>>> > > the
>>> > > > >> current design in a diskless topic, lagging consumer requests
>>> might
>>> > be
>>> > > > >> served from tiny 500-byte objects. This will significantly slow
>>> down
>>> > > the
>>> > > > >> consumer's catch-up, which is not expected user behavior.
>>> Ideally,
>>> > we
>>> > > > >> don't
>>> > > > >> want those tiny objects to last more than a few minutes, let
>>> alone
>>> > an
>>> > > > >> hour.
>>> > > > >>
>>> > > > >> 3) "I think if my calculations are correct (and we use a 60
>>> minute
>>> > > > >> window),
>>> > > > >> then metadata generation should be slower, please see the google
>>> > > sheet I
>>> > > > >> linked above. I think given that traffic, the current topic
>>> based
>>> > RLMM
>>> > > > >> should be able to handle it."
>>> > > > >> Why is a 60 minute window used? RLMM metadata needs to be
>>> retained
>>> > > for the
>>> > > > >> longest retention time among all topics. This means that the
>>> > retention
>>> > > > >> window can be weeks instead of 1 hour. This means that RLMM
>>> might
>>> > > need to
>>> > > > >> replay over 100GB of data during reassignment, which is not
>>> what it
>>> > is
>>> > > > >> designed for.
>>> > > > >>
>>> > > > >> JR10. "Your example of 100,000 1kb/s partitions is a borderline
>>> > case,
>>> > > > >> where
>>> > > > >> there are some configurations which are not viable due to scale
>>> or
>>> > > cost,
>>> > > > >> and some that are. It would be up to the operator to tune their
>>> > > cluster,
>>> > > > >> by
>>> > > > >> changing diskless.segment.ms,
>>> > > > >> dividing up the cluster, or switching to a more scalable RLMM
>>> > > > >> implementation."
>>> > > > >> A broker with 4MB/sec produce throughput can probably be
>>> considered
>>> > > high
>>> > > > >> throughput. Even with 4K partitions per broker, we could still
>>> > > achieve an
>>> > > > >> 87.5% cost saving as listed above, if we do the right
>>> > implementation.
>>> > > So,
>>> > > > >> ideally, it would be useful to support that as well.
>>> > > > >>
>>> > > > >> JR11. "We had a short conversation with Greg and we came to the
>>> > > conclusion
>>> > > > >> that because of the explosiveness of diskless metadata, it may
>>> be
>>> > > worth
>>> > > > >> revisiting the merging case as it can indeed buy us some more
>>> cost
>>> > > saving
>>> > > > >> for the added complexity. "
>>> > > > >> If we support merging in the diskless coordinator, I wonder how
>>> > useful
>>> > > > >> RLMM
>>> > > > >> is. It seems simpler to manage all metadata from the object
>>> store
>>> > in a
>>> > > > >> single place.
>>> > > > >>
>>> > > > >> Jun
>>> > > > >>
>>> > > > >> On Mon, Apr 27, 2026 at 4:17 PM Greg Harris <
>>> [email protected]>
>>> > > wrote:
>>> > > > >>
>>> > > > >> > Hi Jun,
>>> > > > >> >
>>> > > > >> > Thank you for scrutinizing the scalability of the current
>>> > > > >> > direct-to-tiered-storage strategy, and its metadata
>>> scalability.
>>> > > > >> >
>>> > > > >> > One of our implicit assumptions with this design was that
>>> users
>>> > are
>>> > > able
>>> > > > >> > to choose between the Diskless and Classic mechanisms, and
>>> that
>>> > any
>>> > > > >> > situations where the Diskless design was deficient, the
>>> Classic
>>> > > topics
>>> > > > >> > could continue to be used.
>>> > > > >> > This was originally applied to low-latency use-cases, but now
>>> also
>>> > > > >> applies
>>> > > > >> > to low-throughput use-cases too. When the throughput on a
>>> topic is
>>> > > low,
>>> > > > >> the
>>> > > > >> > benefit of using Diskless is also low, because it is
>>> proportional
>>> > > to the
>>> > > > >> > amount of data transferred, and it is more likely that the
>>> batch
>>> > > > >> overhead
>>> > > > >> > of the topics is significant.
>>> > > > >> > In other words, we've been treating cost-effective support for
>>> > > > >> arbitrarily
>>> > > > >> > low throughput topics as a non-goal.
>>> > > > >> >
>>> > > > >> > Your example of 100,000 1kb/s partitions is a borderline case,
>>> > where
>>> > > > >> there
>>> > > > >> > are some configurations which are not viable due to scale or
>>> cost,
>>> > > and
>>> > > > >> some
>>> > > > >> > that are. It would be up to the operator to tune their
>>> cluster, by
>>> > > > >> changing
>>> > > > >> > diskless.segment.ms,
>>> > > > >> > dividing up the cluster, or switching to a more scalable RLMM
>>> > > > >> > implementation.
>>> > > > >> >
>>> > > > >> > Do you think we should have cost-effective support for
>>> arbitrarily
>>> > > > >> > low-throughput partitions in Diskless? How much total demand
>>> is
>>> > > there in
>>> > > > >> > partitions where batches are >1kb but the partition
>>> throughput is
>>> > > > >> <1kb/s?
>>> > > > >> >
>>> > > > >> > Thanks,
>>> > > > >> > Greg
>>> > > > >> >
>>> > > > >> > On Fri, Apr 24, 2026 at 10:23 AM Viktor Somogyi-Vass <
>>> > > [email protected]
>>> > > > >> >
>>> > > > >> > wrote:
>>> > > > >> >
>>> > > > >> >> Hi Jun,
>>> > > > >> >>
>>> > > > >> >> Regarding JR1.
>>> > > > >> >> We had a short conversation with Greg and we came to the
>>> > conclusion
>>> > > > >> that
>>> > > > >> >> because of the explosiveness of diskless metadata, it may be
>>> > worth
>>> > > > >> >> revisiting the merging case as it can indeed buy us some more
>>> > cost
>>> > > > >> saving
>>> > > > >> >> for the added complexity. Also, it would support smaller
>>> topics
>>> > > and we
>>> > > > >> >> could somewhat manage the tiered storage consolidation
>>> costs. I
>>> > > think
>>> > > > >> that
>>> > > > >> >> we would still need to consolidate WAL segments into tiered
>>> > > storage.
>>> > > > >> >> Reasons are: to limit WAL metadata, to be able to dynamically
>>> > > > >> >> enable/disable diskless and to be compatible with existing
>>> and
>>> > > future
>>> > > > >> TS
>>> > > > >> >> improvements.
>>> > > > >> >> I'll try to refresh KIP-1165 and build it into the calculator
>>> > > above (if
>>> > > > >> >> it's possible at all :) ) and come back to you.
>>> > > > >> >> Regardless, I just wanted to give a short update in the
>>> meantime,
>>> > > > >> looking
>>> > > > >> >> forward to your answer.
>>> > > > >> >>
>>> > > > >> >> Best,
>>> > > > >> >> Viktor
>>> > > > >> >>
>>> > > > >> >> On Fri, Apr 24, 2026 at 3:46 PM Viktor Somogyi-Vass <
>>> > > > >> >> [email protected]>
>>> > > > >> >> wrote:
>>> > > > >> >>
>>> > > > >> >> > Hi Jun,
>>> > > > >> >> >
>>> > > > >> >> > Thanks for the quick reply.
>>> > > > >> >> >
>>> > > > >> >> > JR1.
>>> > > > >> >> > 1) Thanks for putting the numbers together. While your
>>> > > calculation
>>> > > > >> >> > seems to be correct in the sense that 6 PUTs would worsen
>>> the
>>> > > cost
>>> > > > >> >> saving
>>> > > > >> >> > benefits, I think that in a byte for byte comparison there
>>> is a
>>> > > > >> bigger
>>> > > > >> >> > difference. The reason is that the 4 tiered storage puts
>>> > transfer
>>> > > > >> much
>>> > > > >> >> more
>>> > > > >> >> > data compared to the small WAL segments, so in practice
>>> there
>>> > > should
>>> > > > >> be
>>> > > > >> >> > fewer TS puts.
>>> > > > >> >> > I made a google sheet calculator for this which I'd like to
>>> > share
>>> > > > >> with
>>> > > > >> >> > you:
>>> > > > >> >> >
>>> > > > >> >>
>>> > > > >>
>>> > >
>>> >
>>> https://docs.google.com/spreadsheets/d/127GOTWfFSN27B5ezif14GPj8KtrghjBqsXG9GG6NxhI/edit?gid=749470906#gid=749470906
>>> > > > >> >> > Please copy the sheet to modify the values.
>>> > > > >> >> > About my findings: I was trying to create a similar cluster
>>> > model
>>> > > > >> that
>>> > > > >> >> has
>>> > > > >> >> > been discussed here previously to see how cost varies over
>>> > > different
>>> > > > >> >> > segment rollovers. To me it seems that Greg's previous
>>> > > suggestion
>>> > > > >> >> for a
>>> > > > >> >> > 15 min rollover may be a bit too much. With 1 hour we can
>>> > achieve
>>> > > > >> better
>>> > > > >> >> > cost saving and less coordinate metadata being stored. I
>>> have
>>> > > also
>>> > > > >> >> tried to
>>> > > > >> >> > account for the producer batch metadata generated by
>>> diskless
>>> > > > >> partitions
>>> > > > >> >> > but to me it seems like a lower number than Greg's original
>>> > > numbers.
>>> > > > >> >> >
>>> > > > >> >> > 2) "Note that local storage could be lost on reassigned
>>> > > partitions.
>>> > > > >> In
>>> > > > >> >> > that case, lagging reads can only be served from the object
>>> > > store."
>>> > > > >> >> > Yes, I think this is to be expected and a lot depends on
>>> the
>>> > > > >> >> > implementation. Ideally segments or chunks should be
>>> cached to
>>> > > > >> minimize
>>> > > > >> >> the
>>> > > > >> >> > number of times segments pulled from remote storage.
>>> > > > >> >> >
>>> > > > >> >> > "The 2MB/sec I quoted is for a specific broker. Depending
>>> on
>>> > the
>>> > > > >> broker
>>> > > > >> >> > instance type, a broker may only be able to handle low 10s
>>> of
>>> > > MB/sec
>>> > > > >> of
>>> > > > >> >> > data. So, 2MB/sec overhead is significant."
>>> > > > >> >> > Yes, I have indeed misunderstood, however I have updated my
>>> > > > >> calculator
>>> > > > >> >> > sheet with metadata calculation. Overall, the number of
>>> tiered
>>> > > > >> storage
>>> > > > >> >> > segments created seems to be much lower than in your
>>> > calculations
>>> > > > >> given
>>> > > > >> >> the
>>> > > > >> >> > parameters of the cluster you specified earlier. Please
>>> take a
>>> > > look,
>>> > > > >> I'd
>>> > > > >> >> > like to really understand the thinking here because this
>>> is a
>>> > > crucial
>>> > > > >> >> point.
>>> > > > >> >> >
>>> > > > >> >> > 3) I think if my calculations are correct (and we use a 60
>>> > minute
>>> > > > >> >> window),
>>> > > > >> >> > then metadata generation should be slower, please see the
>>> > google
>>> > > > >> sheet I
>>> > > > >> >> > linked above. I think given that traffic, the current topic
>>> > based
>>> > > > >> RLMM
>>> > > > >> >> > should be able to handle it.
>>> > > > >> >> > In the case where we would need to make the RLMM capable of
>>> > > handling
>>> > > > >> a
>>> > > > >> >> > similar traffic as the diskless coordinator, then you're
>>> right,
>>> > > we
>>> > > > >> >> probably
>>> > > > >> >> > should consider how we can improve it. I think there are
>>> > multiple
>>> > > > >> >> > possibilities as you mentioned, but ideally there should
>>> be a
>>> > > common
>>> > > > >> >> > implementation for metadata coordination that could handle
>>> > these
>>> > > > >> cases.
>>> > > > >> >> >
>>> > > > >> >> > JR7.
>>> > > > >> >> > Yes, your expectation is totally reasonable: we should expect
>>> > > > >> >> > the get and put operations to be strongly consistent for the
>>> > > > >> >> > read-after-write scenarios. And since the major cloud
>>> > > > >> >> > providers offer strongly consistent object storage, that
>>> > > > >> >> > should be sufficient for a wide user group. So we could
>>> > > > >> >> > shrink the scope of the KIP a bit this way and avoid adding
>>> > > > >> >> > complexity that is needed mostly on the margin.
>>> > > > >> >> > I would expect, though, that "list" can stay eventually
>>> > > > >> >> > consistent, as the KIP relies on it only for garbage
>>> > > > >> >> > collection, where it is fine if a few segments are collected
>>> > > > >> >> > only in the next iteration.
>>> > > > >> >> >
>>> > > > >> >> > JR3.
>>> > > > >> >> > Since Greg hasn't replied yet, I'll try to catch up with him
>>> > > > >> >> > and formulate an answer next week.
>>> > > > >> >> >
>>> > > > >> >> > Best,
>>> > > > >> >> > Viktor
>>> > > > >> >> >
>>> > > > >> >> > On Tue, Apr 21, 2026 at 8:16 PM Jun Rao via dev <
>>> > > > >> [email protected]>
>>> > > > >> >> > wrote:
>>> > > > >> >> >
>>> > > > >> >> >> Hi, Viktor,
>>> > > > >> >> >>
>>> > > > >> >> >> Thanks for the reply.
>>> > > > >> >> >>
>>> > > > >> >> >> JR1.
>>> > > > >> >> >> 1) "So while it seems to be significant that we tripled the
>>> > > > >> >> >> number of PUTs, cost-wise it doesn't seem to be significant."
>>> > > > >> >> >> Let's compare the savings achieved by replacing network
>>> > > > >> >> >> replication transfer with S3 puts in AWS.
>>> > > > >> >> >> network transfer cost: $0.02/GB = $2 * 10^-5/MB
>>> > > > >> >> >> S3 put cost: $0.005 per 1000 requests = $0.5 * 10^-5/request
>>> > > > >> >> >>
>>> > > > >> >> >> The KIP batches data up to 4MB. So, let's assume that we
>>> > > > >> >> >> write 2MB S3 objects on average.
>>> > > > >> >> >>
>>> > > > >> >> >> The cost for transferring 2MB through the network is
>>> > > > >> >> >> 2 * 2 * 10^-5 = $4 * 10^-5.
>>> > > > >> >> >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The
>>> > > > >> >> >> savings are about 75%.
>>> > > > >> >> >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The
>>> > > > >> >> >> savings are 25%. As you can see, the savings are
>>> > > > >> >> >> significantly lower.
>>> > > > >> >> >>
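The savings comparison above can be reproduced with a short sketch. The prices are the ones quoted in the message and are treated as assumptions here; the function name `savings_fraction` is hypothetical and only illustrates the arithmetic.

```python
# Prices are the ones quoted in the message above; they are assumptions here,
# not values re-checked against the current AWS price list.
NETWORK_USD_PER_MB = 0.02 / 1000   # $0.02/GB, taking 1 GB = 1000 MB
S3_PUT_USD = 0.005 / 1000          # $0.005 per 1000 PUT requests


def savings_fraction(object_mb: float, puts: int) -> float:
    """Fraction of the network-transfer cost saved by using `puts` S3 PUTs instead."""
    network_cost = object_mb * NETWORK_USD_PER_MB
    put_cost = puts * S3_PUT_USD
    return 1 - put_cost / network_cost


# 2MB average object, as assumed in the message.
savings_2_puts = savings_fraction(object_mb=2, puts=2)
savings_6_puts = savings_fraction(object_mb=2, puts=6)
print(f"2 PUTs: {savings_2_puts:.0%}, 6 PUTs: {savings_6_puts:.0%}")
```

Because the PUT cost is per request while the transfer cost is per MB, the savings fraction depends on the average object size as well as on the number of PUTs.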
>>> > > > >> >> >> 2) "Therefore we could expect classic local segments to be
>>> > > > >> >> >> present which could be used for catching up consumers."
>>> > > > >> >> >> Note that local storage could be lost on reassigned
>>> > > > >> >> >> partitions. In that case, lagging reads can only be served
>>> > > > >> >> >> from the object store.
>>> > > > >> >> >>
>>> > > > >> >> >> "Regarding the amount of metadata: 2MB/sec is well below the
>>> > > > >> >> >> 2GB/s throughput that Greg calculated previously, so I think
>>> > > > >> >> >> it should be manageable for a cluster with that amount of
>>> > > > >> >> >> throughput,"
>>> > > > >> >> >> It seems that you didn't make the correct comparison. The
>>> > > > >> >> >> 2GB/s that Greg mentioned is the throughput for the whole
>>> > > > >> >> >> cluster. The 2MB/sec I quoted is for a specific broker.
>>> > > > >> >> >> Depending on the broker instance type, a broker may only be
>>> > > > >> >> >> able to handle low 10s of MB/sec of data. So, 2MB/sec
>>> > > > >> >> >> overhead is significant.
>>> > > > >> >> >>
>>> > > > >> >> >> 3) "I'd separate it from the discussion of diskless core and
>>> > > > >> >> >> perhaps we could address it in a separate KIP as it is
>>> > > > >> >> >> mostly a redesign of the RLMM."
>>> > > > >> >> >> Those problems don't exist in the existing usage of RLMM.
>>> > > > >> >> >> They manifest because diskless tries to use RLMM in a way it
>>> > > > >> >> >> wasn't designed for (there is at least a 20X increase in
>>> > > > >> >> >> metadata). It would be useful to consider whether fixing
>>> > > > >> >> >> those problems in RLMM or using a new approach is better.
>>> > > > >> >> >> For example, KIP-1164 already introduces a snapshotting
>>> > > > >> >> >> mechanism. Adding another snapshotting mechanism to RLMM
>>> > > > >> >> >> seems redundant.
>>> > > > >> >> >>
>>> > > > >> >> >> JR7. A typical object store supports 3 operations: puts,
>>> > > > >> >> >> gets and lists. Which operations used by diskless can be
>>> > > > >> >> >> eventually consistent? I'd expect that get should always see
>>> > > > >> >> >> the result of the latest put.
>>> > > > >> >> >>
>>> > > > >> >> >> Jun
>>> > > > >> >> >>
>>> > > > >> >> >> On Mon, Apr 20, 2026 at 8:14 AM Viktor Somogyi-Vass <
>>> > > > >> [email protected]
>>> > > > >> >> >
>>> > > > >> >> >> wrote:
>>> > > > >> >> >>
>>> > > > >> >> >> > Hi Jun,
>>> > > > >> >> >> >
>>> > > > >> >> >> > I'd like to add my thoughts too until Greg has time to
>>> > > respond.
>>> > > > >> >> >> >
>>> > > > >> >> >> > JR1. I also think there are shortcomings in the current
>>> > tiered
>>> > > > >> >> storage
>>> > > > >> >> >> > design, around the RLMM.
>>> > > > >> >> >> > 1) I think this is a correct observation, however if my
>>> > > > >> calculations
>>> > > > >> >> are
>>> > > > >> >> >> > correct, it actually comes down to a negligible amount
>>> of
>>> > > cost.
>>> > > > >> >> Taking
>>> > > > >> >> >> the
>>> > > > >> >> >> > AWS pricing sheet at
>>> > > > >> >> >> > https://aws.amazon.com/s3/pricing/?nc2=h_pr_s3&trk=aebc39a1-139c-43bb-8354-211ac811b83a&sc_channel=ps
>>> > > > >> >> >> > it seems like the difference between 6 and 2 PUTs per second
>>> > > > >> >> >> > is ~$52 for a month. The calculation is as follows:
>>> > > > >> >> >> > 6*60*60*24*30*0.005/1000-2*60*60*24*30*0.005/1000=$51.84.
>>> > > > >> >> >> > So while it seems to be significant that we tripled the
>>> > > > >> >> >> > number of PUTs, cost-wise it doesn't seem to be significant.
>>> > > > >> >> >> > 2) Reflecting on your original problem: the tiered storage
>>> > > > >> >> >> > consolidation process should be continuously running and
>>> > > > >> >> >> > transforming WAL segments into classic logs. Therefore we
>>> > > > >> >> >> > could expect classic local segments to be present which
>>> > > > >> >> >> > could be used for catching up consumers. So they would only
>>> > > > >> >> >> > switch to WAL reading when they're close to the end of the
>>> > > > >> >> >> > log. Since this offset space should be cached, the reads
>>> > > > >> >> >> > from there should be fast.
>>> > > > >> >> >> > Regarding the amount of metadata: 2MB/sec is well below the
>>> > > > >> >> >> > 2GB/s throughput that Greg calculated previously, so I
>>> > > > >> >> >> > think it should be manageable for a cluster with that
>>> > > > >> >> >> > amount of throughput, although I agree with your comment
>>> > > > >> >> >> > that the current topic-based tiered metadata manager isn't
>>> > > > >> >> >> > optimal and we could develop a better solution.
>>> > > > >> >> >> > 3) Tied to the previous point, I agree that your comments
>>> > > > >> >> >> > are absolutely valid, however similarly to that, I'd
>>> > > > >> >> >> > separate it from the discussion of diskless core and
>>> > > > >> >> >> > perhaps we could address it in a separate KIP as it is
>>> > > > >> >> >> > mostly a redesign of the RLMM.
>>> > > > >> >> >> >
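The monthly PUT cost difference quoted in point 1) can be checked with a few lines. This is a minimal sketch: the per-request price and the 30-day month come from the message, while the helper name `monthly_put_cost` is hypothetical.

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30   # 30-day month, as in the message
PUT_USD = 0.005 / 1000                  # assumed price per PUT request


def monthly_put_cost(puts_per_second: float) -> float:
    """Cost in USD of sustaining `puts_per_second` PUT requests for one month."""
    return puts_per_second * SECONDS_PER_MONTH * PUT_USD


difference = monthly_put_cost(6) - monthly_put_cost(2)
print(f"${difference:.2f} per month")
```

The result is for a single stream of 6 versus 2 PUTs per second. If those rates are per broker, the total scales linearly with the number of brokers.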
>>> > > > >> >> >> > JR2. Ack. We will raise a KIP in the near future.
>>> > > > >> >> >> >
>>> > > > >> >> >> > JR3. I'd leave answering this to Greg as I don't have
>>> too
>>> > much
>>> > > > >> >> context
>>> > > > >> >> >> on
>>> > > > >> >> >> > this one.
>>> > > > >> >> >> >
>>> > > > >> >> >> > JR7. I think this could be similar to the tiered storage
>>> > > design,
>>> > > > >> so
>>> > > > >> >> any
>>> > > > >> >> >> > coordinator operation should be strongly consistent
>>> (since
>>> > > we're
>>> > > > >> >> using
>>> > > > >> >> >> > classic topics there). Therefore the WAL segment storage
>>> > layer
>>> > > > >> could
>>> > > > >> >> be
>>> > > > >> >> >> > eventually consistent as we store its metadata in a
>>> strongly
>>> > > > >> >> consistent
>>> > > > >> >> >> > manner. I'm not sure though if this was the answer
>>> you're
>>> > > looking
>>> > > > >> >> for?
>>> > > > >> >> >> >
>>> > > > >> >> >> > Best,
>>> > > > >> >> >> > Viktor
>>> > > > >> >> >> >
>>> > > > >> >> >> >
>>> > > > >> >> >> >
>>> > > > >> >> >> > On Thu, Mar 26, 2026 at 11:43 PM Jun Rao via dev <
>>> > > > >> >> [email protected]>
>>> > > > >> >> >> > wrote:
>>> > > > >> >> >> >
>>> > > > >> >> >> >> Hi, Greg,
>>> > > > >> >> >> >>
>>> > > > >> >> >> >> Thanks for the reply.
>>> > > > >> >> >> >>
>>> > > > >> >> >> >> JR1. Rolling log segments every 15 minutes addresses the 3
>>> > > > >> >> >> >> concerns I listed, but it introduces some new issues
>>> > > > >> >> >> >> because it doesn't quite fit the design of the current
>>> > > > >> >> >> >> tiered storage.
>>> > > > >> >> >> >> (a) The current tiered storage design stores a single
>>> > > > >> >> >> >> partition per object. If we roll a log segment every 15
>>> > > > >> >> >> >> minutes, with 4K partitions per broker, this means an
>>> > > > >> >> >> >> additional 4 S3 puts per second. The diskless design aims
>>> > > > >> >> >> >> for 2 S3 puts per second. So, this triples the S3 put cost
>>> > > > >> >> >> >> and reduces the savings benefits.
>>> > > > >> >> >> >> (b) With tiered storage, each broker essentially needs to
>>> > > > >> >> >> >> read the tier metadata from all tier metadata partitions
>>> > > > >> >> >> >> if the number of user partitions exceeds 50. Assume that
>>> > > > >> >> >> >> we generate 100 bytes of tier metadata per partition every
>>> > > > >> >> >> >> 15 minutes, that each broker has 4K partitions, and that
>>> > > > >> >> >> >> the cluster has 500 brokers. Each broker needs to receive
>>> > > > >> >> >> >> tier metadata at a rate of
>>> > > > >> >> >> >> 100 * 4K * 500 / (15 * 60) = 200KB/sec. A broker hosting
>>> > > > >> >> >> >> one of the 50 tier metadata topic partitions needs to send
>>> > > > >> >> >> >> out metadata at
>>> > > > >> >> >> >> 100 * 4K * 500 / 50 * 500 / (15 * 60) = 2MB/sec. This adds
>>> > > > >> >> >> >> unnecessary network and CPU overhead.
>>> > > > >> >> >> >> (c) Tiered storage doesn't support snapshots. A restarted
>>> > > > >> >> >> >> broker needs to replay the tier metadata log from the
>>> > > > >> >> >> >> beginning to build the tier metadata state. Suppose that
>>> > > > >> >> >> >> the tier metadata log is kept for 7 days. The total amount
>>> > > > >> >> >> >> of tier metadata that needs to be replayed is
>>> > > > >> >> >> >> 200KB/sec * 7 * 24 * 3600 sec = 120GB.
>>> > > > >> >> >> >> Does the merging optimization you mentioned address those
>>> > > > >> >> >> >> new concerns? If so, could you describe how it works?
>>> > > > >> >> >> >>
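The rates in (a) through (c) follow from a handful of parameters, sketched below. All constants are the assumptions stated in the message; the variable names are hypothetical. Note that the message rounds 222KB/sec down to 200KB/sec, which is why the unrounded replay size comes out somewhat above 120GB.

```python
# Parameters as assumed in the message above.
BYTES_PER_ENTRY = 100            # tier metadata per partition per roll
PARTITIONS_PER_BROKER = 4_000
BROKERS = 500
ROLL_SECONDS = 15 * 60           # one segment roll every 15 minutes
METADATA_PARTITIONS = 50         # tier metadata topic partitions
RETENTION_SECONDS = 7 * 24 * 3600

# (a) Extra S3 puts per second per broker: one object per partition per roll.
extra_puts_per_second = PARTITIONS_PER_BROKER / ROLL_SECONDS

# (b) Every broker reads the metadata produced by the whole cluster.
receive_rate = BYTES_PER_ENTRY * PARTITIONS_PER_BROKER * BROKERS / ROLL_SECONDS

# (b) A broker leading one metadata partition serves 1/50 of it to all brokers.
send_rate = receive_rate / METADATA_PARTITIONS * BROKERS

# (c) Without snapshots, a restart replays the full retained metadata log.
replay_bytes = receive_rate * RETENTION_SECONDS

print(f"extra puts/sec per broker: {extra_puts_per_second:.2f}")
print(f"receive: {receive_rate / 1e3:.0f} KB/sec, send: {send_rate / 1e6:.2f} MB/sec")
print(f"replay on restart: {replay_bytes / 1e9:.1f} GB")
```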
>>> > > > >> >> >> >> JR2. It's fine to cover the default partition
>>> assignment
>>> > > strategy
>>> > > > >> >> for
>>> > > > >> >> >> >> diskless topics in a separate KIP. However, since this
>>> is
>>> > > > >> essential
>>> > > > >> >> for
>>> > > > >> >> >> >> achieving the cost saving goal, we need a solution
>>> before
>>> > > > >> releasing
>>> > > > >> >> the
>>> > > > >> >> >> >> diskless KIP.
>>> > > > >> >> >> >>
>>> > > > >> >> >> >> JR3. Sounds good. Could you document how this works?
>>> > > > >> >> >> >>
>>> > > > >> >> >> >> JR7. Could you describe which parts of the operation
>>> can be
>>> > > > >> >> eventually
>>> > > > >> >> >> >> consistent?
>>> > > > >> >> >> >>
>>> > > > >> >> >> >> Jun
>>> > > > >> >> >> >>
>>> > > > >> >> >> >> On Thu, Mar 19, 2026 at 1:35 PM Greg Harris <
>>> > > > >> [email protected]>
>>> > > > >> >> >> wrote:
>>> > > > >> >> >> >>
>>> > > > >> >> >> >> > Hi Jun,
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > Thanks for your comments!
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > JR1:
>>> > > > >> >> >> >> > You are correct that the segment rolling
>>> configurations
>>> > are
>>> > > > >> >> currently
>>> > > > >> >> >> >> > critical to balance the scalability of Diskless and
>>> > Tiered
>>> > > > >> >> Storage,
>>> > > > >> >> >> as
>>> > > > >> >> >> >> > larger roll configurations benefit tiered storage,
>>> and
>>> > > smaller
>>> > > > >> >> roll
>>> > > > >> >> >> >> > configurations benefit Diskless.
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > To address your points specifically:
>>> > > > >> >> >> >> > (1) A Diskless topic which is cost-competitive with
>>> an
>>> > > > >> equivalent
>>> > > > >> >> >> >> Classic
>>> > > > >> >> >> >> > topic will have a metadata size <1% of the data
>>> size. A
>>> > > cluster
>>> > > > >> >> >> storing
>>> > > > >> >> >> >> > 360GB of metadata will have >36TB of data under
>>> > management
>>> > > and
>>> > > > >> a
>>> > > > >> >> >> >> retention
>>> > > > >> >> >> >> > of 5hr implies a throughput of >2GB/s. This will
>>> require
>>> > > > >> multiple
>>> > > > >> >> >> >> Diskless
>>> > > > >> >> >> >> > coordinators, which can share the load of storing the
>>> > > Diskless
>>> > > > >> >> >> metadata,
>>> > > > >> >> >> >> > and serving Diskless requests.
>>> > > > >> >> >> >> > (2) Catching up consumers are intended to be served
>>> from
>>> > > tiered
>>> > > > >> >> >> storage
>>> > > > >> >> >> >> > and local segment caches. Brokers which are building
>>> > their
>>> > > > >> local
>>> > > > >> >> >> segment
>>> > > > >> >> >> >> > caches will have to read many files, but will
>>> amortize
>>> > > those
>>> > > > >> >> reads by
>>> > > > >> >> >> >> > receiving data for multiple partitions in a single
>>> read.
>>> > > > >> >> >> >> > (3) This is a fundamental downside of storing data
>>> from
>>> > > > >> multiple
>>> > > > >> >> >> topics
>>> > > > >> >> >> >> in
>>> > > > >> >> >> >> > a single object, similar to classic segments. We can
>>> > > implement
>>> > > > >> a
>>> > > > >> >> >> >> > configurable cluster-wide maximum roll time, which
>>> would
>>> > > set
>>> > > > >> the
>>> > > > >> >> >> slowest
>>> > > > >> >> >> >> > cadence at which Tiered Storage segments are rolled
>>> from
>>> > > > >> Diskless
>>> > > > >> >> >> >> segments.
>>> > > > >> >> >> >> > If an individual partition has more aggressive roll
>>> > > settings,
>>> > > > >> it
>>> > > > >> >> may
>>> > > > >> >> >> be
>>> > > > >> >> >> >> > rolled earlier.
>>> > > > >> >> >> >> > This configuration would permit the cluster operator
>>> to
>>> > > > >> >> approximately
>>> > > > >> >> >> >> > bound the number of diskless WAL segments, which
>>> bounds
>>> > the
>>> > > > >> total
>>> > > > >> >> >> size
>>> > > > >> >> >> >> of
>>> > > > >> >> >> >> > the WAL segments, disk cache, diskless coordinator
>>> state,
>>> > > and
>>> > > > >> >> >> excessive
>>> > > > >> >> >> >> > retention window. For example, a diskless.segment.ms
>>> > > > >> >> >> >> > of 15 minutes
>>> > > > >> >> >> >> would
>>> > > > >> >> >> >> > reduce the metadata storage to 18GB, WAL segments to
>>> > > 1.8TB, and
>>> > > > >> >> >> permit
>>> > > > >> >> >> >> > short-retention data to be physically deleted as
>>> soon as
>>> > > ~15
>>> > > > >> >> minutes
>>> > > > >> >> >> >> after
>>> > > > >> >> >> >> > being produced.
>>> > > > >> >> >> >> > Of course, this will reduce the size of the tiered
>>> > storage
>>> > > > >> >> segments
>>> > > > >> >> >> for
>>> > > > >> >> >> >> > topics that have low throughput, and where segment.ms >
>>> > > > >> >> >> >> > diskless.segment.ms, increasing overhead in the RLMM. We can perform
>>> > > > >> >> >> >> > merging/optimization of Tiered Storage segments to achieve
>>> > > > >> >> >> >> > the per-topic segment.ms.
>>> > > > >> >> >> >> > There were some reasons why we retracted the prior
>>> > > file-merging
>>> > > > >> >> >> >> approach,
>>> > > > >> >> >> >> > and why merging in tiered storage appears better:
>>> > > > >> >> >> >> > * Rewriting files requires mutability for existing
>>> data,
>>> > > which
>>> > > > >> >> adds
>>> > > > >> >> >> >> > complexity. Diskless batches or Remote Log Segments
>>> would
>>> > > need
>>> > > > >> to
>>> > > > >> >> be
>>> > > > >> >> >> >> made
>>> > > > >> >> >> >> > mutable, and the remote log will be made mutable in
>>> > > KIP-1272
>>> > > > >> [1]
>>> > > > >> >> >> >> > * Because a WAL Segment can contain batches from
>>> multiple
>>> > > > >> Diskless
>>> > > > >> >> >> >> > Coordinators, multiple coordinators must also be
>>> involved
>>> > > in
>>> > > > >> the
>>> > > > >> >> >> merging
>>> > > > >> >> >> >> > step. The Tiered Storage design has exclusive
>>> ownership
>>> > for
>>> > > > >> remote
>>> > > > >> >> >> log
>>> > > > >> >> >> >> > segments within the RLMM.
>>> > > > >> >> >> >> > * Diskless file merging competes for resources with
>>> > > > >> >> latency-sensitive
>>> > > > >> >> >> >> > producers and hot consumers. Tiered storage file
>>> merging
>>> > > > >> competes
>>> > > > >> >> for
>>> > > > >> >> >> >> > resources with lagging consumers, which are typically
>>> > less
>>> > > > >> latency
>>> > > > >> >> >> >> > sensitive.
>>> > > > >> >> >> >> > * Implementing merging in Tiered Storage allows this
>>> > > > >> optimization
>>> > > > >> >> to
>>> > > > >> >> >> >> > benefit both classic topics and diskless topics,
>>> covering
>>> > > both
>>> > > > >> >> high
>>> > > > >> >> >> and
>>> > > > >> >> >> >> low
>>> > > > >> >> >> >> > throughput partitions.
>>> > > > >> >> >> >> > * Remote log segments may be optimized over much
>>> longer
>>> > > time
>>> > > > >> >> windows
>>> > > > >> >> >> >> > rather than performing optimization once in the
>>> first few
>>> > > > >> hours of
>>> > > > >> >> >> the
>>> > > > >> >> >> >> life
>>> > > > >> >> >> >> > of a WAL segment and then freezing the arrangement
>>> of the
>>> > > data
>>> > > > >> >> until
>>> > > > >> >> >> it
>>> > > > >> >> >> >> is
>>> > > > >> >> >> >> > deleted.
>>> > > > >> >> >> >> > * File merging will need to rely on heuristics, which
>>> > > should be
>>> > > > >> >> >> >> > configurable by the user. Multi-partition heuristics
>>> are
>>> > > more
>>> > > > >> >> >> >> complicated
>>> > > > >> >> >> >> > to describe and reason about than single-partition
>>> > > heuristics.
>>> > > > >> >> >> >> > What do you think of this alternative?
>>> > > > >> >> >> >> >
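The sizing figures in point (1) and in the 15-minute example of point (3) can be derived from one another. The sketch below assumes metadata is exactly 1% of data (the message states less than 1%, so the data and throughput figures are lower bounds) and that both metadata and WAL volume scale linearly with the roll interval; the variable names are hypothetical.

```python
METADATA_BYTES = 360e9          # metadata volume discussed in the thread
METADATA_RATIO = 0.01           # assumed: metadata is 1% of data (upper bound)
RETENTION_SECONDS = 5 * 3600    # 5 hours of data held before tiering

# Data under management and the throughput needed to accumulate it in 5 hours.
data_bytes = METADATA_BYTES / METADATA_RATIO
throughput = data_bytes / RETENTION_SECONDS

# Shortening the roll interval to 15 minutes scales both volumes linearly.
scale = (15 * 60) / RETENTION_SECONDS
metadata_15min = METADATA_BYTES * scale
wal_15min = data_bytes * scale

print(f"data: {data_bytes / 1e12:.0f} TB, throughput: {throughput / 1e9:.0f} GB/s")
print(f"15 min roll: {metadata_15min / 1e9:.0f} GB metadata, {wal_15min / 1e12:.1f} TB WAL")
```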
>>> > > > >> >> >> >> > JR2:
>>> > > > >> >> >> >> > Yes, the current default partition assignment
>>> strategy
>>> > will
>>> > > > >> need
>>> > > > >> >> some
>>> > > > >> >> >> >> > improvement. This problem with Diskless WAL segments
>>> is
>>> > > > >> analogous
>>> > > > >> >> to
>>> > > > >> >> >> the
>>> > > > >> >> >> >> > Classic topics’ dense inter-broker connection graph.
>>> > > > >> >> >> >> > The natural solution to this seems to be some sort of
>>> > > cellular
>>> > > > >> >> >> design,
>>> > > > >> >> >> >> > where the replica placements tend to locate
>>> partitions in
>>> > > > >> similar
>>> > > > >> >> >> >> groups.
>>> > > > >> >> >> >> > Partitions in the same cell can generally share the
>>> same
>>> > > WAL
>>> > > > >> >> Segments
>>> > > > >> >> >> >> and
>>> > > > >> >> >> >> > the same Diskless Coordinator requests. This would
>>> also
>>> > > benefit
>>> > > > >> >> >> Classic
>>> > > > >> >> >> >> > topics, which would need fewer connections and fetch
>>> > > requests.
>>> > > > >> >> >> >> > Such a feature is out-of-scope of this KIP, and
>>> either we
>>> > > will
>>> > > > >> >> >> publish a
>>> > > > >> >> >> >> > follow-up KIP, or let operators and community tooling
>>> > > address
>>> > > > >> >> this.
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > JR3:
>>> > > > >> >> >> >> > Yes we will replace the ISR/ELR election logic for
>>> > diskless
>>> > > > >> >> topics,
>>> > > > >> >> >> as
>>> > > > >> >> >> >> > they no longer rely on replicas for data integrity.
>>> We
>>> > will
>>> > > > >> fully
>>> > > > >> >> >> model
>>> > > > >> >> >> >> the
>>> > > > >> >> >> >> > state/lifecycle of the diskless replicas in KRaft,
>>> and
>>> > > choose
>>> > > > >> how
>>> > > > >> >> we
>>> > > > >> >> >> >> > display this to clients.
>>> > > > >> >> >> >> > For backwards compatibility, clients using older
>>> metadata
>>> > > > >> requests
>>> > > > >> >> >> >> should
>>> > > > >> >> >> >> > see diskless topics, but interpret them as classic
>>> > topics.
>>> > > We
>>> > > > >> >> could
>>> > > > >> >> >> tell
>>> > > > >> >> >> >> > older clients that the leader is in the ISR, even if
>>> it
>>> > > just
>>> > > > >> >> started
>>> > > > >> >> >> >> > building its cache.
>>> > > > >> >> >> >> > For clients using the latest metadata, they should
>>> see
>>> > the
>>> > > true
>>> > > > >> >> >> state of
>>> > > > >> >> >> >> > the diskless partition: which nodes can accept
>>> > > > >> >> >> produce/fetch/sharefetch
>>> > > > >> >> >> >> > requests, which ranges of offsets are cached
>>> on-broker,
>>> > > etc.
>>> > > > >> This
>>> > > > >> >> >> could
>>> > > > >> >> >> >> > also be used to break apart the “leader” field into
>>> more
>>> > > > >> granular
>>> > > > >> >> >> >> fields,
>>> > > > >> >> >> >> > now that leadership has changed meaning.
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > JR4:
>>> > > > >> >> >> >> > Yes, we can replace the empty fetch requests to the
>>> > leader
>>> > > > >> nodes
>>> > > > >> >> with
>>> > > > >> >> >> >> > cache hint fields in the requests to the Diskless
>>> > > Coordinator,
>>> > > > >> and
>>> > > > >> >> >> rely
>>> > > > >> >> >> >> on
>>> > > > >> >> >> >> > the coordinator to distribute cache hints to all
>>> > replicas.
>>> > > This
>>> > > > >> >> >> should
>>> > > > >> >> >> >> be
>>> > > > >> >> >> >> > low-overhead, and eliminate the inter-broker
>>> > communication
>>> > > for
>>> > > > >> >> >> brokers
>>> > > > >> >> >> >> > which only host Diskless topics.
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > JR5.1:
>>> > > > >> >> >> >> > You are correct and this text was ambiguous, only
>>> > > specifying
>>> > > > >> that
>>> > > > >> >> the
>>> > > > >> >> >> >> > controller waits for the sync to be complete. This
>>> > section
>>> > > is
>>> > > > >> now
>>> > > > >> >> >> >> updated
>>> > > > >> >> >> >> > to explicitly say that local segments are built from
>>> > object
>>> > > > >> >> storage.
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > JR5.2:
>>> > > > >> >> >> >> > Extending the JR2 discussion, reassignment of
>>> diskless
>>> > > topics
>>> > > > >> >> would
>>> > > > >> >> >> >> > generally happen within a cell, where the marginal
>>> cost
>>> > of
>>> > > > >> >> reading an
>>> > > > >> >> >> >> > additional partition is very low. When cells are
>>> > > re-balanced
>>> > > > >> and a
>>> > > > >> >> >> >> > partition is migrated between cells, there is a brief
>>> > time
>>> > > > >> (until
>>> > > > >> >> the
>>> > > > >> >> >> >> next
>>> > > > >> >> >> >> > Tiered Storage segment roll) when the marginal cost
>>> is
>>> > > doubled.
>>> > > > >> >> This
>>> > > > >> >> >> >> should
>>> > > > >> >> >> >> > be infrequent and well-amortized by other topics
>>> which
>>> > > aren’t
>>> > > > >> >> being
>>> > > > >> >> >> >> > re-balanced between cells.
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > JR6.1:
>>> > > > >> >> >> >> > We plan to move data from Diskless to Tiered Storage.
>>> > Once
>>> > > the
>>> > > > >> >> data
>>> > > > >> >> >> is
>>> > > > >> >> >> >> in
>>> > > > >> >> >> >> > Tiered Storage, it can be compacted using the
>>> > functionality
>>> > > > >> >> >> described in
>>> > > > >> >> >> >> > KIP-1272 [1]
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > JR6.2:
>>> > > > >> >> >> >> > We will add details for this soon.
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > JR7:
>>> > > > >> >> >> >> > We specify the requirement of eventual consistency to
>>> > allow
>>> > > > >> >> Diskless
>>> > > > >> >> >> >> > Topics to be used with other object storage
>>> > implementations
>>> > > > >> which
>>> > > > >> >> >> aren’t
>>> > > > >> >> >> >> > the three major public clouds, such as self-managed
>>> > > software or
>>> > > > >> >> >> weaker
>>> > > > >> >> >> >> > consistency caches.
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > Thanks,
>>> > > > >> >> >> >> > Greg
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > [1]
>>> > > > >> >> >> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> > On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev <
>>> > > > >> >> [email protected]
>>> > > > >> >> >> >
>>> > > > >> >> >> >> > wrote:
>>> > > > >> >> >> >> >
>>> > > > >> >> >> >> >> Hi, Ivan,
>>> > > > >> >> >> >> >>
>>> > > > >> >> >> >> >> Thanks for the KIP. A few comments below.
>>> > > > >> >> >> >> >>
>>> > > > >> >> >> >> >> JR1. I am concerned about the usage of the current
>>> > tiered
>>> > > > >> >> storage to
>>> > > > >> >> >> >> >> control the number of small WAL files. Current
>>> tiered
>>> > > storage
>>> > > > >> >> only
>>> > > > >> >> >> >> tiers
>>> > > > >> >> >> >> >> the data when a segment rolls, which can take hours.
>>> > This
>>> > > > >> causes
>>> > > > >> >> >> three
>>> > > > >> >> >> >> >> problems. (1) Much more metadata needs to be stored
>>> and
>>> > > > >> >> maintained,
>>> > > > >> >> >> >> which
>>> > > > >> >> >> >> >> increases the cost. Suppose that each segment rolls
>>> > every
>>> > > 5
>>> > > > >> >> hours,
>>> > > > >> >> >> each
>>> > > > >> >> >> >> >> partition generates 2 WAL files per second and each
>>> WAL
>>> > > file's
>>> > > > >> >> >> metadata
>>> > > > >> >> >> >> >> takes 100 bytes. Each partition will generate 5 *
>>> 3.6K *
>>> > > 2 *
>>> > > > >> 100
>>> > > > >> >> =
>>> > > > >> >> >> >> 3.6MB
>>> > > > >> >> >> >> >> of
>>> > > > >> >> >> >> >> metadata. In a cluster with 100K partitions, this
>>> > > translates
>>> > > > >> to
>>> > > > >> >> >> 360GB
>>> > > > >> >> >> >> of
>>> > > > >> >> >> >> >> metadata stored on the diskless coordinators. (2) A
>>> > > > >> catching-up
>>> > > > >> >> >> >> consumer's
>>> > > > >> >> >> >> >> performance degrades since it's forced to read data
>>> from
>>> > > many
>>> > > > >> >> small
>>> > > > >> >> >> WAL
>>> > > > >> >> >> >> >> files. (3) The data in WAL fi
>>
>>
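The 360GB metadata estimate in JR1 (1) of the earliest quoted message can be reproduced as follows. This is a sketch under the assumptions stated there (5-hour roll, 2 WAL files per second per partition, 100 bytes per file, 100K partitions); the function name `coordinator_metadata_bytes` is hypothetical.

```python
def coordinator_metadata_bytes(roll_hours: int, files_per_second: int,
                               bytes_per_file: int, partitions: int):
    """Return (per-partition bytes, cluster-wide bytes) of WAL file metadata
    accumulated between two tiered storage segment rolls."""
    per_partition = roll_hours * 3600 * files_per_second * bytes_per_file
    return per_partition, per_partition * partitions


per_partition, total = coordinator_metadata_bytes(
    roll_hours=5, files_per_second=2, bytes_per_file=100, partitions=100_000)
print(f"{per_partition / 1e6:.1f} MB per partition, {total / 1e9:.0f} GB in total")
```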
