Hi Jun,

JR1
(1)-(3) I'd like to address these together by explaining our current idea
for solving the tiny object problem, because I'm not sure we're 100%
talking about the same thing. I have two approaches in mind for TS
consolidation ((A) and (B)), and I'm not sure we're both assuming the same
one, so let's clarify this first.

(A)
This is our current assumption. It uses local disks (creating classic
local logs with UnifiedLog) to consolidate diskless data into the classic
log format, then uses RSM and RLMM to store the result in tiered storage.
This way we're not limited by the need for short rollovers. Local logs
become a form of staging area that serves reads and accumulates records
for tiered storage. This means that:
 (a) Once a message is consolidated into the classic log format, we can use
it to serve lagging consumers. Diskless reads should really only be used
for the head of the log; after a few seconds, logs should be consolidated.
 (b) The real cost is much closer to that 87.5% (in fact, the Google Sheet
I shared also assumes this model) because we have more freedom in choosing
the retention parameters of the classic log.
 (c) Metadata is smaller as we only need to keep diskless segments until
the tiered offset surpasses the individual batches' offset.
 (d) RLMM metadata is also somewhat manageable due to the larger segment
sizes but it's still possible to run into the metadata explosion problem.
 (e) The local log needs to be rebuilt on reassignment to serve lagging
consumers effectively, so reassignment is a bit messier.
 (f) It's not optimal when partitions have a single replica: on failure we
can only fall back to diskless mode until the partition is reassigned to a
functioning broker.
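
To make (b) concrete, here is a minimal sketch of the arithmetic behind
the 87.5% figure, using the AWS prices Jun quotes further down in this
thread (network transfer at $0.02/GB, S3 PUTs at $0.005 per 1,000
requests, 4MB of data per write cycle):

```python
# Prices from Jun's calculation below: $0.02/GB network, $0.005/1000 PUTs.
NETWORK_COST_PER_MB = 2e-5   # $ per MB of cross-AZ replication traffic
S3_PUT_COST = 0.5e-5         # $ per PUT request

def diskless_savings(mb: float, puts: int) -> float:
    """Fraction saved when replicating `mb` MB over the network is
    replaced by `puts` S3 PUT requests."""
    network = mb * NETWORK_COST_PER_MB
    return (network - puts * S3_PUT_COST) / network

print(round(diskless_savings(4, 2), 4))  # 0.875 -> 87.5% saving with 2 PUTs
print(round(diskless_savings(4, 6), 4))  # 0.625 -> 62.5% saving with 6 PUTs
```

With approach (A) we stay closer to the 2-PUT case, which is why the real
saving stays near 87.5%.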

(B)
An alternative to the above is to consolidate when diskless segments
expire (after 15 minutes, for instance). In that case your points fit
better as:
 (a) we can only use the classic, consolidated logs to serve lagging
consumers after they have been tiered
 (b) to be more efficient with lagging consumers we have to stick to a
short rollover
 (c) it's more costly due to the short rollovers
 (d) the RLMM bottleneck still exists due to the short rollovers
 (e) it's not a given that we use local disks for transforming logs, as we
can do it in memory too (which can be inefficient and more expensive), but
perhaps the "chunked transfer encoding" that S3 supports (or similar
features from other providers) is a cost-effective way. If we know the
final size in advance, we can upload data in chunks and still get billed
for 1 put.
 (f) reassignment and failover are cleaner and faster, as there is no need
to rebuild local caches.
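
A back-of-the-envelope sketch of why (b) and (c) hurt, assuming one
consolidated-segment PUT per partition per rollover interval (the
100,000-partition figure comes from Jun's JR10 example below; the per-PUT
price is AWS's $0.005 per 1,000 requests):

```python
# Per-partition S3 PUT cost of rolling one segment per rollover interval.
S3_PUT_COST = 0.005 / 1000        # $ per PUT request (AWS)
SECONDS_PER_MONTH = 30 * 24 * 3600

def monthly_put_cost(rollover_seconds: int, partitions: int = 1) -> float:
    puts = SECONDS_PER_MONTH / rollover_seconds * partitions
    return puts * S3_PUT_COST

# 15-minute vs 1-hour rollover across Jun's 100,000-partition example:
print(round(monthly_put_cost(15 * 60, 100_000), 2))  # 1440.0 ($/month)
print(round(monthly_put_cost(60 * 60, 100_000), 2))  # 360.0 ($/month)
```

So sticking to short rollovers for the sake of lagging consumers has a
direct, linear PUT-cost penalty.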

(C)
Apart from the first 2 approaches there is a 3rd, which is WAL merging. To
make sure I understand your points, let me summarize what I could gather
so far as the reasons for WAL merging (and please correct me if I missed
something):
 (i) protecting consumer lag: small WAL files create inefficient objects
for lagging consumers, so larger objects should be more efficient
 (ii) avoiding the RLMM replay bottleneck: managing small segments with
RLMM is very inefficient (100s of GB metadata)
 (iii) reducing batch metadata overhead: merging WAL files may reduce the
metadata we need to store, but it depends on the merge algorithm and how we
can compact batch data
 (iv) cost effectiveness: retrieving merged WAL files reduces the number of
get requests to object storage
 (v) architectural redundancy with RLMM: ideally we wouldn't need 2
solutions to 2 somewhat similar problems (tiered storage and diskless)
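
To put (ii) in numbers, here is the arithmetic behind Jun's replay
estimate quoted further down (100 bytes of tier metadata per partition
every 15 minutes, 4K partitions per broker, 500 brokers, 7-day metadata
retention); the unrounded results differ slightly from his rounded
200KB/sec and 120GB:

```python
# Reproducing Jun's RLMM metadata estimate quoted later in this thread.
BYTES_PER_ENTRY = 100          # tier metadata per partition per roll
PARTITIONS_PER_BROKER = 4_000
BROKERS = 500
ROLL_SECONDS = 15 * 60

# Cluster-wide tier-metadata production rate.
rate = BYTES_PER_ENTRY * PARTITIONS_PER_BROKER * BROKERS / ROLL_SECONDS
print(round(rate / 1000))       # 222 (KB/sec; Jun rounds to 200KB/sec)

# Metadata a restarted broker must replay with 7 days of retention.
replay_bytes = rate * 7 * 24 * 3600
print(round(replay_bytes / 1e9, 1))  # 134.4 (GB; Jun's figure: 120GB)
```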

Generally I think that (i), (ii), (iii) and (iv) may be addressed with
aggressive tiered storage consolidation (approach (A)), so the only
remaining gap would be (v). I also agree that having 2 different solutions
for metadata handling isn't ideal and perhaps there is a possibility of
improvement here. It should be possible to redesign RLMM to be more similar
to the diskless coordinator or design a common solution.

JR11
"If we support merging in the diskless coordinator, I wonder how useful RLMM
is. It seems simpler to manage all metadata from the object store in a
single place."

Could you please clarify this a bit? Do you think we should replace the
RLMM with a solution more similar to the diskless coordinator, or
deprecate tiered storage altogether in favor of diskless? I'm not sure
which option you're referring to:
 (1) Unify tiered storage and diskless under a single storage layer (and
possibly deprecate tiered storage in favor of diskless with merging WAL
segments).
 (2) Create a smart coordinator instead of RLMM and possibly unify metadata
coordination with diskless.
 (3) Keep tiered storage and diskless separate with their own solutions for
metadata (probably not optimal).

Thanks,
Viktor

On Fri, May 1, 2026 at 9:08 PM Jun Rao via dev <[email protected]> wrote:

> Hi, Viktor and Greg,
>
> Thanks for the reply.
>
> JR1.
> 1) Thanks for verifying the cost estimation. I noticed a bug in my earlier
> calculation. I estimated the per broker network transfer rate at 2MB/sec.
> It should be 4MB/sec. If I correct it, the estimated savings are similar to
> yours.
> The cost for transferring 4MB through the network is 4 * 2 * 10^-5 = $8*
> 10^-5
> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings are
> about 87.5%.
> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings are
> 62.5%.
> Savings are still significantly lower when using RLMM.
>
> "To me it seems like that Greg's previous suggestion for a 15 min rollover
> may be a bit too much. With 1 hour we can achieve better cost saving and
> less coordinate metadata being stored."
> This solves the cost issue, but it has other implications (see point 2)
> below).
>
> 2) "Yes, I think this is to be expected and a lot depends on the
> implementation. Ideally segments or chunks should be cached to minimize the
> number of times segments pulled from remote storage."
> In a classic topic, when a consumer lags, its requests are served either
> from the local cache or from large objects in the object store. With the
> current design in a diskless topic, lagging consumer requests might be
> served from tiny 500-byte objects. This will significantly slow down the
> consumer's catch-up, which is not expected user behavior. Ideally, we don't
> want those tiny objects to last more than a few minutes, let alone an hour.
>
> 3) "I think if my calculations are correct (and we use a 60 minute window),
> then metadata generation should be slower, please see the google sheet I
> linked above. I think given that traffic, the current topic based RLMM
> should be able to handle it."
> Why is a 60 minute window used? RLMM metadata needs to be retained for the
> longest retention time among all topics. This means that the retention
> window can be weeks instead of 1 hour. This means that RLMM might need to
> replay over 100GB of data during reassignment, which is not what it is
> designed for.
>
> JR10. "Your example of 100,000 1kb/s partitions is a borderline case, where
> there are some configurations which are not viable due to scale or cost,
> and some that are. It would be up to the operator to tune their cluster, by
> changing diskless.segment.ms,
> dividing up the cluster, or switching to a more scalable RLMM
> implementation."
> A broker with 4MB/sec produce throughput can probably be considered high
> throughput. Even with 4K partitions per broker, we could still achieve an
> 87.5% cost saving as listed above, if we do the right implementation. So,
> ideally, it would be useful to support that as well.
>
> JR11. "We had a short conversation with Greg and we came to the conclusion
> that because of the explosiveness of diskless metadata, it may be worth
> revisiting the merging case as it can indeed buy us some more cost saving
> for the added complexity. "
> If we support merging in the diskless coordinator, I wonder how useful RLMM
> is. It seems simpler to manage all metadata from the object store in a
> single place.
>
> Jun
>
> On Mon, Apr 27, 2026 at 4:17 PM Greg Harris <[email protected]> wrote:
>
> > Hi Jun,
> >
> > Thank you for scrutinizing the scalability of the current
> > direct-to-tiered-storage strategy, and its metadata scalability.
> >
> > One of our implicit assumptions with this design was that users are able
> > to choose between the Diskless and Classic mechanisms, and that any
> > situations where the Diskless design was deficient, the Classic topics
> > could continue to be used.
> > This was originally applied to low-latency use-cases, but now also
> applies
> > to low-throughput use-cases too. When the throughput on a topic is low,
> the
> > benefit of using Diskless is also low, because it is proportional to the
> > amount of data transferred, and it is more likely that the batch overhead
> > of the topics is significant.
> > In other words, we've been treating cost-effective support for
> arbitrarily
> > low throughput topics as a non-goal.
> >
> > Your example of 100,000 1kb/s partitions is a borderline case, where
> there
> > are some configurations which are not viable due to scale or cost, and
> some
> > that are. It would be up to the operator to tune their cluster, by
> changing
> > diskless.segment.ms,
> > dividing up the cluster, or switching to a more scalable RLMM
> > implementation.
> >
> > Do you think we should have cost-effective support for arbitrarily
> > low-throughput partitions in Diskless? How much total demand is there in
> > partitions where batches are >1kb but the partition throughput is <1kb/s?
> >
> > Thanks,
> > Greg
> >
> > On Fri, Apr 24, 2026 at 10:23 AM Viktor Somogyi-Vass <[email protected]>
> > wrote:
> >
> >> Hi Jun,
> >>
> >> Regarding JR1.
> >> We had a short conversation with Greg and we came to the conclusion that
> >> because of the explosiveness of diskless metadata, it may be worth
> >> revisiting the merging case as it can indeed buy us some more cost
> saving
> >> for the added complexity. Also, it would support smaller topics and we
> >> could somewhat manage the tiered storage consolidation costs. I think
> that
> >> we would still need to consolidate WAL segments into tiered storage.
> >> Reasons are: to limit WAL metadata, to be able to dynamically
> >> enable/disable diskless and to be compatible with existing and future TS
> >> improvements.
> >> I'll try to refresh KIP-1165 and build it into the calculator above (if
> >> it's possible at all :) ) and come back to you.
> >> Regardless, I just wanted to give a short update in the meantime,
> looking
> >> forward to your answer.
> >>
> >> Best,
> >> Viktor
> >>
> >> On Fri, Apr 24, 2026 at 3:46 PM Viktor Somogyi-Vass <
> >> [email protected]>
> >> wrote:
> >>
> >> > Hi Jun,
> >> >
> >> > Thanks for the quick reply.
> >> >
> >> > JR1.
> >> > 1) Thanks for putting the numbers together. While your calculation
> >> > seems to be correct in the sense that 6 PUTs would worsen the cost
> >> saving
> >> > benefits, I think that in a byte for byte comparison there is a bigger
> >> > difference. The reason is that the 4 tiered storage puts transfer much
> >> more
> >> > data compared to the small WAL segments, so in practice there should
> be
> >> > fewer TS puts.
> >> > I made a google sheet calculator for this which I'd like to share with
> >> > you:
> >> >
> >>
> https://docs.google.com/spreadsheets/d/127GOTWfFSN27B5ezif14GPj8KtrghjBqsXG9GG6NxhI/edit?gid=749470906#gid=749470906
> >> > Please copy the sheet to modify the values.
> >> > About my findings: I was trying to create a similar cluster model that
> >> has
> >> > been discussed here previously to see how cost varies over different
> >> > segment rollovers.To me it seems like that Greg's previous suggestion
> >> for a
> >> > 15 min rollover may be a bit too much. With 1 hour we can achieve
> better
> >> > cost saving and less coordinate metadata being stored. I have also
> >> tried to
> >> > account for the producer batch metadata generated by diskless
> partitions
> >> > but to me it seems like a lower number than Greg's original numbers.
> >> >
> >> > 2) "Note that local storage could be lost on reassigned partitions. In
> >> > that case, lagging reads can only be served from the object store."
> >> > Yes, I think this is to be expected and a lot depends on the
> >> > implementation. Ideally segments or chunks should be cached to
> minimize
> >> the
> >> > number of times segments pulled from remote storage.
> >> >
> >> > "The 2MB/sec I quoted is for a specific broker. Depending on the
> broker
> >> > instance type, a broker may only be able to handle low 10s of MB/sec
> of
> >> > data. So, 2MB/sec overhead is significant."
> >> > Yes, I have indeed misunderstood, however I have updated my calculator
> >> > sheet with metadata calculation. Overall, the number of tiered storage
> >> > segments created seems to be much lower than in your calculations
> given
> >> the
> >> > parameters of the cluster you specified earlier. Please take a look,
> I'd
> >> > like to really understand the thinking here because this is a crucial
> >> point.
> >> >
> >> > 3) I think if my calculations are correct (and we use a 60 minute
> >> window),
> >> > then metadata generation should be slower, please see the google
> sheet I
> >> > linked above. I think given that traffic, the current topic based RLMM
> >> > should be able to handle it.
> >> > In the case where we would need to make the RLMM capable of handling a
> >> > similar traffic as the diskless coordinator, then you're right, we
> >> probably
> >> > should consider how we can improve it. I think there are multiple
> >> > possibilities as you mentioned, but ideally there should be a common
> >> > implementation for metadata coordination that could handle these
> cases.
> >> >
> >> > JR7.
> >> > Yes, your expectation is totally reasonable, we should expect the get
> >> and
> >> > put operations to be strongly consistent for the read-after-write
> >> > scenarios. And I think that since major cloud providers give strongly
> >> > consistent object storages, it should be sufficient for a wide
> >> user-group.
> >> > So we could shrink the scope of the KIP a bit this way and avoid
> adding
> >> > complexity that is needed mostly on the margin.
> >> > I can expect though that "list" can stay eventually consistent as the
> >> KIP
> >> > relies on it for only garbage collection where it is fine if a few
> >> segments
> >> > can be collected only in the next iteration.
> >> >
> >> > JR3.
> >> > Since Greg hasn't replied yet, I'll try to catch up with him and
> >> formulate
> >> > an answer next week.
> >> >
> >> > Best,
> >> > Viktor
> >> >
> >> > On Tue, Apr 21, 2026 at 8:16 PM Jun Rao via dev <[email protected]
> >
> >> > wrote:
> >> >
> >> >> Hi, Victor,
> >> >>
> >> >> Thanks for the reply.
> >> >>
> >> >> JR1.
> >> >> 1)  "So while it seems to be significant that we tripled the number
> of
> >> >> PUTs, cost-wise it doesn't seem to be significant."
> >> >> Let's compare the savings achieved by replacing network replication
> >> >> transfer with S3 puts in AWS.
> >> >> network transfer cost: $0.02/GB = $2 * 10^-5/MB
> >> >> S3 put cost: $0.005 per 1000 requests = $0.5 * 10^-5/request
> >> >>
> >> >> The KIP batches data up to 4MB. So, let's assume that we write 2MB S3
> >> >> objects on average.
> >> >>
> >> >> The cost for transferring 2MB through the network is 2 * 2 * 10^-5 =
> >> $4*
> >> >> 10^-5
> >> >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings
> >> are
> >> >> about 75%.
> >> >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings
> >> are
> >> >> 25%. As you can see, the savings are significantly lower.
> >> >>
> >> >> 2) "Therefore we could expect classic local segments to be present
> >> which
> >> >> could be used for catching up consumers."
> >> >> Note that local storage could be lost on reassigned partitions. In
> that
> >> >> case, lagging reads can only be served from the object store.
> >> >>
> >> >> "Regarding the amount of metadata: 2MB/sec is well below the 2GB/s
> >> >> throughput that Greg calculated previously, so I think it should be
> >> >> manageable for a cluster with that amount of throughput,"
> >> >> It seems that you didn't make the correct comparison. 2GB/s that Greg
> >> >> mentioned is the throughput for the whole cluster. The 2MB/sec I
> >> quoted is
> >> >> for a specific broker. Depending on the broker instance type, a
> broker
> >> may
> >> >> only be able to handle low 10s of MB/sec of data. So, 2MB/sec
> overhead
> >> is
> >> >> significant.
> >> >>
> >> >> 3) "I'd separate it from the discussion of diskless core and perhaps
> we
> >> >> could address it in a separate KIP as it is mostly a redesign of the
> >> >> RLMM."
> >> >> Those problems don't exist in the existing usage of RLMM. They
> manifest
> >> >> because diskless tries to use RLMM in a way it wasn't designed for
> >> (there
> >> >> is at least a 20X increase in metadata). It would be useful to
> consider
> >> >> whether fixing those problems in RLMM or using a new approach is
> >> >> better. For example, KIP-1164 already introduces a snapshotting
> >> mechanism.
> >> >> Adding another snapshotting mechanism to RLMM seems redundant.
> >> >>
> >> >> JR7. A typical object store supports 3 operations: puts, gets and
> >> lists.
> >> >> Which operations used by diskless can be eventually consistent? I'd
> >> expect
> >> >> that get should always see the result of the latest put.
> >> >>
> >> >> Jun
> >> >>
> >> >> On Mon, Apr 20, 2026 at 8:14 AM Viktor Somogyi-Vass <
> [email protected]
> >> >
> >> >> wrote:
> >> >>
> >> >> > Hi Jun,
> >> >> >
> >> >> > I'd like to add my thoughts too until Greg has time to respond.
> >> >> >
> >> >> > JR1. I also think there are shortcomings in the current tiered
> >> storage
> >> >> > design, around the RLMM.
> >> >> > 1) I think this is a correct observation, however if my
> calculations
> >> are
> >> >> > correct, it actually comes down to a negligible amount of cost.
> >> Taking
> >> >> the
> >> >> > AWS pricing sheet at
> >> >> >
> >> >>
> >>
> https://aws.amazon.com/s3/pricing/?nc2=h_pr_s3&trk=aebc39a1-139c-43bb-8354-211ac811b83a&sc_channel=ps
> >> >> > it seems like the difference between 6 or 2 PUTs per second is ~$52
> >> for
> >> >> a
> >> >> > month. The calculation follows
> >> >> > as: 6*60*60*24*30*0.005/1000-2*60*60*24*30*0.005/1000=$51.84. So
> >> while
> >> >> it
> >> >> > seems to be significant that we tripled the number of PUTs,
> >> cost-wise it
> >> >> > doesn't seem to be significant.
> >> >> > 2) Reflecting to your original problem: the tiered storage
> >> consolidation
> >> >> > process should be continuously running and transforming WAL
> segments
> >> >> into
> >> >> > classic logs. Therefore we could expect classic local segments to
> be
> >> >> > present which could be used for catching up consumers. So they
> would
> >> >> only
> >> >> > switch to WAL reading when they're close to the end of the log.
> Since
> >> >> this
> >> >> > offset space should be cached, the reads from there should be fast.
> >> >> > Regarding the amount of metadata: 2MB/sec is well below the 2GB/s
> >> >> > throughput that Greg calculated previously, so I think it should be
> >> >> > manageable for a cluster with that amount of throughput, although I
> >> >> agree
> >> >> > with your comment that the current topic based tiered metadata
> >> manager
> >> >> > isn't optimal and we could develop a better solution.
> >> >> > 3) Tied to the previous point, I agree that your comments are
> >> absolutely
> >> >> > valid, however similarly to that, I'd separate it from the
> >> discussion of
> >> >> > diskless core and perhaps we could address it in a separate KIP as
> >> it is
> >> >> > mostly a redesign of the RLMM.
> >> >> >
> >> >> > JR2. Ack. We will raise a KIP in the near future.
> >> >> >
> >> >> > JR3. I'd leave answering this to Greg as I don't have too much
> >> context
> >> >> on
> >> >> > this one.
> >> >> >
> >> >> > JR7. I think this could be similar to the tiered storage design, so
> >> any
> >> >> > coordinator operation should be strongly consistent (since we're
> >> using
> >> >> > classic topics there). Therefore the WAL segment storage layer
> could
> >> be
> >> >> > eventually consistent as we store its metadata in a strongly
> >> consistent
> >> >> > manner. I'm not sure though if this was the answer you're looking
> >> for?
> >> >> >
> >> >> > Best,
> >> >> > Viktor
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Thu, Mar 26, 2026 at 11:43 PM Jun Rao via dev <
> >> [email protected]>
> >> >> > wrote:
> >> >> >
> >> >> >> Hi, Greg,
> >> >> >>
> >> >> >> Thanks for the reply.
> >> >> >>
> >> >> >> JR1. Rolling log segments every 15 minutes addresses the 3
> concerns
> >> I
> >> >> >> listed, but it introduces some new issues because it doesn't quite
> >> fit
> >> >> the
> >> >> >> design of the current tiered storage. (a) The current tiered
> storage
> >> >> >> design
> >> >> >> stores a single partition per object. If we roll a log segment
> >> every 15
> >> >> >> minutes, with 4K partitions per broker, this means an additional 4
> >> S3
> >> >> puts
> >> >> >> per second. The diskless design aims for 2 S3 puts per second. So,
> >> this
> >> >> >> triples the S3 put cost and reduces the savings benefits. (b) With
> >> Tier
> >> >> >> storage, each broker essentially needs to read the tier metadata
> >> from
> >> >> all
> >> >> >> tier metadata partitions if the number of user partitions exceeds
> >> 50.
> >> >> >> Assuming that we generate 100 bytes of tier metadata per partition
> >> >> every
> >> >> >> 15
> >> >> >> minutes. Assuming that each broker has 4K partitions and a cluster
> >> of
> >> >> 500
> >> >> >> brokers. Each broker needs to receive tier metadata at a rate of
> >> 100 *
> >> >> 4K
> >> >> >> *
> >> >> >> 500 / (15 * 60) = 200KB/Sec. For a broker hosting one of the 50
> tier
> >> >> >> metadata topic partitions, it needs to send out metadata at 100 *
> >> 4K *
> >> >> 500
> >> >> >> / 50 * 500 / (15 * 60) = 2MB/Sec. This increases unnecessary
> network
> >> >> and
> >> >> >> CPU overhead. (c) Tier storage doesn't support snapshots. A
> >> restarted
> >> >> >> broker needs to replay the tier metadata log from the beginning to
> >> >> build
> >> >> >> the tier metadata state. Suppose that the tier metadata log is
> kept
> >> >> for 7
> >> >> >> days. The total amount of tier metadata that needs to be replayed
> is
> >> >> 200KB
> >> >> >> * 7 * 24 * 3600 = 120GB.
> >> >> >> Does the merging optimization you mentioned address those new
> >> >> concerns? If
> >> >> >> so, could you describe how it works?
> >> >> >>
> >> >> >> JR2. It's fine to cover the default partition assignment strategy
> >> for
> >> >> >> diskless topics in a separate KIP. However, since this is
> essential
> >> for
> >> >> >> achieving the cost saving goal, we need a solution before
> releasing
> >> the
> >> >> >> diskless KIP.
> >> >> >>
> >> >> >> JR3. Sounds good. Could you document how this work?
> >> >> >>
> >> >> >> JR7. Could you describe which parts of the operation can be
> >> eventually
> >> >> >> consistent?
> >> >> >>
> >> >> >> Jun
> >> >> >>
> >> >> >> On Thu, Mar 19, 2026 at 1:35 PM Greg Harris <[email protected]
> >
> >> >> wrote:
> >> >> >>
> >> >> >> > Hi Jun,
> >> >> >> >
> >> >> >> > Thanks for your comments!
> >> >> >> >
> >> >> >> > JR1:
> >> >> >> > You are correct that the segment rolling configurations are
> >> currently
> >> >> >> > critical to balance the scalability of Diskless and Tiered
> >> Storage,
> >> >> as
> >> >> >> > larger roll configurations benefit tiered storage, and smaller
> >> roll
> >> >> >> > configurations benefit Diskless.
> >> >> >> >
> >> >> >> > To address your points specifically:
> >> >> >> > (1) A Diskless topic which is cost-competitive with an
> equivalent
> >> >> >> Classic
> >> >> >> > topic will have a metadata size <1% of the data size. A cluster
> >> >> storing
> >> >> >> > 360GB of metadata will have >36TB of data under management and a
> >> >> >> retention
> >> >> >> > of 5hr implies a throughput of >2GB/s. This will require
> multiple
> >> >> >> Diskless
> >> >> >> > coordinators, which can share the load of storing the Diskless
> >> >> metadata,
> >> >> >> > and serving Diskless requests.
> >> >> >> > (2) Catching up consumers are intended to be served from tiered
> >> >> storage
> >> >> >> > and local segment caches. Brokers which are building their local
> >> >> segment
> >> >> >> > caches will have to read many files, but will amortize those
> >> reads by
> >> >> >> > receiving data for multiple partitions in a single read.
> >> >> >> > (3) This is a fundamental downside of storing data from multiple
> >> >> topics
> >> >> >> in
> >> >> >> > a single object, similar to classic segments. We can implement a
> >> >> >> > configurable cluster-wide maximum roll time, which would set the
> >> >> slowest
> >> >> >> > cadence at which Tiered Storage segments are rolled from
> Diskless
> >> >> >> segments.
> >> >> >> > If an individual partition has more aggressive roll settings, it
> >> may
> >> >> be
> >> >> >> > rolled earlier.
> >> >> >> > This configuration would permit the cluster operator to
> >> approximately
> >> >> >> > bound the number of diskless WAL segments, which bounds the
> total
> >> >> size
> >> >> >> of
> >> >> >> > the WAL segments, disk cache, diskless coordinator state, and
> >> >> excessive
> >> >> >> > retention window. For example, a diskless.segment.ms
> >> of 15 minutes
> >> >> >> would
> >> >> >> > reduce the metadata storage to 18GB, WAL segments to 1.8TB, and
> >> >> permit
> >> >> >> > short-retention data to be physically deleted as soon as ~15
> >> minutes
> >> >> >> after
> >> >> >> > being produced.
> >> >> >> > Of course, this will reduce the size of the tiered storage
> >> segments
> >> >> for
> >> >> >> > topics that have low throughput, and where segment.ms
> >> >
> >> >> >> > diskless.segment.ms,
> >> increasing overhead in the RLMM. We can perform
> >> >> >> > merging/optimization of Tiered Storage segments to achieve the
> >> >> per-topic
> >> >> >> > segment.ms
> >> .
> >> >> >> > There were some reasons why we retracted the prior file-merging
> >> >> >> approach,
> >> >> >> > and why merging in tiered storage appears better:
> >> >> >> > * Rewriting files requires mutability for existing data, which
> >> adds
> >> >> >> > complexity. Diskless batches or Remote Log Segments would need
> to
> >> be
> >> >> >> made
> >> >> >> > mutable, and the remote log will be made mutable in KIP-1272 [1]
> >> >> >> > * Because a WAL Segment can contain batches from multiple
> Diskless
> >> >> >> > Coordinators, multiple coordinators must also be involved in the
> >> >> merging
> >> >> >> > step. The Tiered Storage design has exclusive ownership for
> remote
> >> >> log
> >> >> >> > segments within the RLMM.
> >> >> >> > * Diskless file merging competes for resources with
> >> latency-sensitive
> >> >> >> > producers and hot consumers. Tiered storage file merging
> competes
> >> for
> >> >> >> > resources with lagging consumers, which are typically less
> latency
> >> >> >> > sensitive.
> >> >> >> > * Implementing merging in Tiered Storage allows this
> optimization
> >> to
> >> >> >> > benefit both classic topics and diskless topics, covering both
> >> high
> >> >> and
> >> >> >> low
> >> >> >> > throughput partitions.
> >> >> >> > * Remote log segments may be optimized over much longer time
> >> windows
> >> >> >> > rather than performing optimization once in the first few hours
> of
> >> >> the
> >> >> >> life
> >> >> >> > of a WAL segment and then freezing the arrangement of the data
> >> until
> >> >> it
> >> >> >> is
> >> >> >> > deleted.
> >> >> >> > * File merging will need to rely on heuristics, which should be
> >> >> >> > configurable by the user. Multi-partition heuristics are more
> >> >> >> complicated
> >> >> >> > to describe and reason about than single-partition heuristics.
> >> >> >> > What do you think of this alternative?
> >> >> >> >
> >> >> >> > JR2:
> >> >> >> > Yes, the current default partition assignment strategy will need
> >> some
> >> >> >> > improvement. This problem with Diskless WAL segments is
> analogous
> >> to
> >> >> the
> >> >> >> > Classic topics’ dense inter-broker connection graph.
> >> >> >> > The natural solution to this seems to be some sort of cellular
> >> >> design,
> >> >> >> > where the replica placements tend to locate partitions in
> similar
> >> >> >> groups.
> >> >> >> > Partitions in the same cell can generally share the same WAL
> >> Segments
> >> >> >> and
> >> >> >> > the same Diskless Coordinator requests. This would also benefit
> >> >> Classic
> >> >> >> > topics, which would need fewer connections and fetch requests.
> >> >> >> > Such a feature is out-of-scope of this KIP, and either we will
> >> >> publish a
> >> >> >> > follow-up KIP, or let operators and community tooling address
> >> this.
> >> >> >> >
> >> >> >> > JR3:
> >> >> >> > Yes we will replace the ISR/ELR election logic for diskless
> >> topics,
> >> >> as
> >> >> >> > they no longer rely on replicas for data integrity. We will
> fully
> >> >> model
> >> >> >> the
> >> >> >> > state/lifecycle of the diskless replicas in KRaft, and choose
> how
> >> we
> >> >> >> > display this to clients.
> >> >> >> > For backwards compatibility, clients using older metadata
> requests
> >> >> >> should
> >> >> >> > see diskless topics, but interpret them as classic topics. We
> >> could
> >> >> tell
> >> >> >> > older clients that the leader is in the ISR, even if it just
> >> started
> >> >> >> > building its cache.
> >> >> >> > For clients using the latest metadata, they should see the true
> >> >> state of
> >> >> >> > the diskless partition: which nodes can accept
> >> >> produce/fetch/sharefetch
> >> >> >> > requests, which ranges of offsets are cached on-broker, etc.
> This
> >> >> could
> >> >> >> > also be used to break apart the “leader” field into more
> granular
> >> >> >> fields,
> >> >> >> > now that leadership has changed meaning.
> >> >> >> >
> >> >> >> > JR4:
> >> >> >> > Yes, we can replace the empty fetch requests to the leader nodes
> >> >> >> > with cache hint fields in the requests to the Diskless
> >> >> >> > Coordinator, and rely on the coordinator to distribute cache hints
> >> >> >> > to all replicas. This should be low-overhead, and eliminate the
> >> >> >> > inter-broker communication for brokers which only host Diskless
> >> >> >> > topics.
> >> >> >> >
> >> >> >> > JR5.1:
> >> >> >> > You are correct and this text was ambiguous, only specifying that
> >> >> >> > the controller waits for the sync to be complete. This section is
> >> >> >> > now updated to explicitly say that local segments are built from
> >> >> >> > object storage.
> >> >> >> >
> >> >> >> > JR5.2:
> >> >> >> > Extending the JR2 discussion, reassignment of diskless topics
> >> >> >> > would generally happen within a cell, where the marginal cost of
> >> >> >> > reading an additional partition is very low. When cells are
> >> >> >> > re-balanced and a partition is migrated between cells, there is a
> >> >> >> > brief time (until the next Tiered Storage segment roll) when the
> >> >> >> > marginal cost is doubled. This should be infrequent and
> >> >> >> > well-amortized by other topics which aren’t being re-balanced
> >> >> >> > between cells.
> >> >> >> >
> >> >> >> > JR6.1:
> >> >> >> > We plan to move data from Diskless to Tiered Storage. Once the
> >> >> >> > data is in Tiered Storage, it can be compacted using the
> >> >> >> > functionality described in KIP-1272 [1].
> >> >> >> >
> >> >> >> > JR6.2:
> >> >> >> > We will add details for this soon.
> >> >> >> >
> >> >> >> > JR7:
> >> >> >> > We specify the requirement of eventual consistency to allow
> >> >> >> > Diskless Topics to be used with other object storage
> >> >> >> > implementations which aren’t the three major public clouds, such
> >> >> >> > as self-managed software or weaker consistency caches.
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Greg
> >> >> >> >
> >> >> >> > [1]
> >> >> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage
> >> >> >> >
> >> >> >> > On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev
> >> >> >> > <[email protected]> wrote:
> >> >> >> >
> >> >> >> >> Hi, Ivan,
> >> >> >> >>
> >> >> >> >> Thanks for the KIP. A few comments below.
> >> >> >> >>
> >> >> >> >> JR1. I am concerned about the usage of the current tiered
> >> >> >> >> storage to control the number of small WAL files. Current tiered
> >> >> >> >> storage only tiers the data when a segment rolls, which can take
> >> >> >> >> hours. This causes three problems. (1) Much more metadata needs
> >> >> >> >> to be stored and maintained, which increases the cost. Suppose
> >> >> >> >> that each segment rolls every 5 hours, each partition generates
> >> >> >> >> 2 WAL files per second, and each WAL file's metadata takes 100
> >> >> >> >> bytes. Each partition will generate 5 * 3.6K * 2 * 100 = 3.6MB
> >> >> >> >> of metadata. In a cluster with 100K partitions, this translates
> >> >> >> >> to 360GB of metadata stored on the diskless coordinators. (2) A
> >> >> >> >> catching-up consumer's performance degrades since it's forced to
> >> >> >> >> read data from many small WAL files. (3) The data in WAL files
> >> >> >> >> could be retained much longer than the retention time. Since the
> >> >> >> >> small WAL files aren't completely deleted until all partitions'
> >> >> >> >> data in them is obsolete, the deletion of the WAL files could be
> >> >> >> >> delayed by hours or more. If a WAL file includes a partition
> >> >> >> >> with a low retention time, the retention contract could be
> >> >> >> >> violated significantly. The earlier design of the KIP included a
> >> >> >> >> separate object merging process that combines small WAL files
> >> >> >> >> much more aggressively than tiered storage, which seems to be a
> >> >> >> >> much better choice.
> >> >> >> >>
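[Editor's sketch: the JR1 back-of-envelope estimate can be reproduced
directly. All inputs below are the assumptions stated in the message
(roll interval, WAL rate, metadata size), not measured values.]

```python
# Back-of-envelope check of the JR1 metadata estimate.
# Inputs are the assumptions stated in the message, not measurements.
roll_interval_s = 5 * 3600         # segment rolls every 5 hours
wal_files_per_sec = 2              # WAL files per partition per second
metadata_bytes_per_file = 100      # coordinator metadata per WAL file

# Metadata accumulated per partition before a segment roll tiers it away.
per_partition = roll_interval_s * wal_files_per_sec * metadata_bytes_per_file
print(per_partition)               # 3_600_000 bytes ~= 3.6 MB

# Scaled to a 100K-partition cluster.
num_partitions = 100_000
cluster_total = per_partition * num_partitions
print(cluster_total)               # 360_000_000_000 bytes ~= 360 GB
```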
> >> >> >> >> JR2. I don't think the current default partition assignment
> >> >> >> >> strategy for classic topics works for diskless topics. The
> >> >> >> >> current strategy tries to spread the replicas over as many
> >> >> >> >> brokers as possible. For example, if a broker has 100
> >> >> >> >> partitions, their replicas could be spread over 100 brokers. If
> >> >> >> >> the broker generates a WAL file with 100 partitions, this WAL
> >> >> >> >> file will be read 100 times, once by each broker. The S3 read
> >> >> >> >> cost is 1/12 of the cost of an S3 put. This assignment strategy
> >> >> >> >> will increase the S3 cost by about 8X, which is prohibitive. We
> >> >> >> >> need to design a cost-effective assignment strategy for
> >> >> >> >> diskless topics.
> >> >> >> >>
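[Editor's sketch: the ~8X figure in JR2 follows from the stated 1:12
GET-to-PUT price ratio. The costs below are normalized ratios, not real
S3 list prices.]

```python
# Rough model of the JR2 read-amplification cost, under the stated
# assumption that an S3 GET costs 1/12 of an S3 PUT.
put_cost = 1.0              # normalized cost of writing one WAL file
get_cost = put_cost / 12    # cost of one read of that file
readers = 100               # brokers that each read the WAL file once

total = put_cost + readers * get_cost
extra_factor = (total - put_cost) / put_cost
print(round(extra_factor, 2))   # 8.33, i.e. roughly an 8X cost increase
```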
> >> >> >> >> JR3. We need to think through the leader election logic for
> >> >> >> >> diskless topics. The KIP tries to reuse the ISR logic for
> >> >> >> >> classic topics, but it doesn't seem very natural.
> >> >> >> >> JR3.1 In a classic topic, the leader is always in the ISR. In a
> >> >> >> >> diskless topic, the KIP says that a leader could be out of sync.
> >> >> >> >> JR3.2 The existing leader election logic based on ISR/ELR
> >> >> >> >> mainly tries to preserve previously acknowledged data. With
> >> >> >> >> diskless topics, since the object store provides durability,
> >> >> >> >> this logic seems no longer needed. The existing min.isr and
> >> >> >> >> unclean leader election logic also don't apply.
> >> >> >> >>
> >> >> >> >> JR4. "Despite that there is no inter-broker replication,
> >> >> >> >> replicas will still issue FetchRequest to leaders. Leaders will
> >> >> >> >> respond with empty (no records) FetchResponse."
> >> >> >> >> This seems unnatural. Could we avoid issuing inter-broker fetch
> >> >> >> >> requests for diskless topics?
> >> >> >> >>
> >> >> >> >> JR5. "The replica reassignment will follow the same flow as in
> >> >> >> >> classic topic:"
> >> >> >> >> JR5.1 Is this true? Since the inter-broker fetch response is
> >> >> >> >> always empty, it doesn't seem the current reassignment flow
> >> >> >> >> works for diskless topics. Also, since the source of the data
> >> >> >> >> is the object store, it seems more natural for a replica to
> >> >> >> >> backfill the data from the object store, instead of from other
> >> >> >> >> replicas. This will also incur lower costs.
> >> >> >> >> JR5.2 How do we prevent reassignment on diskless topics from
> >> >> >> >> causing the same cost issue described in JR2?
> >> >> >> >>
> >> >> >> >> JR6. "In other functional aspects, diskless topics are
> >> >> >> >> indistinguishable from classic topics. This includes durability
> >> >> >> >> guarantees, ordering guarantees, transactional and
> >> >> >> >> non-transactional producer API, consumer API, consumer groups,
> >> >> >> >> share groups, data retention (deletion & compact),"
> >> >> >> >> JR6.1 Could you describe how compacted diskless topics are
> >> >> >> >> supported?
> >> >> >> >> JR6.2 Neither this KIP nor KIP-1164 describes the transactional
> >> >> >> >> support in detail.
> >> >> >> >>
> >> >> >> >> JR7. "Object Storage: A shared, durable, concurrent, and
> >> >> >> >> eventually consistent storage supporting arbitrary sized byte
> >> >> >> >> values and a minimal set of atomic operations: put, delete,
> >> >> >> >> list, and ranged get."
> >> >> >> >> It seems that the object storage in all three major public
> >> >> >> >> clouds is strongly consistent.
> >> >> >> >>
> >> >> >> >> Jun
> >> >> >> >>
> >> >> >> >> On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko <[email protected]>
> >> >> >> >> wrote:
> >> >> >> >>
> >> >> >> >> > Hi all,
> >> >> >> >> >
> >> >> >> >> > The parent KIP-1150 was voted on and accepted. Let's now
> >> >> >> >> > focus on the technical details presented in this KIP-1163
> >> >> >> >> > and also in KIP-1164: Diskless Coordinator [1].
> >> >> >> >> >
> >> >> >> >> > Best,
> >> >> >> >> > Ivan
> >> >> >> >> >
> >> >> >> >> > [1]
> >> >> >> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator
> >> >> >> >> >
> >> >> >> >> > On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:
> >> >> >> >> > > Hi all!
> >> >> >> >> > >
> >> >> >> >> > > We want to start the discussion thread for KIP-1163:
> >> >> >> >> > > Diskless Core [1], which is a sub-KIP for KIP-1150 [2].
> >> >> >> >> > >
> >> >> >> >> > > Let's use the main KIP-1150 discussion thread [3] for
> >> >> >> >> > > high-level questions, motivation, and the general direction
> >> >> >> >> > > of the feature, and this thread for particular
> >> >> >> >> > > implementation details.
> >> >> >> >> > >
> >> >> >> >> > > Best,
> >> >> >> >> > > Ivan
> >> >> >> >> > >
> >> >> >> >> > > [1]
> >> >> >> >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
> >> >> >> >> > > [2]
> >> >> >> >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> >> >> >> >> > > [3]
> >> >> >> >> > > https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d