RLMM was not designed for aggressively copying the latest data to tiered
storage via small segment rollovers.
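For reference, the RLMM metadata-volume arithmetic Jun walks through downthread (100 bytes of tier metadata per partition per 15-minute roll, 4K partitions per broker, 500 brokers, 7-day retention) can be reproduced with a short sketch; all inputs are the thread's assumptions, not measurements:

```python
# Reproduces the tier-metadata arithmetic from Jun's points (b) and (c).
# All inputs are assumptions quoted in the thread, not measurements.
BYTES_PER_PARTITION_PER_ROLL = 100   # tier metadata per partition per roll
PARTITIONS_PER_BROKER = 4_000
BROKERS = 500
ROLL_SECONDS = 15 * 60               # 15-minute segment roll
RETENTION_SECONDS = 7 * 24 * 3600    # 7-day tier-metadata retention

# Rate at which every broker must consume tier metadata for the cluster.
ingest_bps = (BYTES_PER_PARTITION_PER_ROLL * PARTITIONS_PER_BROKER * BROKERS
              / ROLL_SECONDS)

# Volume a restarted broker replays to rebuild its tier metadata state.
replay_bytes = ingest_bps * RETENTION_SECONDS

print(f"per-broker ingest: ~{ingest_bps / 1e3:.0f} KB/s")
print(f"restart replay:    ~{replay_bytes / 1e9:.0f} GB")
```

Jun rounds the ingest rate to 200KB/sec before computing the 120GB replay figure; the unrounded numbers (~222KB/sec, ~134GB) land in the same ballpark.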

+1 to Jun on leaving the existing RLMM in place for classic topics with
tiered storage and building the efficient metadata management system that
diskless topics require.
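The replication-vs-S3-put savings arithmetic in Jun's replies below checks out; here is a minimal sketch using the AWS list prices quoted in the thread (a cost model, not a benchmark):

```python
# Savings from replacing cross-network replication traffic with S3 puts,
# using the AWS prices quoted downthread (a model, not a benchmark).
NETWORK_COST_PER_MB = 0.02 / 1000   # $0.02/GB network transfer
S3_PUT_COST = 0.005 / 1000          # $0.005 per 1000 PUT requests

def savings(transfer_mb: float, puts: int) -> float:
    """Fraction of network-transfer cost saved when the same per-second
    volume is written as `puts` S3 objects instead of replicated."""
    network_cost = transfer_mb * NETWORK_COST_PER_MB
    return 1 - (puts * S3_PUT_COST) / network_cost

print(f"4MB/s, 2 puts/s: {savings(4, 2):.1%} saved")   # 87.5%
print(f"4MB/s, 6 puts/s: {savings(4, 6):.1%} saved")   # 62.5%
print(f"2MB/s, 2 puts/s: {savings(2, 2):.1%} saved")   # 75.0%
print(f"2MB/s, 6 puts/s: {savings(2, 6):.1%} saved")   # 25.0%
```

The first two lines match Jun's corrected 4MB/sec figures; the last two match his original 2MB/sec calculation.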


On Tue, 12 May 2026 at 23:59, Jun Rao via dev <[email protected]> wrote:
>
> Hi, Viktor,
>
> Thanks for the reply.
>
> JR1. (A) and (B) Yes, your summary matches my thinking.
> (C) "Generally I think that (i) (ii) (iii) and (iv) may be addressed with
> an aggressive tiered storage consolidation (the first approach)".
> Hmm, I am confused by the above statement. By "the first approach", do you
> mean aggressive tiering with faster segment rolling through the existing
> RLMM? I don't think the existing RLMM is designed to solve these issues due
> to inefficiencies in cost, metadata propagation and metadata storage as we
> previously discussed.
>
> JR11. I was thinking we leave the existing RLMM as is and continue to use
> it for classic topics. We design a new, more efficient metadata management
> component independent of RLMM. This new component will be the only metadata
> component that diskless topics depend on.
>
> Jun
>
> On Tue, May 12, 2026 at 8:43 AM Viktor Somogyi-Vass <[email protected]>
> wrote:
>
> > Hi Jun,
> >
> > JR1
> > (1)-(2)-(3) I'll address these together and explain our current idea for
> > solving the tiny object problem, because I'm not sure we're 100% talking
> > about the same thing. I have two approaches in mind for TS consolidation
> > ((A) and (B)) and I'm not sure we're both assuming the same one, so let's
> > clarify this.
> >
> > (A)
> > This is our current assumption. This uses local disks (create classic
> > local logs with UnifiedLog) to consolidate logs into the classic log format
> > and use RSM and RLMM to store them in tiered storage. This way we're not
> > limited by the need to have short rollovers. Local logs become a form of
> > staging environment to serve reads and accumulate records for tiered
> > storage. This means that:
> >  (a) Once a message is consolidated into the classic log format, we can
> > use it for serving lagging consumers. Diskless reads should really be used
> > for the head of the log and after a few seconds logs should be consolidated.
> >  (b) The real cost is much closer to that 87.5% (and in fact the google
> > sheet I shared also assumes this model) because we have more freedom in
> > choosing the retention parameters of the classic log.
> >  (c) Metadata is smaller as we only need to keep diskless segments until
> > the tiered offset surpasses the individual batches' offset.
> >  (d) RLMM metadata is also somewhat manageable due to the larger segment
> > sizes but it's still possible to run into the metadata explosion problem.
> >  (e) It needs to rebuild this local log on reassignment to serve lagging
> > consumers effectively, so reassignment is a bit messier.
> >  (f) It's not optimal when partitions have a single replica: on failure we
> > can only fall back to diskless mode until the partition is reassigned to a
> > functioning broker.
> >
> > (B)
> > Compared to the above there can be an alternative approach, which is to
> > consolidate when diskless segments expire (after 15 minutes for instance).
> > In that case your points seem to fit better as:
> >  (a) we can only use the classic, consolidated logs to serve lagging
> > consumers after they have been tiered
> >  (b) to be more efficient with lagging consumers we have to stick to a
> > short rollover
> >  (c) it's more costly due to the short rollovers
> >  (d) the RLMM bottleneck still exists due to the short rollovers
> >  (e) it's not a given that we use local disks for transforming logs, as
> > we can do it in memory too (which can be inefficient and more expensive),
> > but perhaps the “chunked transfer encoding” that S3 supports, or similar
> > with other providers, is a cost-effective way. If we know the final size
> > in advance, we can upload data in chunks and still get billed for 1 put.
> >  (f) more efficient reassignment or failover is cleaner and faster as
> > there isn't a need to rebuild local caches.
> >
> > (C)
> > Apart from the first 2 approaches there is a 3rd, which is WAL merging. To
> > understand your points, let me summarize what I could gather so far as
> > reasons for WAL merging (and please correct me if I missed something):
> >  (i) protecting consumer lag: small WAL files create inefficient objects
> > for lagging consumers, so larger objects should be more efficient
> >  (ii) avoiding the RLMM replay bottleneck: managing small segments with
> > RLMM is very inefficient (100s of GB metadata)
> >  (iii) reducing batch metadata overhead: merging WAL files may reduce the
> > metadata we need to store, but it depends on the merge algorithm and how we
> > can compact batch data
> >  (iv) cost effectiveness: retrieving merged WAL files reduces the number
> > of get requests to object storage
> >  (v) architectural redundancy with RLMM: ideally we wouldn't need 2
> > solutions to 2 somewhat similar problems (tiered storage and diskless)
> >
> > Generally I think that (i) (ii) (iii) and (iv) may be addressed with an
> > aggressive tiered storage consolidation (the first approach), so the only
> > remaining gap would be (v). I also agree that having 2 different solutions
> > for metadata handling isn't ideal and perhaps there is a possibility of
> > improvement here. It should be possible to redesign RLMM to be more similar
> > to the diskless coordinator or design a common solution.
> >
> > JR11
> > "If we support merging in the diskless coordinator, I wonder how useful
> > RLMM
> > is. It seems simpler to manage all metadata from the object store in a
> > single place."
> >
> > Could you please clarify this a little bit? Do you think that we should
> > replace the RLMM with a solution that is more similar to the diskless
> > coordinator or deprecate tiered storage altogether in favor of diskless?
> > I'm not sure which option you're referring to:
> >  (1) Unify tiered storage and diskless under a single storage layer (and
> > possibly deprecate tiered storage in favor of diskless with merging WAL
> > segments).
> >  (2) Create a smart coordinator instead of RLMM and possibly unify
> > metadata coordination with diskless.
> >  (3) Keep tiered storage and diskless separate with their own solutions
> > for metadata (probably not optimal).
> >
> > Thanks,
> > Viktor
> >
> > On Fri, May 1, 2026 at 9:08 PM Jun Rao via dev <[email protected]>
> > wrote:
> >
> >> Hi, Viktor and Greg,
> >>
> >> Thanks for the reply.
> >>
> >> JR1.
> >> 1) Thanks for verifying the cost estimation. I noticed a bug in my earlier
> >> calculation. I estimated the per broker network transfer rate at 2MB/sec.
> >> It should be 4MB/sec. If I correct it, the estimated savings are similar
> >> to
> >> yours.
> >> The cost for transferring 4MB through the network is 4 * 2 * 10^-5 = $8*
> >> 10^-5
> >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings are
> >> about 87.5%.
> >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings are
> >> 62.5%.
> >> Savings are still significantly lower when using RLMM.
> >>
> >> "To me it seems that Greg's previous suggestion for a 15 min rollover
> >> may be a bit too much. With 1 hour we can achieve better cost saving and
> >> less coordinator metadata being stored."
> >> This solves the cost issue, but it has other implications (see point 2)
> >> below).
> >>
> >> 2) "Yes, I think this is to be expected and a lot depends on the
> >> implementation. Ideally segments or chunks should be cached to minimize
> >> the number of times segments are pulled from remote storage."
> >> In a classic topic, when a consumer lags, its requests are served either
> >> from the local cache or from large objects in the object store. With the
> >> current design in a diskless topic, lagging consumer requests might be
> >> served from tiny 500-byte objects. This will significantly slow down the
> >> consumer's catch-up, which is not expected user behavior. Ideally, we
> >> don't
> >> want those tiny objects to last more than a few minutes, let alone an
> >> hour.
> >>
> >> 3) "I think if my calculations are correct (and we use a 60 minute
> >> window),
> >> then metadata generation should be slower, please see the google sheet I
> >> linked above. I think given that traffic, the current topic based RLMM
> >> should be able to handle it."
> >> Why is a 60 minute window used? RLMM metadata needs to be retained for the
> >> longest retention time among all topics. This means that the retention
> >> window can be weeks instead of 1 hour. This means that RLMM might need to
> >> replay over 100GB of data during reassignment, which is not what it is
> >> designed for.
> >>
> >> JR10. "Your example of 100,000 1kb/s partitions is a borderline case,
> >> where
> >> there are some configurations which are not viable due to scale or cost,
> >> and some that are. It would be up to the operator to tune their cluster,
> >> by
> >> changing diskless.segment.ms,
> >> dividing up the cluster, or switching to a more scalable RLMM
> >> implementation."
> >> A broker with 4MB/sec produce throughput can probably be considered high
> >> throughput. Even with 4K partitions per broker, we could still achieve an
> >> 87.5% cost saving as listed above, if we do the right implementation. So,
> >> ideally, it would be useful to support that as well.
> >>
> >> JR11. "We had a short conversation with Greg and we came to the conclusion
> >> that because of the explosiveness of diskless metadata, it may be worth
> >> revisiting the merging case as it can indeed buy us some more cost saving
> >> for the added complexity. "
> >> If we support merging in the diskless coordinator, I wonder how useful
> >> RLMM
> >> is. It seems simpler to manage all metadata from the object store in a
> >> single place.
> >>
> >> Jun
> >>
> >> On Mon, Apr 27, 2026 at 4:17 PM Greg Harris <[email protected]> wrote:
> >>
> >> > Hi Jun,
> >> >
> >> > Thank you for scrutinizing the scalability of the current
> >> > direct-to-tiered-storage strategy, and its metadata scalability.
> >> >
> >> > One of our implicit assumptions with this design was that users are able
> >> > to choose between the Diskless and Classic mechanisms, and that any
> >> > situations where the Diskless design was deficient, the Classic topics
> >> > could continue to be used.
> >> > This was originally applied to low-latency use-cases, but now also
> >> applies
> >> > to low-throughput use-cases too. When the throughput on a topic is low,
> >> the
> >> > benefit of using Diskless is also low, because it is proportional to the
> >> > amount of data transferred, and it is more likely that the batch
> >> overhead
> >> > of the topics is significant.
> >> > In other words, we've been treating cost-effective support for
> >> arbitrarily
> >> > low throughput topics as a non-goal.
> >> >
> >> > Your example of 100,000 1kb/s partitions is a borderline case, where
> >> there
> >> > are some configurations which are not viable due to scale or cost, and
> >> some
> >> > that are. It would be up to the operator to tune their cluster, by
> >> changing
> >> > diskless.segment.ms,
> >> > dividing up the cluster, or switching to a more scalable RLMM
> >> > implementation.
> >> >
> >> > Do you think we should have cost-effective support for arbitrarily
> >> > low-throughput partitions in Diskless? How much total demand is there in
> >> > partitions where batches are >1kb but the partition throughput is
> >> <1kb/s?
> >> >
> >> > Thanks,
> >> > Greg
> >> >
> >> > On Fri, Apr 24, 2026 at 10:23 AM Viktor Somogyi-Vass <[email protected]
> >> >
> >> > wrote:
> >> >
> >> >> Hi Jun,
> >> >>
> >> >> Regarding JR1.
> >> >> We had a short conversation with Greg and we came to the conclusion
> >> that
> >> >> because of the explosiveness of diskless metadata, it may be worth
> >> >> revisiting the merging case as it can indeed buy us some more cost
> >> saving
> >> >> for the added complexity. Also, it would support smaller topics and we
> >> >> could somewhat manage the tiered storage consolidation costs. I think
> >> that
> >> >> we would still need to consolidate WAL segments into tiered storage.
> >> >> Reasons are: to limit WAL metadata, to be able to dynamically
> >> >> enable/disable diskless and to be compatible with existing and future
> >> TS
> >> >> improvements.
> >> >> I'll try to refresh KIP-1165 and build it into the calculator above (if
> >> >> it's possible at all :) ) and come back to you.
> >> >> Regardless, I just wanted to give a short update in the meantime,
> >> looking
> >> >> forward to your answer.
> >> >>
> >> >> Best,
> >> >> Viktor
> >> >>
> >> >> On Fri, Apr 24, 2026 at 3:46 PM Viktor Somogyi-Vass <
> >> >> [email protected]>
> >> >> wrote:
> >> >>
> >> >> > Hi Jun,
> >> >> >
> >> >> > Thanks for the quick reply.
> >> >> >
> >> >> > JR1.
> >> >> > 1) Thanks for putting the numbers together. While your calculation
> >> >> > seems to be correct in the sense that 6 PUTs would worsen the cost
> >> >> saving
> >> >> > benefits, I think that in a byte for byte comparison there is a
> >> bigger
> >> >> > difference. The reason is that the 4 tiered storage puts transfer
> >> much
> >> >> more
> >> >> > data compared to the small WAL segments, so in practice there should
> >> be
> >> >> > fewer TS puts.
> >> >> > I made a google sheet calculator for this which I'd like to share
> >> with
> >> >> > you:
> >> >> >
> >> >>
> >> https://docs.google.com/spreadsheets/d/127GOTWfFSN27B5ezif14GPj8KtrghjBqsXG9GG6NxhI/edit?gid=749470906#gid=749470906
> >> >> > Please copy the sheet to modify the values.
> >> >> > About my findings: I was trying to create a similar cluster model
> >> that
> >> >> has
> >> >> > been discussed here previously to see how cost varies over different
> >> >> > segment rollovers. To me it seems that Greg's previous suggestion
> >> >> > for a 15 min rollover may be a bit too much. With 1 hour we can
> >> >> > achieve better cost saving and less coordinator metadata being
> >> >> > stored. I have also
> >> >> tried to
> >> >> > account for the producer batch metadata generated by diskless
> >> partitions
> >> >> > but to me it seems like a lower number than Greg's original numbers.
> >> >> >
> >> >> > 2) "Note that local storage could be lost on reassigned partitions.
> >> In
> >> >> > that case, lagging reads can only be served from the object store."
> >> >> > Yes, I think this is to be expected and a lot depends on the
> >> >> > implementation. Ideally segments or chunks should be cached to
> >> >> > minimize the number of times segments are pulled from remote storage.
> >> >> >
> >> >> > "The 2MB/sec I quoted is for a specific broker. Depending on the
> >> broker
> >> >> > instance type, a broker may only be able to handle low 10s of MB/sec
> >> of
> >> >> > data. So, 2MB/sec overhead is significant."
> >> >> > Yes, I have indeed misunderstood, however I have updated my
> >> calculator
> >> >> > sheet with metadata calculation. Overall, the number of tiered
> >> storage
> >> >> > segments created seems to be much lower than in your calculations
> >> given
> >> >> the
> >> >> > parameters of the cluster you specified earlier. Please take a look,
> >> I'd
> >> >> > like to really understand the thinking here because this is a crucial
> >> >> point.
> >> >> >
> >> >> > 3) I think if my calculations are correct (and we use a 60 minute
> >> >> window),
> >> >> > then metadata generation should be slower, please see the google
> >> sheet I
> >> >> > linked above. I think given that traffic, the current topic based
> >> RLMM
> >> >> > should be able to handle it.
> >> >> > In the case where we would need to make the RLMM capable of handling
> >> a
> >> >> > similar traffic as the diskless coordinator, then you're right, we
> >> >> probably
> >> >> > should consider how we can improve it. I think there are multiple
> >> >> > possibilities as you mentioned, but ideally there should be a common
> >> >> > implementation for metadata coordination that could handle these
> >> cases.
> >> >> >
> >> >> > JR7.
> >> >> > Yes, your expectation is totally reasonable, we should expect the get
> >> >> and
> >> >> > put operations to be strongly consistent for the read-after-write
> >> >> > scenarios. And I think that since major cloud providers offer
> >> >> > strongly consistent object storage, it should be sufficient for a wide
> >> >> user-group.
> >> >> > So we could shrink the scope of the KIP a bit this way and avoid
> >> adding
> >> >> > complexity that is needed mostly on the margin.
> >> >> > I expect, though, that "list" can stay eventually consistent as the
> >> >> > KIP relies on it only for garbage collection, where it is fine if a
> >> >> > few segments are collected only in the next iteration.
> >> >> >
> >> >> > JR3.
> >> >> > Since Greg hasn't replied yet, I'll try to catch up with him and
> >> >> formulate
> >> >> > an answer next week.
> >> >> >
> >> >> > Best,
> >> >> > Viktor
> >> >> >
> >> >> > On Tue, Apr 21, 2026 at 8:16 PM Jun Rao via dev <
> >> [email protected]>
> >> >> > wrote:
> >> >> >
> >> >> >> Hi, Viktor,
> >> >> >>
> >> >> >> Thanks for the reply.
> >> >> >>
> >> >> >> JR1.
> >> >> >> 1)  "So while it seems to be significant that we tripled the number
> >> of
> >> >> >> PUTs, cost-wise it doesn't seem to be significant."
> >> >> >> Let's compare the savings achieved by replacing network replication
> >> >> >> transfer with S3 puts in AWS.
> >> >> >> network transfer cost: $0.02/GB = $2 * 10^-5/MB
> >> >> >> S3 put cost: $0.005 per 1000 requests = $0.5 * 10^-5/request
> >> >> >>
> >> >> >> The KIP batches data up to 4MB. So, let's assume that we write 2MB
> >> S3
> >> >> >> objects on average.
> >> >> >>
> >> >> >> The cost for transferring 2MB through the network is 2 * 2 * 10^-5 =
> >> >> $4*
> >> >> >> 10^-5
> >> >> >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings
> >> >> are
> >> >> >> about 75%.
> >> >> >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings
> >> >> are
> >> >> >> 25%. As you can see, the savings are significantly lower.
> >> >> >>
> >> >> >> 2) "Therefore we could expect classic local segments to be present
> >> >> which
> >> >> >> could be used for catching up consumers."
> >> >> >> Note that local storage could be lost on reassigned partitions. In
> >> that
> >> >> >> case, lagging reads can only be served from the object store.
> >> >> >>
> >> >> >> "Regarding the amount of metadata: 2MB/sec is well below the 2GB/s
> >> >> >> throughput that Greg calculated previously, so I think it should be
> >> >> >> manageable for a cluster with that amount of throughput,"
> >> >> >> It seems that you didn't make the correct comparison. 2GB/s that
> >> Greg
> >> >> >> mentioned is the throughput for the whole cluster. The 2MB/sec I
> >> >> quoted is
> >> >> >> for a specific broker. Depending on the broker instance type, a
> >> broker
> >> >> may
> >> >> >> only be able to handle low 10s of MB/sec of data. So, 2MB/sec
> >> overhead
> >> >> is
> >> >> >> significant.
> >> >> >>
> >> >> >> 3) "I'd separate it from the discussion of diskless core and
> >> perhaps we
> >> >> >> could address it in a separate KIP as it is mostly a redesign of the
> >> >> >> RLMM."
> >> >> >> Those problems don't exist in the existing usage of RLMM. They
> >> manifest
> >> >> >> because diskless tries to use RLMM in a way it wasn't designed for
> >> >> (there
> >> >> >> is at least a 20X increase in metadata). It would be useful to
> >> consider
> >> >> >> whether fixing those problems in RLMM or using a new approach is
> >> >> >> better. For example, KIP-1164 already introduces a snapshotting
> >> >> mechanism.
> >> >> >> Adding another snapshotting mechanism to RLMM seems redundant.
> >> >> >>
> >> >> >> JR7. A typical object store supports 3 operations: puts, gets and
> >> >> lists.
> >> >> >> Which operations used by diskless can be eventually consistent? I'd
> >> >> expect
> >> >> >> that get should always see the result of the latest put.
> >> >> >>
> >> >> >> Jun
> >> >> >>
> >> >> >> On Mon, Apr 20, 2026 at 8:14 AM Viktor Somogyi-Vass <
> >> [email protected]
> >> >> >
> >> >> >> wrote:
> >> >> >>
> >> >> >> > Hi Jun,
> >> >> >> >
> >> >> >> > I'd like to add my thoughts too until Greg has time to respond.
> >> >> >> >
> >> >> >> > JR1. I also think there are shortcomings in the current tiered
> >> >> storage
> >> >> >> > design, around the RLMM.
> >> >> >> > 1) I think this is a correct observation, however if my
> >> calculations
> >> >> are
> >> >> >> > correct, it actually comes down to a negligible amount of cost.
> >> >> Taking
> >> >> >> the
> >> >> >> > AWS pricing sheet at
> >> >> >> >
> >> >> >>
> >> >>
> >> https://aws.amazon.com/s3/pricing/?nc2=h_pr_s3&trk=aebc39a1-139c-43bb-8354-211ac811b83a&sc_channel=ps
> >> >> >> > it seems like the difference between 6 or 2 PUTs per second is
> >> ~$52
> >> >> for
> >> >> >> a
> >> >> >> > month. The calculation follows
> >> >> >> > as: 6*60*60*24*30*0.005/1000-2*60*60*24*30*0.005/1000=$51.84. So
> >> >> while
> >> >> >> it
> >> >> >> > seems to be significant that we tripled the number of PUTs,
> >> >> cost-wise it
> >> >> >> > doesn't seem to be significant.
> >> >> >> > 2) Reflecting on your original problem: the tiered storage
> >> >> consolidation
> >> >> >> > process should be continuously running and transforming WAL
> >> segments
> >> >> >> into
> >> >> >> > classic logs. Therefore we could expect classic local segments to
> >> be
> >> >> >> > present which could be used for catching up consumers. So they
> >> would
> >> >> >> only
> >> >> >> > switch to WAL reading when they're close to the end of the log.
> >> Since
> >> >> >> this
> >> >> >> > offset space should be cached, the reads from there should be
> >> fast.
> >> >> >> > Regarding the amount of metadata: 2MB/sec is well below the 2GB/s
> >> >> >> > throughput that Greg calculated previously, so I think it should
> >> be
> >> >> >> > manageable for a cluster with that amount of throughput, although
> >> I
> >> >> >> agree
> >> >> >> > with your comment that the current topic based tiered metadata
> >> >> manager
> >> >> >> > isn't optimal and we could develop a better solution.
> >> >> >> > 3) Tied to the previous point, I agree that your comments are
> >> >> absolutely
> >> >> >> > valid, however similarly to that, I'd separate it from the
> >> >> discussion of
> >> >> >> > diskless core and perhaps we could address it in a separate KIP as
> >> >> it is
> >> >> >> > mostly a redesign of the RLMM.
> >> >> >> >
> >> >> >> > JR2. Ack. We will raise a KIP in the near future.
> >> >> >> >
> >> >> >> > JR3. I'd leave answering this to Greg as I don't have too much
> >> >> context
> >> >> >> on
> >> >> >> > this one.
> >> >> >> >
> >> >> >> > JR7. I think this could be similar to the tiered storage design,
> >> so
> >> >> any
> >> >> >> > coordinator operation should be strongly consistent (since we're
> >> >> using
> >> >> >> > classic topics there). Therefore the WAL segment storage layer
> >> could
> >> >> be
> >> >> >> > eventually consistent as we store its metadata in a strongly
> >> >> consistent
> >> >> > manner. I'm not sure, though, if this was the answer you were
> >> >> > looking for.
> >> >> >> >
> >> >> >> > Best,
> >> >> >> > Viktor
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > On Thu, Mar 26, 2026 at 11:43 PM Jun Rao via dev <
> >> >> [email protected]>
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> >> Hi, Greg,
> >> >> >> >>
> >> >> >> >> Thanks for the reply.
> >> >> >> >>
> >> >> >> >> JR1. Rolling log segments every 15 minutes addresses the 3
> >> concerns
> >> >> I
> >> >> >> >> listed, but it introduces some new issues because it doesn't
> >> quite
> >> >> fit
> >> >> >> the
> >> >> >> >> design of the current tiered storage. (a) The current tiered
> >> storage
> >> >> >> >> design
> >> >> >> >> stores a single partition per object. If we roll a log segment
> >> >> every 15
> >> >> >> >> minutes, with 4K partitions per broker, this means an additional
> >> 4
> >> >> S3
> >> >> >> puts
> >> >> >> >> per second. The diskless design aims for 2 S3 puts per second.
> >> So,
> >> >> this
> >> >> >> >> triples the S3 put cost and reduces the savings benefits. (b)
> >> With
> >> >> Tier
> >> >> >> >> storage, each broker essentially needs to read the tier metadata
> >> >> from
> >> >> >> all
> >> >> >> >> tier metadata partitions if the number of user partitions exceeds
> >> >> 50.
> >> >> >> >> Assuming that we generate 100 bytes of tier metadata per
> >> partition
> >> >> >> every
> >> >> >> >> 15
> >> >> >> >> minutes. Assuming that each broker has 4K partitions and a
> >> cluster
> >> >> of
> >> >> >> 500
> >> >> >> >> brokers. Each broker needs to receive tier metadata at a rate of
> >> >> 100 *
> >> >> >> 4K
> >> >> >> >> *
> >> >> >> >> 500 / (15 * 60) = 200KB/Sec. For a broker hosting one of the 50
> >> tier
> >> >> >> >> metadata topic partitions, it needs to send out metadata at 100 *
> >> >> 4K *
> >> >> >> 500
> >> >> >> >> / 50 * 500 / (15 * 60) = 2MB/Sec. This increases unnecessary
> >> network
> >> >> >> and
> >> >> >> >> CPU overhead. (c) Tier storage doesn't support snapshots. A
> >> >> restarted
> >> >> >> >> broker needs to replay the tier metadata log from the beginning
> >> to
> >> >> >> build
> >> >> >> >> the tier metadata state. Suppose that the tier metadata log is
> >> kept
> >> >> >> for 7
> >> >> >> >> days. The total amount of tier metadata that needs to be
> >> replayed is
> >> >> >> 200KB
> >> >> >> >> * 7 * 24 * 3600 = 120GB.
> >> >> >> >> Does the merging optimization you mentioned address those new
> >> >> >> concerns? If
> >> >> >> >> so, could you describe how it works?
> >> >> >> >>
> >> >> >> >> JR2. It's fine to cover the default partition assignment strategy
> >> >> for
> >> >> >> >> diskless topics in a separate KIP. However, since this is
> >> essential
> >> >> for
> >> >> >> >> achieving the cost saving goal, we need a solution before
> >> releasing
> >> >> the
> >> >> >> >> diskless KIP.
> >> >> >> >>
> >> >> >> >> JR3. Sounds good. Could you document how this works?
> >> >> >> >>
> >> >> >> >> JR7. Could you describe which parts of the operation can be
> >> >> eventually
> >> >> >> >> consistent?
> >> >> >> >>
> >> >> >> >> Jun
> >> >> >> >>
> >> >> >> >> On Thu, Mar 19, 2026 at 1:35 PM Greg Harris <
> >> [email protected]>
> >> >> >> wrote:
> >> >> >> >>
> >> >> >> >> > Hi Jun,
> >> >> >> >> >
> >> >> >> >> > Thanks for your comments!
> >> >> >> >> >
> >> >> >> >> > JR1:
> >> >> >> >> > You are correct that the segment rolling configurations are
> >> >> currently
> >> >> >> >> > critical to balance the scalability of Diskless and Tiered
> >> >> Storage,
> >> >> >> as
> >> >> >> >> > larger roll configurations benefit tiered storage, and smaller
> >> >> roll
> >> >> >> >> > configurations benefit Diskless.
> >> >> >> >> >
> >> >> >> >> > To address your points specifically:
> >> >> >> >> > (1) A Diskless topic which is cost-competitive with an
> >> equivalent
> >> >> >> >> Classic
> >> >> >> >> > topic will have a metadata size <1% of the data size. A cluster
> >> >> >> storing
> >> >> >> >> > 360GB of metadata will have >36TB of data under management and
> >> a
> >> >> >> >> retention
> >> >> >> >> > of 5hr implies a throughput of >2GB/s. This will require
> >> multiple
> >> >> >> >> Diskless
> >> >> >> >> > coordinators, which can share the load of storing the Diskless
> >> >> >> metadata,
> >> >> >> >> > and serving Diskless requests.
> >> >> >> >> > (2) Catching up consumers are intended to be served from tiered
> >> >> >> storage
> >> >> >> >> > and local segment caches. Brokers which are building their
> >> local
> >> >> >> segment
> >> >> >> >> > caches will have to read many files, but will amortize those
> >> >> reads by
> >> >> >> >> > receiving data for multiple partitions in a single read.
> >> >> >> >> > (3) This is a fundamental downside of storing data from
> >> multiple
> >> >> >> topics
> >> >> >> >> in
> >> >> >> >> > a single object, similar to classic segments. We can implement
> >> a
> >> >> >> >> > configurable cluster-wide maximum roll time, which would set
> >> the
> >> >> >> slowest
> >> >> >> >> > cadence at which Tiered Storage segments are rolled from
> >> Diskless
> >> >> >> >> segments.
> >> >> >> >> > If an individual partition has more aggressive roll settings,
> >> it
> >> >> may
> >> >> >> be
> >> >> >> >> > rolled earlier.
> >> >> >> >> > This configuration would permit the cluster operator to
> >> >> approximately
> >> >> >> >> > bound the number of diskless WAL segments, which bounds the
> >> total
> >> >> >> size
> >> >> >> >> of
> >> >> >> >> > the WAL segments, disk cache, diskless coordinator state, and
> >> >> >> excessive
> >> >> >> >> > retention window. For example, a diskless.segment.ms
> >> >> >> >> > of 15 minutes would
> >> >> >> >> > reduce the metadata storage to 18GB, WAL segments to 1.8TB, and
> >> >> >> permit
> >> >> >> >> > short-retention data to be physically deleted as soon as ~15
> >> >> minutes
> >> >> >> >> after
> >> >> >> >> > being produced.
> >> >> >> >> > Of course, this will reduce the size of the tiered storage
> >> >> segments
> >> >> >> for
> >> >> >> >> > topics that have low throughput, and where segment.ms >
> >> >> >> >> > diskless.segment.ms, increasing overhead in the RLMM. We can
> >> >> >> >> > perform merging/optimization of Tiered Storage segments to
> >> >> >> >> > achieve the per-topic segment.ms.
> >> >> >> >> > There were some reasons why we retracted the prior file-merging
> >> >> >> >> approach,
> >> >> >> >> > and why merging in tiered storage appears better:
> >> >> >> >> > * Rewriting files requires mutability for existing data, which
> >> >> >> >> > adds complexity. Diskless batches or Remote Log Segments would
> >> >> >> >> > need to be made mutable, and the remote log will be made
> >> >> >> >> > mutable in KIP-1272 [1]
> >> >> >> >> > * Because a WAL Segment can contain batches from multiple
> >> >> >> >> > Diskless Coordinators, multiple coordinators must also be
> >> >> >> >> > involved in the merging step. The Tiered Storage design has
> >> >> >> >> > exclusive ownership for remote log segments within the RLMM.
> >> >> >> >> > * Diskless file merging competes for resources with
> >> >> >> >> > latency-sensitive producers and hot consumers. Tiered storage
> >> >> >> >> > file merging competes for resources with lagging consumers,
> >> >> >> >> > which are typically less latency sensitive.
> >> >> >> >> > * Implementing merging in Tiered Storage allows this
> >> >> >> >> > optimization to benefit both classic topics and diskless
> >> >> >> >> > topics, covering both high and low throughput partitions.
> >> >> >> >> > * Remote log segments may be optimized over much longer time
> >> >> >> >> > windows rather than performing optimization once in the first
> >> >> >> >> > few hours of the life of a WAL segment and then freezing the
> >> >> >> >> > arrangement of the data until it is deleted.
> >> >> >> >> > * File merging will need to rely on heuristics, which should
> >> >> >> >> > be configurable by the user. Multi-partition heuristics are
> >> >> >> >> > more complicated to describe and reason about than
> >> >> >> >> > single-partition heuristics.
> >> >> >> >> > What do you think of this alternative?
> >> >> >> >> >
> >> >> >> >> > JR2:
> >> >> >> >> > Yes, the current default partition assignment strategy will
> >> >> >> >> > need some improvement. This problem with Diskless WAL segments
> >> >> >> >> > is analogous to the Classic topics’ dense inter-broker
> >> >> >> >> > connection graph.
> >> >> >> >> > The natural solution to this seems to be some sort of cellular
> >> >> >> >> > design, where the replica placements tend to locate partitions
> >> >> >> >> > in similar groups. Partitions in the same cell can generally
> >> >> >> >> > share the same WAL Segments and the same Diskless Coordinator
> >> >> >> >> > requests. This would also benefit Classic topics, which would
> >> >> >> >> > need fewer connections and fetch requests.
> >> >> >> >> > Such a feature is out-of-scope of this KIP, and either we will
> >> >> >> >> > publish a follow-up KIP, or let operators and community tooling
> >> >> >> >> > address this.
> >> >> >> >> >
> >> >> >> >> > JR3:
> >> >> >> >> > Yes, we will replace the ISR/ELR election logic for diskless
> >> >> >> >> > topics, as they no longer rely on replicas for data integrity.
> >> >> >> >> > We will fully model the state/lifecycle of the diskless
> >> >> >> >> > replicas in KRaft, and choose how we display this to clients.
> >> >> >> >> > For backwards compatibility, clients using older metadata
> >> >> >> >> > requests should see diskless topics, but interpret them as
> >> >> >> >> > classic topics. We could tell older clients that the leader is
> >> >> >> >> > in the ISR, even if it just started building its cache.
> >> >> >> >> > For clients using the latest metadata, they should see the true
> >> >> >> >> > state of the diskless partition: which nodes can accept
> >> >> >> >> > produce/fetch/sharefetch requests, which ranges of offsets are
> >> >> >> >> > cached on-broker, etc. This could also be used to break apart
> >> >> >> >> > the “leader” field into more granular fields, now that
> >> >> >> >> > leadership has changed meaning.
> >> >> >> >> >
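[Editor's aside: the backwards-compatibility projection described in JR3
could look roughly like the sketch below. All names here are hypothetical
illustrations, not APIs from the KIP: it shows old clients a
classic-looking partition whose "leader" is simply some serving node
reported as in-sync, since durability comes from object storage rather
than replication.]

```python
# Hypothetical sketch of projecting diskless partition state into a
# legacy metadata response. Type and field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class DisklessPartitionState:
    serving_nodes: list        # node ids accepting produce/fetch requests
    cached_ranges: dict = field(default_factory=dict)  # node -> (start, end) offsets

def to_legacy_metadata(state: DisklessPartitionState) -> dict:
    # Report any serving node as the "leader" and place it in the ISR,
    # even if it is still warming its cache; older clients then treat
    # the diskless topic like a classic topic.
    leader = state.serving_nodes[0]
    return {"leader": leader, "isr": [leader], "replicas": state.serving_nodes}

s = DisklessPartitionState(serving_nodes=[3, 5], cached_ranges={3: (0, 1000)})
print(to_legacy_metadata(s))  # {'leader': 3, 'isr': [3], 'replicas': [3, 5]}
```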
> >> >> >> >> > JR4:
> >> >> >> >> > Yes, we can replace the empty fetch requests to the leader
> >> >> >> >> > nodes with cache hint fields in the requests to the Diskless
> >> >> >> >> > Coordinator, and rely on the coordinator to distribute cache
> >> >> >> >> > hints to all replicas. This should be low-overhead, and
> >> >> >> >> > eliminate the inter-broker communication for brokers which
> >> >> >> >> > only host Diskless topics.
> >> >> >> >> >
> >> >> >> >> > JR5.1:
> >> >> >> >> > You are correct and this text was ambiguous, only specifying
> >> >> >> >> > that the controller waits for the sync to be complete. This
> >> >> >> >> > section is now updated to explicitly say that local segments
> >> >> >> >> > are built from object storage.
> >> >> >> >> >
> >> >> >> >> > JR5.2:
> >> >> >> >> > Extending the JR2 discussion, reassignment of diskless topics
> >> >> >> >> > would generally happen within a cell, where the marginal cost
> >> >> >> >> > of reading an additional partition is very low. When cells are
> >> >> >> >> > re-balanced and a partition is migrated between cells, there is
> >> >> >> >> > a brief time (until the next Tiered Storage segment roll) when
> >> >> >> >> > the marginal cost is doubled. This should be infrequent and
> >> >> >> >> > well-amortized by other topics which aren’t being re-balanced
> >> >> >> >> > between cells.
> >> >> >> >> >
> >> >> >> >> > JR6.1:
> >> >> >> >> > We plan to move data from Diskless to Tiered Storage. Once the
> >> >> >> >> > data is in Tiered Storage, it can be compacted using the
> >> >> >> >> > functionality described in KIP-1272 [1].
> >> >> >> >> >
> >> >> >> >> > JR6.2:
> >> >> >> >> > We will add details for this soon.
> >> >> >> >> >
> >> >> >> >> > JR7:
> >> >> >> >> > We specify the requirement of eventual consistency to allow
> >> >> >> >> > Diskless Topics to be used with other object storage
> >> >> >> >> > implementations which aren’t the three major public clouds,
> >> >> >> >> > such as self-managed software or weaker consistency caches.
> >> >> >> >> >
> >> >> >> >> > Thanks,
> >> >> >> >> > Greg
> >> >> >> >> >
> >> >> >> >> > [1]
> >> >> >> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage
> >> >> >> >> >
> >> >> >> >> > On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev
> >> >> >> >> > <[email protected]> wrote:
> >> >> >> >> >
> >> >> >> >> >> Hi, Ivan,
> >> >> >> >> >>
> >> >> >> >> >> Thanks for the KIP. A few comments below.
> >> >> >> >> >>
> >> >> >> >> >> JR1. I am concerned about the usage of the current tiered
> >> >> >> >> >> storage to control the number of small WAL files. Current
> >> >> >> >> >> tiered storage only tiers the data when a segment rolls,
> >> >> >> >> >> which can take hours. This causes three problems. (1) Much
> >> >> >> >> >> more metadata needs to be stored and maintained, which
> >> >> >> >> >> increases the cost. Suppose that each segment rolls every 5
> >> >> >> >> >> hours, each partition generates 2 WAL files per second and
> >> >> >> >> >> each WAL file's metadata takes 100 bytes. Each partition
> >> >> >> >> >> will generate 5 * 3.6K * 2 * 100 = 3.6MB of metadata. In a
> >> >> >> >> >> cluster with 100K partitions, this translates to 360GB of
> >> >> >> >> >> metadata stored on the diskless coordinators. (2) A
> >> >> >> >> >> catching-up consumer's performance degrades since it's
> >> >> >> >> >> forced to read data from many small WAL files. (3) The data
> >> >> >> >> >> in WAL files could be retained much longer than the
> >> >> >> >> >> retention time. Since the small WAL files aren't completely
> >> >> >> >> >> deleted until all partitions' data in them is obsolete, the
> >> >> >> >> >> deletion of the WAL files could be delayed by hours or more.
> >> >> >> >> >> If the WAL file includes a partition with a low retention
> >> >> >> >> >> time, the retention contract could be violated
> >> >> >> >> >> significantly. The earlier design of the KIP included a
> >> >> >> >> >> separate object merging process that combines small WAL
> >> >> >> >> >> files much more aggressively than tiered storage, which
> >> >> >> >> >> seems to be a much better choice.
> >> >> >> >> >>
> >> >> >> >> >> JR2. I don't think the current default partition assignment
> >> >> >> >> >> strategy for classic topics works for diskless topics. The
> >> >> >> >> >> current strategy tries to spread the replicas to as many
> >> >> >> >> >> brokers as possible. For example, if a broker has 100
> >> >> >> >> >> partitions, their replicas could be spread over 100 brokers.
> >> >> >> >> >> If the broker generates a WAL file with 100 partitions, this
> >> >> >> >> >> WAL file will be read 100 times, once by each broker. The S3
> >> >> >> >> >> read cost is 1/12 of the cost of an S3 put. This assignment
> >> >> >> >> >> strategy will increase the S3 cost by about 8X, which is
> >> >> >> >> >> prohibitive. We need to design a cost-effective assignment
> >> >> >> >> >> strategy for diskless topics.
> >> >> >> >> >>
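[Editor's aside: the ~8X figure in JR2 can be reproduced with a quick
sketch. The 1/12 read-to-put price ratio is the figure quoted in the
message above, not a statement about current S3 pricing; function names
are illustrative.]

```python
# Rough model of the S3 cost amplification described in JR2.
# Assumption from the thread: an S3 GET costs about 1/12 of a PUT.
READ_TO_PUT_RATIO = 1 / 12

def s3_cost_multiplier(readers_per_wal_file: int) -> float:
    """Total request cost per WAL file, in PUT units, relative to the
    single PUT it takes to write the file."""
    put_cost = 1.0
    read_cost = readers_per_wal_file * READ_TO_PUT_RATIO
    return (put_cost + read_cost) / put_cost

# With replicas spread over 100 brokers, each WAL file is read ~100
# times, adding roughly 8.3 PUT-equivalents of read cost per file:
extra = s3_cost_multiplier(100) - 1
print(round(extra, 1))  # 8.3
```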
> >> >> >> >> >> JR3. We need to think through the leader election logic with
> >> >> >> >> >> diskless topics. The KIP tries to reuse the ISR logic for
> >> >> >> >> >> classic topics, but it doesn't seem very natural.
> >> >> >> >> >> JR3.1 In classic topics, the leader is always in the ISR. In
> >> >> >> >> >> diskless topics, the KIP says that a leader could be out of
> >> >> >> >> >> sync.
> >> >> >> >> >> JR3.2 The existing leader election logic based on ISR/ELR
> >> >> >> >> >> mainly tries to preserve previously acknowledged data. With
> >> >> >> >> >> diskless topics, since the object store provides durability,
> >> >> >> >> >> this logic seems no longer needed. The existing min.isr and
> >> >> >> >> >> unclean leader election logic also don't apply.
> >> >> >> >> >>
> >> >> >> >> >> JR4. "Despite that there is no inter-broker replication,
> >> >> >> >> >> replicas will still issue FetchRequest to leaders. Leaders
> >> >> >> >> >> will respond with empty (no records) FetchResponse."
> >> >> >> >> >> This seems unnatural. Could we avoid issuing inter-broker
> >> >> >> >> >> fetch requests for diskless topics?
> >> >> >> >> >>
> >> >> >> >> >> JR5. "The replica reassignment will follow the same flow as
> >> >> >> >> >> in classic topic:"
> >> >> >> >> >> JR5.1 Is this true? Since the inter-broker fetch response is
> >> >> >> >> >> always empty, it doesn't seem the current reassignment flow
> >> >> >> >> >> works for diskless topics. Also, since the source of the
> >> >> >> >> >> data is the object store, it seems more natural for a
> >> >> >> >> >> replica to backfill the data from the object store, instead
> >> >> >> >> >> of from other replicas. This will also incur lower costs.
> >> >> >> >> >> JR5.2 How do we prevent reassignment on diskless topics from
> >> >> >> >> >> causing the same cost issue described in JR2?
> >> >> >> >> >>
> >> >> >> >> >> JR6. "In other functional aspects, diskless topics are
> >> >> >> >> >> indistinguishable from classic topics. This includes
> >> >> >> >> >> durability guarantees, ordering guarantees, transactional
> >> >> >> >> >> and non-transactional producer API, consumer API, consumer
> >> >> >> >> >> groups, share groups, data retention (deletion & compact),"
> >> >> >> >> >> JR6.1 Could you describe how compacted diskless topics are
> >> >> >> >> >> supported?
> >> >> >> >> >> JR6.2 Neither this KIP nor KIP-1164 describes the
> >> >> >> >> >> transactional support in detail.
> >> >> >> >> >>
> >> >> >> >> >> JR7. "Object Storage: A shared, durable, concurrent, and
> >> >> >> >> >> eventually consistent storage supporting arbitrary sized
> >> >> >> >> >> byte values and a minimal set of atomic operations: put,
> >> >> >> >> >> delete, list, and ranged get."
> >> >> >> >> >> It seems that the object storage in all three major public
> >> >> >> >> >> clouds is strongly consistent.
> >> >> >> >> >>
> >> >> >> >> >> Jun
> >> >> >> >> >>
> >> >> >> >> >> On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko
> >> >> >> >> >> <[email protected]> wrote:
> >> >> >> >> >>
> >> >> >> >> >> > Hi all,
> >> >> >> >> >> >
> >> >> >> >> >> > The parent KIP-1150 was voted for and accepted. Let's now
> >> >> >> >> >> > focus on the technical details presented in this KIP-1163
> >> >> >> >> >> > and also in KIP-1164: Diskless Coordinator [1].
> >> >> >> >> >> >
> >> >> >> >> >> > Best,
> >> >> >> >> >> > Ivan
> >> >> >> >> >> >
> >> >> >> >> >> > [1]
> >> >> >> >> >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator
> >> >> >> >> >> >
> >> >> >> >> >> > On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:
> >> >> >> >> >> > > Hi all!
> >> >> >> >> >> > >
> >> >> >> >> >> > > We want to start the discussion thread for KIP-1163:
> >> >> >> >> >> > > Diskless Core [1], which is a sub-KIP for KIP-1150 [2].
> >> >> >> >> >> > >
> >> >> >> >> >> > > Let's use the main KIP-1150 discuss thread [3] for
> >> >> >> >> >> > > high-level questions, motivation, and general direction
> >> >> >> >> >> > > of the feature, and this thread for particular details
> >> >> >> >> >> > > of implementation.
> >> >> >> >> >> > >
> >> >> >> >> >> > > Best,
> >> >> >> >> >> > > Ivan
> >> >> >> >> >> > >
> >> >> >> >> >> > > [1]
> >> >> >> >> >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
> >> >> >> >> >> > > [2]
> >> >> >> >> >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> >> >> >> >> >> > > [3] https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d
> >> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >
> >>
> >
