Re: [DISCUSS] KIP-1163: Diskless Core

Jun Rao via dev Tue, 21 Apr 2026 11:16:51 -0700

Hi, Victor,

Thanks for the reply.


JR1.
1)  "So while it seems to be significant that we tripled the number of
PUTs, cost-wise it doesn't seem to be significant."
Let's compare the savings achieved by replacing network replication
transfer with S3 puts in AWS.
network transfer cost: $0.02/GB = $2 * 10^-5/MB
S3 put cost: $0.005 per 1000 requests = $0.5 * 10^-5/request

The KIP batches data up to 4MB. So, let's assume that we write 2MB S3
objects on average.

The cost for transferring 2MB through the network is 2 * 2 * 10^-5 = $4*
10^-5
If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings are
about 75%.
If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings are
25%. As you can see, the savings are significantly lower.

2) "Therefore we could expect classic local segments to be present which
could be used for catching up consumers."
Note that local storage could be lost on reassigned partitions. In that
case, lagging reads can only be served from the object store.

"Regarding the amount of metadata: 2MB/sec is well below the 2GB/s
throughput that Greg calculated previously, so I think it should be
manageable for a cluster with that amount of throughput,"
It seems that you didn't make the correct comparison. 2GB/s that Greg
mentioned is the throughput for the whole cluster. The 2MB/sec I quoted is
for a specific broker. Depending on the broker instance type, a broker may
only be able to handle low 10s of MB/sec of data. So, 2MB/sec overhead is
significant.

3) "I'd separate it from the discussion of diskless core and perhaps we
could address it in a separate KIP as it is mostly a redesign of the RLMM."
Those problems don't exist in the existing usage of RLMM. They manifest
because diskless tries to use RLMM in a way it wasn't designed for (there
is at least a 20X increase in metadata). It would be useful to consider
whether fixing those problems in RLMM or using a new approach is
better. For example, KIP-1164 already introduces a snapshotting mechanism.
Adding another snapshotting mechanism to RLMM seems redundant.

JR7. A typical object store supports 3 operations: puts, gets and lists.
Which operations used by diskless can be eventually consistent? I'd expect
that get should always see the result of the latest put.

Jun

On Mon, Apr 20, 2026 at 8:14 AM Viktor Somogyi-Vass <[email protected]>
wrote:

> Hi Jun,
>
> I'd like to add my thoughts too until Greg has time to respond.
>
> JR1. I also think there are shortcomings in the current tiered storage
> design, around the RLMM.
> 1) I think this is a correct observation, however if my calculations are
> correct, it actually comes down to a negligible amount of cost. Taking the
> AWS pricing sheet at
> https://aws.amazon.com/s3/pricing/?nc2=h_pr_s3&trk=aebc39a1-139c-43bb-8354-211ac811b83a&sc_channel=ps
> it seems like the difference between 6 or 2 PUTs per second is ~$52 for a
> month. The calculation follows
> as: 6*60*60*24*30*0.005/1000-2*60*60*24*30*0.005/1000=$51.84. So while it
> seems to be significant that we tripled the number of PUTs, cost-wise it
> doesn't seem to be significant.
> 2) Reflecting to your original problem: the tiered storage consolidation
> process should be continuously running and transforming WAL segments into
> classic logs. Therefore we could expect classic local segments to be
> present which could be used for catching up consumers. So they would only
> switch to WAL reading when they're close to the end of the log. Since this
> offset space should be cached, the reads from there should be fast.
> Regarding the amount of metadata: 2MB/sec is well below the 2GB/s
> throughput that Greg calculated previously, so I think it should be
> manageable for a cluster with that amount of throughput, although I agree
> with your comment that the current topic based tiered metadata manager
> isn't optimal and we could develop a better solution.
> 3) Tied to the previous point, I agree that your comments are absolutely
> valid, however similarly to that, I'd separate it from the discussion of
> diskless core and perhaps we could address it in a separate KIP as it is
> mostly a redesign of the RLMM.
>
> JR2. Ack. We will raise a KIP in the near future.
>
> JR3. I'd leave answering this to Greg as I don't have too much context on
> this one.
>
> JR7. I think this could be similar to the tiered storage design, so any
> coordinator operation should be strongly consistent (since we're using
> classic topics there). Therefore the WAL segment storage layer could be
> eventually consistent as we store its metadata in a strongly consistent
> manner. I'm not sure though if this was the answer you're looking for?
>
> Best,
> Viktor
>
>
>
> On Thu, Mar 26, 2026 at 11:43 PM Jun Rao via dev <[email protected]>
> wrote:
>
>> Hi, Greg,
>>
>> Thanks for the reply.
>>
>> JR1. Rolling log segments every 15 minutes addresses the 3 concerns I
>> listed, but it introduces some new issues because it doesn't quite fit the
>> design of the current tiered storage. (a) The current tiered storage
>> design
>> stores a single partition per object. If we roll a log segment every 15
>> minutes, with 4K partitions per broker, this means an additional 4 S3 puts
>> per second. The diskless design aims for 2 S3 puts per second. So, this
>> triples the S3 put cost and reduces the savings benefits. (b) With Tier
>> storage, each broker essentially needs to read the tier metadata from all
>> tier metadata partitions if the number of user partitions exceeds 50.
>> Assuming that we generate 100 bytes of tier metadata per partition every
>> 15
>> minutes. Assuming that each broker has 4K partitions and a cluster of 500
>> brokers. Each broker needs to receive tier metadata at a rate of 100 * 4K
>> *
>> 500 / (15 * 60) = 200KB/Sec. For a broker hosting one of the 50 tier
>> metadata topic partitions, it needs to send out metadata at 100 * 4K * 500
>> / 50 * 500 / (15 * 60) = 2MB/Sec. This increases unnecessary network and
>> CPU overhead. (c) Tier storage doesn't support snapshots. A restarted
>> broker needs to replay the tier metadata log from the beginning to build
>> the tier metadata state. Suppose that the tier metadata log is kept for 7
>> days. The total amount of tier metadata that needs to be replayed is 200KB
>> * 7 * 24 * 3600 = 120GB.
>> Does the merging optimization you mentioned address those new concerns? If
>> so, could you describe how it works?
>>
>> JR2. It's fine to cover the default partition assignment strategy for
>> diskless topics in a separate KIP. However, since this is essential for
>> achieving the cost saving goal, we need a solution before releasing the
>> diskless KIP.
>>
>> JR3. Sounds good. Could you document how this work?
>>
>> JR7. Could you describe which parts of the operation can be eventually
>> consistent?
>>
>> Jun
>>
>> On Thu, Mar 19, 2026 at 1:35 PM Greg Harris <[email protected]> wrote:
>>
>> > Hi Jun,
>> >
>> > Thanks for your comments!
>> >
>> > JR1:
>> > You are correct that the segment rolling configurations are currently
>> > critical to balance the scalability of Diskless and Tiered Storage, as
>> > larger roll configurations benefit tiered storage, and smaller roll
>> > configurations benefit Diskless.
>> >
>> > To address your points specifically:
>> > (1) A Diskless topic which is cost-competitive with an equivalent
>> Classic
>> > topic will have a metadata size <1% of the data size. A cluster storing
>> > 360GB of metadata will have >36TB of data under management and a
>> retention
>> > of 5hr implies a throughput of >2GB/s. This will require multiple
>> Diskless
>> > coordinators, which can share the load of storing the Diskless metadata,
>> > and serving Diskless requests.
>> > (2) Catching up consumers are intended to be served from tiered storage
>> > and local segment caches. Brokers which are building their local segment
>> > caches will have to read many files, but will amortize those reads by
>> > receiving data for multiple partitions in a single read.
>> > (3) This is a fundamental downside of storing data from multiple topics
>> in
>> > a single object, similar to classic segments. We can implement a
>> > configurable cluster-wide maximum roll time, which would set the slowest
>> > cadence at which Tiered Storage segments are rolled from Diskless
>> segments.
>> > If an individual partition has more aggressive roll settings, it may be
>> > rolled earlier.
>> > This configuration would permit the cluster operator to approximately
>> > bound the number of diskless WAL segments, which bounds the total size
>> of
>> > the WAL segments, disk cache, diskless coordinator state, and excessive
>> > retention window. For example, a diskless.segment.ms of 15 minutes
>> would
>> > reduce the metadata storage to 18GB, WAL segments to 1.8TB, and permit
>> > short-retention data to be physically deleted as soon as ~15 minutes
>> after
>> > being produced.
>> > Of course, this will reduce the size of the tiered storage segments for
>> > topics that have low throughput, and where segment.ms >
>> > diskless.segment.ms, increasing overhead in the RLMM. We can perform
>> > merging/optimization of Tiered Storage segments to achieve the per-topic
>> > segment.ms.
>> > There were some reasons why we retracted the prior file-merging
>> approach,
>> > and why merging in tiered storage appears better:
>> > * Rewriting files requires mutability for existing data, which adds
>> > complexity. Diskless batches or Remote Log Segments would need to be
>> made
>> > mutable, and the remote log will be made mutable in KIP-1272 [1]
>> > * Because a WAL Segment can contain batches from multiple Diskless
>> > Coordinators, multiple coordinators must also be involved in the merging
>> > step. The Tiered Storage design has exclusive ownership for remote log
>> > segments within the RLMM.
>> > * Diskless file merging competes for resources with latency-sensitive
>> > producers and hot consumers. Tiered storage file merging competes for
>> > resources with lagging consumers, which are typically less latency
>> > sensitive.
>> > * Implementing merging in Tiered Storage allows this optimization to
>> > benefit both classic topics and diskless topics, covering both high and
>> low
>> > throughput partitions.
>> > * Remote log segments may be optimized over much longer time windows
>> > rather than performing optimization once in the first few hours of the
>> life
>> > of a WAL segment and then freezing the arrangement of the data until it
>> is
>> > deleted.
>> > * File merging will need to rely on heuristics, which should be
>> > configurable by the user. Multi-partition heuristics are more
>> complicated
>> > to describe and reason about than single-partition heuristics.
>> > What do you think of this alternative?
>> >
>> > JR2:
>> > Yes, the current default partition assignment strategy will need some
>> > improvement. This problem with Diskless WAL segments is analogous to the
>> > Classic topics’ dense inter-broker connection graph.
>> > The natural solution to this seems to be some sort of cellular design,
>> > where the replica placements tend to locate partitions in similar
>> groups.
>> > Partitions in the same cell can generally share the same WAL Segments
>> and
>> > the same Diskless Coordinator requests. This would also benefit Classic
>> > topics, which would need fewer connections and fetch requests.
>> > Such a feature is out-of-scope of this KIP, and either we will publish a
>> > follow-up KIP, or let operators and community tooling address this.
>> >
>> > JR3:
>> > Yes we will replace the ISR/ELR election logic for diskless topics, as
>> > they no longer rely on replicas for data integrity. We will fully model
>> the
>> > state/lifecycle of the diskless replicas in KRaft, and choose how we
>> > display this to clients.
>> > For backwards compatibility, clients using older metadata requests
>> should
>> > see diskless topics, but interpret them as classic topics. We could tell
>> > older clients that the leader is in the ISR, even if it just started
>> > building its cache.
>> > For clients using the latest metadata, they should see the true state of
>> > the diskless partition: which nodes can accept produce/fetch/sharefetch
>> > requests, which ranges of offsets are cached on-broker, etc. This could
>> > also be used to break apart the “leader” field into more granular
>> fields,
>> > now that leadership has changed meaning.
>> >
>> > JR4:
>> > Yes, we can replace the empty fetch requests to the leader nodes with
>> > cache hint fields in the requests to the Diskless Coordinator, and rely
>> on
>> > the coordinator to distribute cache hints to all replicas. This should
>> be
>> > low-overhead, and eliminate the inter-broker communication for brokers
>> > which only host Diskless topics.
>> >
>> > JR5.1:
>> > You are correct and this text was ambiguous, only specifying that the
>> > controller waits for the sync to be complete. This section is now
>> updated
>> > to explicitly say that local segments are built from object storage.
>> >
>> > JR5.2:
>> > Extending the JR2 discussion, reassignment of diskless topics would
>> > generally happen within a cell, where the marginal cost of reading an
>> > additional partition is very low. When cells are re-balanced and a
>> > partition is migrated between cells, there is a brief time (until the
>> next
>> > Tiered Storage segment roll) when the marginal cost is doubled. This
>> should
>> > be infrequent and well-amortized by other topics which aren’t being
>> > re-balanced between cells.
>> >
>> > JR6.1:
>> > We plan to move data from Diskless to Tiered Storage. Once the data is
>> in
>> > Tiered Storage, it can be compacted using the functionality described in
>> > KIP-1272 [1]
>> >
>> > JR6.2:
>> > We will add details for this soon.
>> >
>> > JR7:
>> > We specify the requirement of eventual consistency to allow Diskless
>> > Topics to be used with other object storage implementations which aren’t
>> > the three major public clouds, such as self-managed software or weaker
>> > consistency caches.
>> >
>> > Thanks,
>> > Greg
>> >
>> > [1]
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage
>> >
>> > On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev <[email protected]>
>> > wrote:
>> >
>> >> Hi, Ivan,
>> >>
>> >> Thanks for the KIP. A few comments below.
>> >>
>> >> JR1. I am concerned about the usage of the current tiered storage to
>> >> control the number of small WAL files. Current tiered storage only
>> tiers
>> >> the data when a segment rolls, which can take hours. This causes three
>> >> problems. (1) Much more metadata needs to be stored and maintained,
>> which
>> >> increases the cost. Suppose that each segment rolls every 5 hours, each
>> >> partition generates 2 WAL files per second and each WAL file's metadata
>> >> takes 100 bytes. Each partition will generate 5 * 3.6K * 2 * 100 =
>> 3.6MB
>> >> of
>> >> metadata. In a cluster with 100K partitions, this translates to 360GB
>> of
>> >> metadata stored on the diskless coordinators. (2) A catching-up
>> consumer's
>> >> performance degrades since it's forced to read data from many small WAL
>> >> files. (3) The data in WAL files could be retained much longer than
>> >> retention time. Since the small WAL files aren't completely deleted
>> until
>> >> all partitions' data in it are obsolete, the deletion of the WAL files
>> >> could be delayed by hours or more. If the WAL file includes a partition
>> >> with a low retention time, the retention contract could be violated
>> >> significantly. The earlier design of the KIP included a separate object
>> >> merging process that combines small WAL files much more aggressively
>> than
>> >> tiered storage, which seems to be a much better choice.
>> >>
>> >> JR2. I don't think the current default partition assignment strategy
>> for
>> >> classic topics works for diskless topics. Current strategy tries to
>> spread
>> >> the replicas to as many brokers as possible. For example, if a broker
>> has
>> >> 100 partitions, their replicas could be spread over 100 brokers. If the
>> >> broker generates a WAL file with 100 partitions, this WAL file will be
>> >> read
>> >> 100 times, once by each broker. S3 read cost is 1/12 of the cost of S3
>> >> put.
>> >> This assignment strategy will increase the S3 cost by about 8X, which
>> is
>> >> prohibitive. We need to design a cost effective assignment strategy for
>> >> diskless topics.
>> >>
>> >> JR3. We need to think through the leade election logic with diskless
>> >> topic.
>> >> The KIP tries to reuse the ISR logic for class topic, but it doesn't
>> seem
>> >> very natural.
>> >> JR3.1 In classsic topic, the leader is always in ISR. In the diskless
>> >> topic, the KIP says that a leader could be out of sync.
>> >> JR3.2 The existing leader election logic based on ISR/ELR mainly
>> retries
>> >> to
>> >> preserve previously acknowledged data. With diskless topics, since the
>> >> object store provides durability, this logic seems no longer needed.
>> The
>> >> existing min.isr and unclean leader election logic also don't apply.
>> >>
>> >> JR4. "Despite that there is no inter-broker replication, replicas will
>> >> still issue FetchRequest to leaders. Leaders will respond with empty
>> (no
>> >> records) FetchResponse."
>> >> This seems unnatural. Could we avoid issuing inter broker fetch
>> requests
>> >> for diskless topics?
>> >>
>> >> JR5. "The replica reassignment will follow the same flow as in classic
>> >> topic:".
>> >> JR5.1 Is this true? Since inter broker fetch response is alway empty,
>> it
>> >> doesn't seem the current reassignment flow works for diskless topic.
>> Also,
>> >> since the source of the data is object store, it seems more natural
>> for a
>> >> replica to back fill the data from the object store, instead of other
>> >> replicas. This will also incur lower costs.
>> >> JR5.2 How do we prevent reassignment on diskless topics from causing
>> the
>> >> same cost issue described in JR2?
>> >>
>> >> JR6." In other functional aspects, diskless topics are
>> indistinguishable
>> >> from classic topics. This includes durability guarantees, ordering
>> >> guarantees, transactional and non-transactional producer API, consumer
>> >> API,
>> >> consumer groups, share groups, data retention (deletion & compact),"
>> >> JR6.1 Could you describe how compact diskless topics are supported?
>> >> JR6.2 Neither this KIP nor KIP-1164 describes the transactional
>> support in
>> >> detail.
>> >>
>> >> JR7. "Object Storage: A shared, durable, concurrent, and eventually
>> >> consistent storage supporting arbitrary sized byte values and a minimal
>> >> set
>> >> of atomic operations: put, delete, list, and ranged get."
>> >> It seems that the object storage in all three major public clouds are
>> >> strongly consistent.
>> >>
>> >> Jun
>> >>
>> >> On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko <[email protected]> wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > The parent KIP-1150 was voted for and accepted. Let's now focus on
>> the
>> >> > technical details presented in this KIP-1163 and also in KIP-1164:
>> >> Diskless
>> >> > Coordinator  [1].
>> >> >
>> >> > Best,
>> >> > Ivan
>> >> >
>> >> > [1]
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator
>> >> >
>> >> > On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:
>> >> > > Hi all!
>> >> > >
>> >> > > We want to start the discussion thread for KIP-1163: Diskless Core
>> >> [1],
>> >> > which is a sub-KIP for KIP-1150 [2].
>> >> > >
>> >> > > Let's use the main KIP-1150 discuss thread [3] for high-level
>> >> questions,
>> >> > motivation, and general direction of the feature and this thread for
>> >> > particular details of implementation.
>> >> > >
>> >> > > Best,
>> >> > > Ivan
>> >> > >
>> >> > > [1]
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
>> >> > > [2]
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
>> >> > > [3]
>> https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d
>> >> >
>> >>
>> >
>>
>

Re: [DISCUSS] KIP-1163: Diskless Core

Reply via email to