Hi, Anatolii,

Thanks for the updated KIP. Looks good to me overall. A few more comments.

JR10. "Even in the case where network traffic is not accounted; we believe
that the operational benefits of Diskless topics are still appealing to
Kafka users."
Could you list other benefits such as better scalability and durability?

JR11. "AWS charges 0.02$ per GiB"
The object store is not free and has a different cost model based on
requests. It would be useful to summarize the tradeoff on cost and latency,
and when diskless topics are cost-effective.
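
To illustrate the kind of summary I mean, a back-of-the-envelope comparison
could look like the sketch below (Java, with placeholder prices and rates,
and the GET/read side ignored):

public class DisklessCostSketch {
    // Placeholder unit prices; substitute the actual provider pricing.
    static final double CROSS_AZ_USD_PER_GIB = 0.02;  // inter-AZ transfer
    static final double PUT_USD_PER_1000 = 0.005;     // object store PUT requests

    public static void main(String[] args) {
        double ingressMiBPerSec = 50.0;      // produce traffic into one broker
        double walUploadIntervalSec = 0.25;  // one combined WAL object per interval
        double gibPerHour = ingressMiBPerSec * 3600 / 1024;

        // Classic topic, RF=3 spread over 3 AZs: each produced byte crosses a
        // zone boundary twice for replication (client zone alignment ignored).
        double classicUsdPerHour = gibPerHour * 2 * CROSS_AZ_USD_PER_GIB;

        // Diskless: replication traffic is replaced by PUTs of WAL objects.
        double putsPerHour = 3600 / walUploadIntervalSec;
        double disklessUsdPerHour = putsPerHour / 1000 * PUT_USD_PER_1000;

        System.out.printf("classic cross-AZ: $%.2f/h, diskless PUTs: $%.2f/h%n",
                classicUsdPerHour, disklessUsdPerHour);
    }
}

The interesting part for the KIP is where GET traffic, the batch coordinator
overhead and the added latency move that break-even point.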

JR12. Are the followup KIPs up to date? KIP-1164 and KIP-1181 haven't been
updated in the last 6 months. The metadata change in KIP-1181 is now
included in KIP-1163.

Jun

On Fri, Jan 9, 2026 at 5:09 AM Anatolii Popov via dev <[email protected]>
wrote:

> Hi all,
>
> Thank you for the extensive feedback.
>
> We have substantially updated the KIP to address the points raised. Given
> the scope of these changes, we have started a new VOTE thread to restart
> the voting process cleanly.
>
> You can find the new thread here:
> https://lists.apache.org/thread/m42nj2qm5z4w2x8kt7x4kgghfzrdwl7q
>
> Best,
> Anatolii
>
> On Mon, Dec 15, 2025 at 9:47 PM Thomas Thornton via dev <
> [email protected]> wrote:
>
> > Hi all,
> >
> > We (the team at Slack) have been following the recent discussion
> regarding
> > the scope and timeline of KIP-1150. We agree with the community sentiment
> > that Diskless Topics represents the right long-term architecture for
> Kafka
> > in the cloud, but we also recognize the valid concerns raised regarding
> the
> > engineering resources required to deliver such an ambitious change.
> >
> > To help address these concerns and accelerate the timeline, we are happy
> to
> > announce that we are partnering with the KIP-1150 authors to co-develop
> > this feature.
> >
> > We previously proposed KIP-1176: Tiered Storage for Active Log Segments
> to
> > solve similar problems. However, rather than fragmenting the community's
> > efforts across competing designs, we have decided to withdraw KIP-1176
> and
> > consolidate our engineering resources behind KIP-1150.
> >
> > To start, we plan to take ownership of Compaction for Tiered Storage.
> This
> > has long been a missing feature in KIP-405, and it becomes critical in a
> > Diskless architecture where long-lived data must be tiered. By
> > driving this related prerequisite feature, we hope to allow for faster
> > delivery of KIP-1150.
> >
> > We are excited to collaborate on this to ensure a robust and timely
> > delivery for the community.
> >
> > Best,
> > Tom & Henry
> >
> > On Fri, Nov 14, 2025 at 8:35 AM Luke Chen <[email protected]> wrote:
> >
> > > Hi Greg,
> > >
> > > Thanks for sharing the meeting notes.
> > > I agree we should keep polishing the contents of 1150 & high level
> design
> > > in 1163 to prepare for a vote.
> > >
> > >
> > > Thanks.
> > > Luke
> > >
> > > On Fri, Nov 14, 2025 at 3:54 AM Greg Harris
> <[email protected]
> > >
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > There was a video call between myself, Ivan Yurchenko, Jun Rao, and
> > > Andrew
> > > > Schofield pertaining to KIP-1150. Here are the notes from that
> meeting:
> > > >
> > > > Ivan: What is the future state of Kafka in this area, in 5 years?
> > > > Jun: Do we want something more cloud native? Yes, started with Tiered
> > > > Storage. If there’s a better way, we should explore it. In the long
> > term
> > > > this will be useful
> > > > Because Kafka is used so widely, we need to make sure everything we
> add
> > > is
> > > > for the long term and for everyone, not just for a single company.
> > > > When we add TS, it doesn’t just solve Uber’s use-case. We want
> > something
> > > > that’s high quality/lasts/maintainable, and can work with all
> existing
> > > > capabilities.
> > > > If both 1150 and 1176 proceed at the same time, it’s confusing. They
> > > > overlap, but Diskless is more ambitious.
> > > > If both KIPs are being seriously worked on, then we don’t really need
> > > both,
> > > > because Diskless clearly is better. Having multiple will confuse
> > people.
> > > It
> > > > will duplicate some of the effort.
> > > > If we want diskless ultimately, what is the short term strategy, to
> get
> > > > some early wins first?
> > > > Ivan: Andrew, do you want a more revolutionary approach?
> > > > Andrew: Eventually the architecture will change substantially, it may
> > not
> > > > be necessary to put all of that bill onto Diskless at once.
> > > > Greg: We all agree on having a high quality feature merged upstream,
> > and
> > > > supporting all APIs
> > > > Jun: We should try and keep things simple, but there is some minimum
> > > > complexity needed.
> > > > When doing the short term changes (1176), it doesn’t really progress
> in
> > > > changing to a more modern architecture.
> > > > Greg: Was TS+Compaction the only feature miss we’ve had so far?
> > > > Jun: The danger of only applying changes to some part of the API is that you
> > set
> > > > the precedent that you only have to implement part of the API.
> > > Supporting
> > > > the full API set should be a minimum requirement.
> > > > Andrew: When we started KRaft, how much did we know of the design?
> > > > Jun: For KRaft we didn’t really know much about the migration, but
> the
> > > > high-level was clear.
> > > > Greg: Is 1150 votable in its current state?
> > > > Jun: 1150 should promise to support all APIs. It doesn’t have to have
> > all
> > > > the details/apis/etc. KIP-500 didn’t have it.
> > > > We do need some high-level design enough to give confidence that the
> > > > promise is able to be fulfilled.
> > > > Greg: Is the draft version in 1163 enough detail or is more needed?
> > > > Jun: We need to agree on the core design, such as leaderless etc. And
> > how
> > > > the APIs will be supported.
> > > > Greg: Okay we can include these things, and provide a sketch of how
> the
> > > > other leader-based features operate.
> > > > Jun: Yeah if at a high level the sketch appears to work, we can
> approve
> > > > that functionality.
> > > > Are you committed to doing the more involved and big project?
> > > > Greg: Yes, we’re committed to the 1163 design and can’t really accept
> > > 1176.
> > > > Jun: TS was slow because of Uber resourcing problems
> > > > Greg: We’ll push internally for resources, and use the community
> > > sentiment
> > > > to motivate Aiven.
> > > > How far into the future should we look? What sort of scale?
> > > > Jun: As long as there’s a path forward, and we’re not closing off
> > future
> > > > improvements, we can figure out how to handle a larger scale when it
> > > > arises.
> > > > Greg: Random replica placement is very harmful, can we recommend
> users
> > to
> > > > use an external tool like CruiseControl?
> > > > Jun: Not everyone uses CruiseControl, we would probably need some
> > > solution
> > > > for this out of the box
> > > > Ivan: Should the Batch Coordinator be pluggable?
> > > > Jun: Out-of-box experience should be good, good to allow other
> > > > implementations
> > > > Greg: But it could hurt Kafka feature/upgrade velocity when we wait
> for
> > > > plugin providers to implement it
> > > > Ivan: We imagined that maybe cloud hyperscalers could implement it
> with
> > > > e.g. dynamodb
> > > > Greg: Could we bake more details of the different providers into
> Kafka,
> > > or
> > > > does it still make sense for it to be pluggable?
> > > > Jun: Make it whatever is easiest to roll out and add new clients
> > > > Andrew: What happens next? Do you want to get KIP-1150 voted?
> > > > Ivan: The vote is already open, we’re not too pressed for time. We’ll
> > go
> > > > improve the 1163 design and communication.
> > > > Is 1176 a competing design? Someone will ask.
> > > > Jun: If we are seriously working on something more ambitious, yeah we
> > > > shouldn’t do the stop-gap solution.
> > > > It’s diverting review resources. If we can get the short term thing
> in
> > > 1yr
> > > > but Diskless solution is 2y it makes sense to go for Diskless. If
> it’s
> > > 5yr,
> > > > that’s different and maybe the stop-gap solution is needed.
> > > > Greg: I’m biased but I believe we’re in the 1yr/2yr case. Should we
> > > > explicitly exclude 1176?
> > > > Andrew: Put your arms around the feature set you actually want, and
> use
> > > > that to rule out 1176.
> > > > Probably don’t need -1 votes, most likely KIPs just don’t receive
> > votes.
> > > > Ivan: Should we have sync meetings like tiered storage did?
> > > > Jun: Satish posted meeting notes regularly, we should do the same.
> > > >
> > > > To summarize, we will be polishing the contents of 1150 & high level
> > > design
> > > > in 1163 to prepare for a vote.
> > > > We believe that the community should select the feature set of 1150
> to
> > > > fully eliminate producer cross-zone costs, and make the investment
> in a
> > > > high quality Diskless Topics implementation rather than in stop-gap
> > > > solutions.
> > > >
> > > > Thanks,
> > > > Greg
> > > >
> > > > On Fri, Nov 7, 2025 at 9:19 PM Max fortun <[email protected]> wrote:
> > > >
> > > > > This may be a tangent, but we needed to offload storage off of
> Kafka
> > > into
> > > > > S3. We are keeping Kafka not as a source of truth, but as a mostly
> > > > > ephemeral broker that can come and go as it pleases. Be that
> scaling
> > or
> > > > > outage. Disks can be destroyed and recreated at will, we still
> retain
> > > > data
> > > > > and use broker for just that, brokering messages. Not only that, we
> > > > reduced
> > > > > the requirement on the actual Kafka resources by reducing the size
> > of a
> > > > > payload via a claim check pattern. Maybe this is an anti-pattern,
> but
> > > it
> > > > is
> > > > > super fast and highly cost efficient. We reworked ProducerRequest
> to
> > > > allow
> > > > > plugins. We added a custom http plugin that submits every request
> > via a
> > > > > persisted connection to a microservice. Microservice stores the
> > payload
> > > > and
> > > > > returns a tiny JSON metadata object, a claim check, that can be used
> > to
> > > > find
> > > > > the actual data. Think of it as zipping the payload. This claim
> check
> > > > > metadata traverses the pipelines with consumers using the urls in
> > > > metadata
> > > > > to pull what they need. Think unzipping. This allowed us to also
> pull
> > > > ONLY
> > > > > the data that we need in graphql like manner. So if you have a 100K
> > > json
> > > > > payload and you need only a subsection, you can pull that by
> > jmespath.
> > > > When
> > > > > you have multiple consumer groups yanking down huge payloads it is
> > > > > cumbersome on the broker. When you have the same consumer groups
> > > yanking
> > > > > down a claim check, and then going out of band directly to the
> source
> > > of
> > > > > truth, the broker has some breathing room. Obviously our
> microservice
> > > > does
> > > > > not go directly to the cloud storage, as that would be too slow. It
> > > > stores
> > > > > the payload in high speed memory cache and returns asap. That
> memory
> > is
> > > > > eventually persisted into S3. The retrieval goes against the cache
> > > > first,
> > > > > then against S3. Overall, a rather cheap and zippy solution. I
> > > tried
> > > > > proposing the KIP for this, but there was no excitement. Check this
> > > out:
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=318606528
> > > > >
> > > > >
> > > > > > On Nov 7, 2025, at 5:49 PM, Jun Rao <[email protected]>
> > > wrote:
> > > > > >
> > > > > > Hi, Andrew,
> > > > > >
> > > > > > If we want to focus only on reducing cross-zone replication
> costs,
> > > > there
> > > > > is
> > > > > > an alternative design in the KIP-1176 discussion thread that
> seems
> > > > > simpler
> > > > > > than the proposal here. I am copying the outline of that approach
> > > > below.
> > > > > >
> > > > > > 1. A new leader is elected.
> > > > > > 2. Leader maintains a first tiered offset, which is initialized
> to
> > > log
> > > > > end
> > > > > > offset.
> > > > > > 3. Leader writes produced data from the client to local log.
> > > > > > 4. Leader uploads produced data from all local logs as a combined
> > > > object
> > > > > > 5. Leader stores the metadata for the combined object in memory.
> > > > > > 6. If a follower fetch request has an offset >= first tiered
> > offset,
> > > > the
> > > > > > metadata for the corresponding combined object is returned.
> > > Otherwise,
> > > > > the
> > > > > > local data is returned.
> > > > > > 7. Leader periodically advances first tiered offset.
> > > > > >
> > > > > > It's still a bit unnatural, but it could work.
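> > > > > >
> > > > > > To make the outline concrete, here is a minimal sketch of the leader-side
> > > > > > bookkeeping it implies (Java; class and method names are made up, not from
> > > > > > either KIP):
> > > > > >
> > > > > > class TieredLeaderState {
> > > > > >     // One entry per uploaded combined object (step 4); this is the slice
> > > > > >     // of that object belonging to this partition.
> > > > > >     record CombinedObjectMetadata(String objectKey, long byteOffset, int byteLength) { }
> > > > > >
> > > > > >     // Step 2: initialized to the log end offset when the leader is elected.
> > > > > >     private volatile long firstTieredOffset;
> > > > > >     // Step 5: in-memory metadata, keyed by the first offset covered by the upload.
> > > > > >     private final java.util.concurrent.ConcurrentNavigableMap<Long, CombinedObjectMetadata>
> > > > > >         uploaded = new java.util.concurrent.ConcurrentSkipListMap<>();
> > > > > >
> > > > > >     TieredLeaderState(long logEndOffset) {
> > > > > >         this.firstTieredOffset = logEndOffset;
> > > > > >     }
> > > > > >
> > > > > >     // Steps 4-5: remember the combined object covering offsets from baseOffset on.
> > > > > >     void onUpload(long baseOffset, CombinedObjectMetadata meta) {
> > > > > >         uploaded.put(baseOffset, meta);
> > > > > >     }
> > > > > >
> > > > > >     // Step 6: a follower fetch at or above firstTieredOffset is answered with
> > > > > >     // object metadata; below it, the local log is served as usual.
> > > > > >     java.util.Optional<CombinedObjectMetadata> metadataForFetch(long fetchOffset) {
> > > > > >         if (fetchOffset < firstTieredOffset) {
> > > > > >             return java.util.Optional.empty();
> > > > > >         }
> > > > > >         var e = uploaded.floorEntry(fetchOffset);
> > > > > >         return java.util.Optional.ofNullable(e == null ? null : e.getValue());
> > > > > >     }
> > > > > >
> > > > > >     // Step 7: advanced periodically by the leader.
> > > > > >     void advanceFirstTieredOffset(long newOffset) {
> > > > > >         firstTieredOffset = Math.max(firstTieredOffset, newOffset);
> > > > > >     }
> > > > > > }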
> > > > > >
> > > > > > Hi, Ivan,
> > > > > >
> > > > > > Are you still committed to proceeding with the original design of
> > > > > KIP-1150?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Sun, Nov 2, 2025 at 6:00 AM Andrew Schofield <
> > > > > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > >> Hi,
> > > > > >> I’ve been following KIP-1150 and friends for a while. I’m going
> to
> > > > jump
> > > > > >> into the discussions too.
> > > > > >>
> > > > > >> Looking back at Jack Vanlightly’s message, I am not quite so
> > > convinced
> > > > > >> that it’s a kind of fork in the road. The primary aim of the
> > effort
> > > is
> > > > > to
> > > > > >> reduce cross-zone replication costs so Apache Kafka is not
> > > > prohibitively
> > > > > >> expensive to use on cloud storage. I think it would be entirely
> > > viable
> > > > > to
> > > > > >> prioritise code reuse for an initial implementation of diskless
> > > > topics,
> > > > > and
> > > > > >> we could still have a more cloud-native design in the future.
> It’s
> > > > hard
> > > > > to
> > > > > >> predict what the community will prioritise in the future.
> > > > > >>
> > > > > >> Of the three major revisions, I’m in the rev3 camp. We can
> support
> > > > > >> leaderless produce requests, first writing WAL segments into
> > object
> > > > > >> storage, and then using the regular partition leaders to
> sequence
> > > the
> > > > > >> records. The active log segment for a diskless topic will
> > initially
> > > > > contain
> > > > > >> batch coordinates rather than record batches. The batch
> > coordinates
> > > > can
> > > > > be
> > > > > >> resolved from WAL segments for consumers, and also in order to
> > > prepare
> > > > > log
> > > > > >> segments for uploading to tiered storage. Jun is probably
> correct
> > > that
> > > > > we
> > > > > >> need a more frequent object merging process than tiered storage
> > > > > provides.
> > > > > >> This is just the transition from write-optimised WAL segments to
> > > > > >> read-optimised tiered segments, and all of the object
> > storage-based
> > > > > >> implementations of Kafka that I’m aware of do this
> rearrangement.
> > > But
> > > > > >> perhaps this more frequent object merging is a pre-GA
> improvement,
> > > > > rather
> > > > > >> than a strict requirement for an initial implementation for
> early
> > > > access
> > > > > >> use.
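> > > > > >>
> > > > > >> As a sketch only (the field and method names are mine, not from any KIP), a
> > > > > >> batch-coordinates entry in the active segment of a diskless partition could
> > > > > >> be as small as:
> > > > > >>
> > > > > >> record BatchCoordinates(
> > > > > >>     long baseOffset,        // logical position in the partition, as today
> > > > > >>     int recordCount,
> > > > > >>     String walObjectKey,    // WAL segment in object storage holding the batch
> > > > > >>     long walByteOffset,     // where the batch starts inside that object
> > > > > >>     int walByteLength) {    // how many bytes to read when resolving
> > > > > >>
> > > > > >>     // Resolving the batch for a consumer, or for assembling a
> > > > > >>     // read-optimised tiered segment, is a ranged read of the WAL object.
> > > > > >>     String httpRangeHeader() {
> > > > > >>         return "bytes=" + walByteOffset + "-" + (walByteOffset + walByteLength - 1);
> > > > > >>     }
> > > > > >> }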
> > > > > >>
> > > > > >> For zone-aligned share consumers, the share group assignor is
> > > intended
> > > > > to
> > > > > >> be rack-aware. Consumers should be assigned to partitions with
> > > leaders
> > > > > in
> > > > > >> their zone. The simple assignor is not rack-aware, but it easily
> > > could
> > > > > be
> > > > > >> or we could have a rack-aware assignor.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Andrew
> > > > > >>
> > > > > >>
> > > > > >>> On 24 Oct 2025, at 23:14, Jun Rao <[email protected]>
> > > wrote:
> > > > > >>>
> > > > > >>> Hi, Ivan,
> > > > > >>>
> > > > > >>> Thanks for the reply.
> > > > > >>>
> > > > > >>> "As I understand, you’re speaking about locally materialized
> > > > segments.
> > > > > >> They
> > > > > >>> will indeed consume some IOPS. See them as a cache that could
> > > always
> > > > be
> > > > > >>> restored from the remote storage. While it’s not ideal, it's
> > still
> > > OK
> > > > > to
> > > > > >>> lose data in them due to a machine crash, for example. Because
> of
> > > > this,
> > > > > >> we
> > > > > >>> can avoid explicit flushing on local materialized segments at
> all
> > > and
> > > > > let
> > > > > >>> the file system and page cache figure out when to flush
> > optimally.
> > > > This
> > > > > >>> would not eliminate the extra IOPS, but should reduce it
> > > > dramatically,
> > > > > >>> depending on throughput for each partition. We, of course,
> > continue
> > > > > >>> flushing the metadata segments as before."
> > > > > >>>
> > > > > >>> If we have a mix of classic and diskless topics on the same
> > broker,
> > > > > it's
> > > > > >>> important that the classic topics' data is flushed to disk as
> > > quickly
> > > > > as
> > > > > >>> possible. To achieve this, users typically set
> > > dirty_expire_centisecs
> > > > > in
> > > > > >>> the kernel based on the number of available disk IOPS. Once you
> > set
> > > > > this
> > > > > >>> number, it applies to all dirty files, including the cached
> data
> > in
> > > > > >>> diskless topics. So, if there are more files actively
> > accumulating
> > > > > data,
> > > > > >>> the flush frequency drops and therefore the RPO gets worse for classic
> > > > > >>> topics.
> > > > > >>>
> > > > > >>> "We should have mentioned this explicitly, but this step, in
> > fact,
> > > > > >> remains
> > > > > >>> in the form of segments offloading to tiered storage. When we
> > > > assemble
> > > > > a
> > > > > >>> segment and hand it over to RemoteLogManager, we’re effectively
> > > doing
> > > > > >>> metadata compaction: replacing a big number of pieces of
> metadata
> > > > about
> > > > > >>> individual batches with a single record in
> > __remote_log_metadata."
> > > > > >>>
> > > > > >>> The object merging in tier storage typically only kicks in
> after
> > a
> > > > few
> > > > > >>> hours. The impact is (1) the amount of accumulated metadata is
> > > still
> > > > > >> quite
> > > > > >>> large; (2) there are many small objects, leading to poor read
> > > > > >> performance.
> > > > > >>> I think we need a more frequent object merging process than
> tier
> > > > > storage
> > > > > >>> provides.
> > > > > >>>
> > > > > >>> Jun
> > > > > >>>
> > > > > >>>
> > > > > >>> On Thu, Oct 23, 2025 at 10:12 AM Ivan Yurchenko <
> [email protected]>
> > > > > wrote:
> > > > > >>>
> > > > > >>>> Hello Jack, Jun, Luke, and all!
> > > > > >>>>
> > > > > >>>> Thank you for your messages.
> > > > > >>>>
> > > > > >>>> Let me first address some of Jun’s comments.
> > > > > >>>>
> > > > > >>>>> First, it degrades the durability.
> > > > > >>>>> For each partition, now there are two files being actively
> > > written
> > > > > at a
> > > > > >>>>> given point of time, one for the data and another for the
> > > metadata.
> > > > > >>>>> Flushing each file requires a separate IO. If the disk has 1K
> > > IOPS
> > > > > and
> > > > > >> we
> > > > > >>>>> have 5K partitions in a broker, currently we can afford to
> > flush
> > > > each
> > > > > >>>>> partition every 5 seconds, achieving an RPO of 5 seconds. If
> we
> > > > > double
> > > > > >>>> the
> > > > > >>>>> number of files per partition, we can only flush each
> partition
> > > > every
> > > > > >> 10
> > > > > >>>>> seconds, which makes RPO twice as bad.
> > > > > >>>>
> > > > > >>>> As I understand, you’re speaking about locally materialized
> > > > segments.
> > > > > >> They
> > > > > >>>> will indeed consume some IOPS. See them as a cache that could
> > > always
> > > > > be
> > > > > >>>> restored from the remote storage. While it’s not ideal, it's
> > still
> > > > OK
> > > > > to
> > > > > >>>> lose data in them due to a machine crash, for example. Because
> > of
> > > > > this,
> > > > > >> we
> > > > > >>>> can avoid explicit flushing on local materialized segments at
> > all
> > > > and
> > > > > >> let
> > > > > >>>> the file system and page cache figure out when to flush
> > optimally.
> > > > > This
> > > > > >>>> would not eliminate the extra IOPS, but should reduce it
> > > > dramatically,
> > > > > >>>> depending on throughput for each partition. We, of course,
> > > continue
> > > > > >>>> flushing the metadata segments as before.
> > > > > >>>>
> > > > > >>>> It’s worth making a note on caching. I think nobody will
> > disagree
> > > > that
> > > > > >>>> doing direct reads from remote storage every time a batch is
> > > > requested
> > > > > >> by a
> > > > > >>>> consumer will not be practical either from the performance
> or
> > > from
> > > > > the
> > > > > >>>> economy point of view. We need a way to keep the number of GET
> > > > > requests
> > > > > >>>> down. There are multiple options, for example:
> > > > > >>>> 1. Rack-aware distributed in-memory caching.
> > > > > >>>> 2. Local in-memory caching. Comes with less network chattiness
> > and
> > > > > >> works
> > > > > >>>> well if we have more or less stable brokers to consume from.
> > > > > >>>> 3. Materialization of diskless logs on local disk. Way lower
> > > impact
> > > > on
> > > > > >>>> RAM and also requires stable brokers for consumption (using
> just
> > > > > >> assigned
> > > > > >>>> replicas will probably work well).
> > > > > >>>>
> > > > > >>>> Materialization is one of possible options, but we can choose
> > > > another
> > > > > >> one.
> > > > > >>>> However, we will have this dilemma regardless of whether we
> have
> > > an
> > > > > >>>> explicit coordinator or we go “coordinator-less”.
> > > > > >>>>
> > > > > >>>>> Second, if we ever need this
> > > > > >>>>> metadata somewhere else, say in the WAL file manager, the
> > > consumer
> > > > > >> needs
> > > > > >>>> to
> > > > > >>>>> subscribe to every partition in the cluster, which is
> > > inefficient.
> > > > > The
> > > > > >>>>> actual benefit of this approach is also questionable. On the
> > > > surface,
> > > > > >> it
> > > > > >>>>> might seem that we could reduce the number of lines that need
> > to
> > > be
> > > > > >>>> changed
> > > > > >>>>> for this KIP. However, the changes are quite intrusive to the
> > > > classic
> > > > > >>>>> partition's code path and will probably make the code base
> > harder
> > > > to
> > > > > >>>>> maintain in the long run. I like the original approach based
> on
> > > the
> > > > > >> batch
> > > > > >>>>> coordinator much better than this one. We could probably
> > refactor
> > > > the
> > > > > >>>>> producer state code so that it could be reused in the batch
> > > > > >> coordinator.
> > > > > >>>>
> > > > > >>>> It’s hard to disagree with this. The explicit coordinator is
> > more
> > > a
> > > > > side
> > > > > >>>> thing, while coordinator-less approach is more about extending
> > > > > >>>> ReplicaManager, UnifiedLog and others substantially.
> > > > > >>>>
> > > > > >>>>> Thanks for addressing the concerns on the number of RPCs in
> the
> > > > > produce
> > > > > >>>>> path. I agree that with the metadata crafting mechanism, we
> > could
> > > > > >>>> mitigate
 
> > > > > >>>>> the RPC problem. However, since we now require the metadata
> to
> > be
> > > > > >>>>> collocated with the data on the same set of brokers, it's
> weird
> > > > that
> > > > > >> they
> > > > > >>>>> are now managed by different mechanisms. The data assignment
> > now
> > > > uses
> > > > > >> the
> > > > > >>>>> metadata crafting mechanism, but the metadata is stored in
> the
> > > > > classic
> > > > > >>>>> partition using its own assignment strategy. It will be
> > > complicated
> > > > > to
> > > > > >>>> keep
> > > > > >>>>> them collocated.
> > > > > >>>>
> > > > > >>>> I would like to note that the metadata crafting is needed only
> > to
> > > > tell
> > > > > >>>> producers which brokers they should send Produce requests to,
> > but
> > > > data
> > > > > >> (as
> > > > > >>>> in “locally materialized log”) is located on partition
> replicas,
> > > > i.e.
> > > > > >>>> automatically co-located with metadata.
> > > > > >>>>
> > > > > >>>> As a side note, it would probably be better that instead of
> > > > implicitly
> > > > > >>>> crafting partition metadata, we extend the metadata protocol
> so
> > > that
> > > > > for
> > > > > >>>> diskless partitions we return not only the leader and
> replicas,
> > > but
> > > > > also
> > > > > >>>> some “recommended produce brokers”, selected for optimal
> > > performance
> > > > > and
> > > > > >>>> costs. Producers will pick ones in their racks.
> > > > > >>>>
> > > > > >>>>> I am also concerned about the removal of the object
> > > > > compaction/merging
> > > > > >>>>> step.
> > > > > >>>>
> > > > > >>>> We should have mentioned this explicitly, but this step, in
> > fact,
> > > > > >> remains
> > > > > >>>> in the form of segments offloading to tiered storage. When we
> > > > > assemble a
> > > > > >>>> segment and hand it over to RemoteLogManager, we’re
> effectively
> > > > doing
> > > > > >>>> metadata compaction: replacing a big number of pieces of
> > metadata
> > > > > about
> > > > > >>>> individual batches with a single record in
> > __remote_log_metadata.
> > > > > >>>>
> > > > > >>>> We could create a Diskless-specific merging mechanism instead
> if
> > > > > needed.
> > > > > >>>> It’s rather easy with the explicit coordinator approach. With
> > the
> > > > > >>>> coordinator-less approach, this would probably be a bit more
> > > tricky
> > > > > >>>> (rewriting the tail of the log by the leader + replicating
> this
> > > > change
> > > > > >>>> reliably).
> > > > > >>>>
> > > > > >>>>> I see a tendency toward primarily optimizing for the fewest
> > code
> > > > > >> changes
> > > > > >>>> in
> > > > > >>>>> the KIP. Instead, our primary goal should be a clean design
> > that
> > > > can
> > > > > >> last
> > > > > >>>>> for the long term.
> > > > > >>>>
> > > > > >>>> Yes, totally agree.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Luke,
> > > > > >>>>> I'm wondering if the complexity of designing txn and queue is
> > > > because
> > > > > >> of
> > > > > >>>>> leaderless cluster, do you think it will be simpler if we
> only
> > > > focus
> > > > > on
> > > > > >>>> the
> > > > > >>>>> "diskless" design to handle object compaction/merging to/from
> > the
> > > > > >> remote
> > > > > >>>>> storage to save the cross-AZ cost?
> > > > > >>>>
> > > > > >>>> After some evolution of the original proposal, leaderless is
> now
> > > > > >> limited.
> > > > > >>>> We only need to be able to accept Produce requests on more
> than
> > > one
> > > > > >> broker
> > > > > >>>> to eliminate the cross-AZ costs for producers. Do I get it
> right
> > > > that
> > > > > >> you
> > > > > >>>> propose to get rid of this? Or do I misunderstand?
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Let’s now look at this problem from a higher level, as Jack
> > > > proposed.
> > > > > As
> > > > > >>>> it was said, the big choice we need to make is whether we 1)
> > > create
> > > > an
> > > > > >>>> explicit batch coordinator; or 2) go for the coordinator-less
> > > > > approach,
> > > > > >>>> where each diskless partition is managed by its leader as in
> > > classic
> > > > > >> topics.
> > > > > >>>>
> > > > > >>>> If we try to compare the two approaches:
> > > > > >>>>
> > > > > >>>> Pluggability:
> > > > > >>>> - Explicit coordinator: Possible. For example, some setups may
> > > > benefit
> > > > > >>>> from batch metadata being stored in a cloud database (such as
> > AWS
> > > > > >> DynamoDB
> > > > > >>>> or GCP Spanner).
> > > > > >>>> - Coordinator-less: Impossible.
> > > > > >>>>
> > > > > >>>> Scalability and fault tolerance:
> > > > > >>>> - Explicit coordinator: Depends on the implementation and it’s
> > > also
> > > > > >>>> necessary to actively work for it.
> > > > > >>>> - Coordinator-less: Closer to classic Kafka topics. Scaling is
> > > done
> > > > by
> > > > > >>>> partition placement, partitions could fail independently.
> > > > > >>>>
> > > > > >>>> Separation of concerns:
> > > > > >>>> - Explicit coordinator: Very good. Diskless remains more
> > > independent
> > > > > >> from
> > > > > >>>> classic topics in terms of code and workflows. For example,
> the
> > > > > >>>> above-mentioned non-tiered storage metadata compaction
> mechanism
> > > > could
> > > > > >> be
> > > > > >>>> relatively simply implemented with it. As a flip side of this,
> > > some
> > > > > >>>> workflows (e.g. transactions) will have to be adapted.
> > > > > >>>> - Coordinator-less: Less so. It leans to the opposite:
> bringing
> > > > > diskless
> > > > > >>>> closer to classic topics. Some code paths and workflows could
> be
> > > > more
> > > > > >>>> straightforwardly reused, but they will inevitably have to be
> > > > adapted
> > > > > to
> > > > > >>>> accommodate both topic types as also discussed.
> > > > > >>>>
> > > > > >>>> Cloud-nativeness. This is a vague concept, also related to the
> > > > > previous,
> > > > > >>>> but let’s try:
> > > > > >>>> - Explicit coordinator: Storing and processing metadata
> > separately
> > > > > makes
> > > > > >>>> it easier for brokers to take different roles, be purely
> > stateless
> > > > if
> > > > > >>>> needed, etc.
> > > > > >>>> - Coordinator-less: Less so. Something could be achieved with
> > > > creative
> > > > > >>>> partition placement, but not much.
> > > > > >>>>
> > > > > >>>> Both seem to have their pros and cons. However, answering
> Jack’s
> > > > > >> question,
> > > > > >>>> the explicit coordinator approach may indeed lead to a more
> > > flexible
> > > > > >> design.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> The purpose of this deviation in the discussion was to
> receive a
> > > > > >>>> preliminary community evaluation of the coordinator-less
> > approach
> > > > > >> without
> > > > > >>>> taking on the task of writing a separate KIP and fitting it in
> > the
> > > > > >> system
> > > > > >>>> of KIP-1150 and its children. We’re open to stopping it and
> > > getting
> > > > > >> back to
> > > > > >>>> working out the coordinator design if the community doesn’t
> > favor
> > > > the
> > > > > >>>> proposed approach.
> > > > > >>>>
> > > > > >>>> Best,
> > > > > >>>> Ivan and Diskless team
> > > > > >>>>
> > > > > >>>> On Mon, Oct 20, 2025, at 05:58, Luke Chen wrote:
> > > > > >>>>> Hi Ivan,
> > > > > >>>>>
> > > > > >>>>> As Jun pointed out, the updated design seems to have some
> > > > > shortcomings
> > > > > >>>>> although it simplifies the implementation.
> > > > > >>>>>
> > > > > >>>>> I'm wondering if the complexity of designing txn and queue is
> > > > because
> > > > > >> of
> > > > > >>>>> leaderless cluster, do you think it will be simpler if we
> only
> > > > focus
> > > > > on
> > > > > >>>> the
> > > > > >>>>> "diskless" design to handle object compaction/merging to/from
> > the
> > > > > >> remote
> > > > > >>>>> storage to save the cross-AZ cost?
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> Thank you,
> > > > > >>>>> Luke
> > > > > >>>>>
> > > > > >>>>> On Sat, Oct 18, 2025 at 5:22 AM Jun Rao
> > <[email protected]
> > > >
> > > > > >>>> wrote:
> > > > > >>>>>
> > > > > >>>>>> Hi, Ivan,
> > > > > >>>>>>
> > > > > >>>>>> Thanks for the explanation.
> > > > > >>>>>>
> > > > > >>>>>> "we write the reference to the WAL file with the batch data"
> > > > > >>>>>>
> > > > > >>>>>> I understand the approach now, but I think it is a hacky
> one.
> > > > There
> > > > > >> are
> > > > > >>>>>> multiple shortcomings with this design. First, it degrades
> > the
> > > > > >>>> durability.
> > > > > >>>>>> For each partition, now there are two files being actively
> > > written
> > > > > at
> > > > > >> a
> > > > > >>>>>> given point of time, one for the data and another for the
> > > > metadata.
> > > > > >>>>>> Flushing each file requires a separate IO. If the disk has
> 1K
> > > IOPS
> > > > > and
> > > > > >>>> we
> > > > > >>>>>> have 5K partitions in a broker, currently we can afford to
> > flush
> > > > > each
> > > > > >>>>>> partition every 5 seconds, achieving an RPO of 5 seconds. If
> > we
> > > > > double
> > > > > >>>> the
> > > > > >>>>>> number of files per partition, we can only flush each
> > partition
> > > > > every
> > > > > >>>> 10
> > > > > >>>>>> seconds, which makes RPO twice as bad. Second, if we ever
> need
> > > > this
> > > > > >>>>>> metadata somewhere else, say in the WAL file manager, the
> > > consumer
> > > > > >>>> needs to
> > > > > >>>>>> subscribe to every partition in the cluster, which is
> > > inefficient.
> > > > > The
> > > > > >>>>>> actual benefit of this approach is also questionable. On the
> > > > > surface,
> > > > > >>>> it
> > > > > >>>>>> might seem that we could reduce the number of lines that
> need
> > to
> > > > be
> > > > > >>>> changed
> > > > > >>>>>> for this KIP. However, the changes are quite intrusive to
> the
> > > > > classic
> > > > > >>>>>> partition's code path and will probably make the code base
> > > harder
> > > > to
> > > > > >>>>>> maintain in the long run. I like the original approach based
> > on
> > > > the
> > > > > >>>> batch
> > > > > >>>>>> coordinator much better than this one. We could probably
> > > refactor
> > > > > the
> > > > > >>>>>> producer state code so that it could be reused in the batch
> > > > > >>>> coordinator.
> > > > > >>>>>>
> > > > > >>>>>> Thanks for addressing the concerns on the number of RPCs in
> > the
> > > > > >> produce
> > > > > >>>>>> path. I agree that with the metadata crafting mechanism, we
> > > could
> > > > > >>>> mitigate
> > > > > >>>>>> the RPC problem. However, since we now require the metadata
> to
> > > be
> > > > > >>>>>> collocated with the data on the same set of brokers, it's
> > weird
> > > > that
> > > > > >>>> they
> > > > > >>>>>> are now managed by different mechanisms. The data assignment
> > now
> > > > > uses
> > > > > >>>> the
> > > > > >>>>>> metadata crafting mechanism, but the metadata is stored in
> the
> > > > > classic
> > > > > >>>>>> partition using its own assignment strategy. It will be
> > > > complicated
> > > > > to
> > > > > >>>> keep
> > > > > >>>>>> them collocated.
> > > > > >>>>>>
> > > > > >>>>>> I am also concerned about the removal of the object
> > > > > compaction/merging
> > > > > >>>>>> step. My first concern is on the amount of metadata that
> need
> > to
> > > > be
> > > > > >>>> kept.
> > > > > >>>>>> Without object compaction, the metadata generated in the
> > produce
> > > > > path
> > > > > >>>> can
> > > > > >>>>>> only be deleted after remote tiering kicks in. Let's say for
> > > every
> > > > > >>>> 250ms we
> > > > > >>>>>> produce 100 bytes of metadata per partition. Let's say
> remote
> > > > > tiering
> > > > > >>>>>> kicks in after 5 hours. In a cluster with 100K partitions,
> we
> > > need
> > > > > to
> > > > > >>>> keep
> > > > > >>>>>> about 100 * (1 / 0.25) * 5 * 3600 * 100K = 720 GB metadata,
> > > quite
> > > > > >>>>>> significant. A second concern is on performance. Every time
> we
> > > need
> > > > > to
> > > > > >>>>>> rebuild the caching data, we need to read a bunch of small
> > > objects
> > > > > >>>> from S3,
> > > > > >>>>>> slowing down the building process. If a consumer happens to
> > need
> > > > > such
> > > > > >>>> data,
> > > > > >>>>>> it could slow down the application.
> > > > > >>>>>>
> > > > > >>>>>> I see a tendency toward primarily optimizing for the fewest
> > code
> > > > > >>>> changes in
> > > > > >>>>>> the KIP. Instead, our primary goal should be a clean design
> > that
> > > > can
> > > > > >>>> last
> > > > > >>>>>> for the long term.
> > > > > >>>>>>
> > > > > >>>>>> Thanks,
> > > > > >>>>>>
> > > > > >>>>>> Jun
> > > > > >>>>>>
> > > > > >>>>>> On Tue, Oct 14, 2025 at 11:02 AM Ivan Yurchenko <
> > [email protected]
> > > >
> > > > > >>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Hi Jun,
> > > > > >>>>>>>
> > > > > >>>>>>> Thank you for your message. I’m sorry that I failed to
> > clearly
> > > > > >>>> explain
> > > > > >>>>>> the
> > > > > >>>>>>> idea. Let me try to fix this.
> > > > > >>>>>>>
> > > > > >>>>>>>> Does each partition now have a metadata partition and a
> > > separate
> > > > > >>>> data
> > > > > >>>>>>>> partition? If so, I am concerned that it essentially
> doubles
> > > the
> > > > > >>>> number
> > > > > >>>>>>> of
> > > > > >>>>>>>> partitions, which impacts the number of open file
> > descriptors
> > > > and
> > > > > >>>> the
> > > > > >>>>>>>> required IOPS, and so on. It also seems wasteful to have a
> > > > > separate
> > > > > >>>>>>>> partition just to store the metadata. It's as if we are
> > > creating
> > > > > an
> > > > > >>>>>>>> internal topic with an unbounded number of partitions.
> > > > > >>>>>>>
> > > > > >>>>>>> No. There will be only one physical partition per diskless
> > > > > >>>> partition. Let
> > > > > >>>>>>> me explain this with an example. Let’s say we have a
> diskless
> > > > > >>>> partition
> > > > > >>>>>>> topic-0. It has three replicas 0, 1, 2; 0 is the leader. We
> > > > produce
> > > > > >>>> some
> > > > > >>>>>>> batches to this partition. The content of the segment file
> > will
> > > > be
> > > > > >>>>>>> something like this (for each batch):
> > > > > >>>>>>>
> > > > > >>>>>>> BaseOffset: 00000000000000000000 (like in classic)
> > > > > >>>>>>> Length: 123456 (like in classic)
> > > > > >>>>>>> PartitionLeaderEpoch: like in classic
> > > > > >>>>>>> Magic: like in classic
> > > > > >>>>>>> CRC: like in classic
> > > > > >>>>>>> Attributes: like in classic
> > > > > >>>>>>> LastOffsetDelta: like in classic
> > > > > >>>>>>> BaseTimestamp: like in classic
> > > > > >>>>>>> MaxTimestamp: like in classic
> > > > > >>>>>>> ProducerId: like in classic
> > > > > >>>>>>> ProducerEpoch: like in classic
> > > > > >>>>>>> BaseSequence: like in classic
> > > > > >>>>>>> RecordsCount: like in classic
> > > > > >>>>>>> Records:
> > > > > >>>>>>> path/to/wal/files/5b55c4bb-f52a-4204-aea6-81226895158a;
> byte
> > > > offset
> > > > > >>>>>>> 123456
> > > > > >>>>>>>
> > > > > >>>>>>> It looks very much like classic log entries. The only
> > > difference
> > > > is
> > > > > >>>> that
> > > > > >>>>>>> instead of writing real Records, we write the reference to
> > the
> > > > WAL
> > > > > >>>> file
> > > > > >>>>>>> with the batch data (I guess we need only the name and the
> > byte
> > > > > >>>> offset,
> > > > > >>>>>>> because the byte length is the standard field above).
> > > Otherwise,
> > > > > >>>> it’s a
> > > > > >>>>>>> normal Kafka log with the leader and replicas.
> > > > > >>>>>>>
> > > > > >>>>>>> So we have as many partitions for diskless as for classic.
> As
> > > of
> > > > > open
> > > > > >>>>>> file
> > > > > >>>>>>> descriptors, let’s proceed to the following:
> > > > > >>>>>>>
> > > > > >>>>>>>> Are the metadata and
> > > > > >>>>>>>> the data for the same partition always collocated on the
> > same
> > > > > >>>> broker?
> > > > > >>>>>> If
> > > > > >>>>>>>> so, how do we enforce that when replicas are reassigned?
> > > > > >>>>>>>
> > > > > >>>>>>> The source of truth for the data is still in WAL files on
> > > object
> > > > > >>>> storage.
> > > > > >>>>>>> The source of truth for the metadata is in segment files on
> > the
> > > > > >>>> brokers
> > > > > >>>>>> in
> > > > > >>>>>>> the replica set. Two new mechanisms are planned, both
> > > independent
> > > > > of
> > > > > >>>> this
> > > > > >>>>>>> new proposal, but I want to present them to give the idea
> > that
> > > > > only a
> > > > > >>>>>>> limited amount of data files will be operated locally:
> > > > > >>>>>>> - We want to assemble batches into segment files and
> offload
> > > them
> > > > > to
> > > > > >>>>>>> tiered storage in order to prevent the unbounded growth of
> > > batch
> > > > > >>>>>> metadata.
> > > > > >>>>>>> For this, we need to open only  a few file descriptors (for
> > the
> > > > > >>>> segment
> > > > > >>>>>>> file itself + the necessary indexes) before the segment is
> > > fully
> > > > > >>>> written
> > > > > >>>>>>> and handed over to RemoteLogManager.
> > > > > >>>>>>> - We want to assemble local segment files for caching
> > purposes
> > > as
> > > > > >>>> well,
> > > > > >>>>>>> i.e. to speed up fetching. This will not materialize the
> full
> > > > > >>>> content of
> > > > > >>>>>>> the log, but only the hot set according to some policy (or
> > > > > >>>> configurable
> > > > > >>>>>>> policies), i.e. the number of segments and file descriptors
> > > will
> > > > > >>>> also be
> > > > > >>>>>>> limited.
> > > > > >>>>>>>
> > > > > >>>>>>>> The number of RPCs in the produce path is significantly
> > > higher.
> > > > > For
> > > > > >>>>>>>> example, if a produce request has 100 partitions, in a
> > cluster
> > > > > >>>> with 100
> > > > > >>>>>>>> brokers, each produce request could generate 100 more RPC
> > > > > requests.
> > > > > >>>>>> This
> > > > > >>>>>>>> will significantly increase the request rate.
> > > > > >>>>>>>
> > > > > >>>>>>> This is a valid concern that we considered, but this issue
> > can
> > > be
> > > > > >>>>>>> mitigated. I’ll try to explain the approach.
> > > > > >>>>>>> The situation with a single broker is trivial: all the
> commit
> > > > > >>>> requests go
> > > > > >>>>>>> from the broker to itself.
> > > > > >>>>>>> Let’s scale this to a multi-broker cluster, but located in
> > the
> > > > > single
> > > > > >>>>>> rack
> > > > > >>>>>>> (AZ). Any broker can accept Produce requests for diskless
> > > > > >>>> partitions, but
> > > > > >>>>>>> we can tell producers (through metadata) to always send
> > Produce
> > > > > >>>> requests
> > > > > >>>>>> to
> > > > > >>>>>>> leaders. For example, broker 0 hosts the leader replicas
> for
> > > > > diskless
> > > > > >>>>>>> partitions t1-0, t2-1, t3-0. It will receive diskless
> Produce
> > > > > >>>> requests
> > > > > >>>>>> for
> > > > > >>>>>>> these partitions in various combinations, but only for
> them.
> > > > > >>>>>>>
> > > > > >>>>>>>                   Broker 0
> > > > > >>>>>>>             +-----------------+
> > > > > >>>>>>>             |    t1-0         |
> > > > > >>>>>>>             |    t2-1 <--------------------+
> > > > > >>>>>>>             |    t3-0         |            |
> > > > > >>>>>>> produce      | +-------------+ |            |
> > > > > >>>>>>> requests     | |  diskless   | |            |
> > > > > >>>>>>> --------------->|   produce   +--------------+
> > > > > >>>>>>> for these    | | WAL buffer  | |    commit requests
> > > > > >>>>>>> partitions   | +-------------+ |    for these partitions
> > > > > >>>>>>>             |                 |
> > > > > >>>>>>>             +-----------------+
> > > > > >>>>>>>
> > > > > >>>>>>> The same applies for other brokers in this cluster.
> > > Effectively,
> > > > > each
> > > > > >>>>>>> broker will commit only to itself, which effectively means
> 1
> > > > commit
> > > > > >>>>>> request
> > > > > >>>>>>> per WAL buffer (this may be 0 physical network calls, if we
> > > wish,
> > > > > >>>> just a
> > > > > >>>>>>> local function call).
> > > > > >>>>>>>
> > > > > >>>>>>> Now let’s scale this to multiple racks (AZs). Obviously, we
> > > > cannot
> > > > > >>>> always
> > > > > >>>>>>> send Produce requests to the designated leaders of diskless
> > > > > >>>> partitions:
> > > > > >>>>>>> this would mean inter-AZ network traffic, which we would
> like
> > > to
> > > > > >>>> avoid.
> > > > > >>>>>> To
> > > > > >>>>>>> avoid it, we say that every broker has a “diskless produce
> > > > > >>>>>> representative”
> > > > > >>>>>>> in every AZ. If we continue our example: when a Produce
> > request
> > > > for
> > > > > >>>> t1-0,
> > > > > >>>>>>> t2-1, or t3-0 comes from a producer in AZ 0, it lands on
> > > broker 0
> > > > > >>>> (in the
> > > > > >>>>>>> broker’s AZ the representative is the broker itself).
> > However,
> > > if
> > > > > it
> > > > > >>>>>> comes
> > > > > >>>>>>> from AZ 1, it lands on broker 1; in AZ 2, it’s broker 2.
> > > > > >>>>>>>
> > > > > >>>>>>> |produce requests         |produce requests        |produce
> > > > > >>>> requests
> > > > > >>>>>>> |for t1-0, t2-1, t3-0     |for t1-0, t2-1, t3-0    |for
> t1-0,
> > > > t2-1,
> > > > > >>>>>> t3-0
> > > > > >>>>>>> |from AZ 0                |from AZ 1               |from
> AZ 2
> > > > > >>>>>>> v                         v                        v
> > > > > >>>>>>> Broker 0 (AZ 0)        Broker 1 (AZ 1)        Broker 2 (AZ
> 2)
> > > > > >>>>>>> +---------------+      +---------------+
> > +---------------+
> > > > > >>>>>>> |     t1-0      |      |               |      |
> >  |
> > > > > >>>>>>> |     t2-1      |      |               |      |
> >  |
> > > > > >>>>>>> |     t3-0      |      |               |      |
> >  |
> > > > > >>>>>>> +---------------+      +--------+------+
> > +--------+------+
> > > > > >>>>>>>    ^     ^                    |                      |
> > > > > >>>>>>>    |     +--------------------+                      |
> > > > > >>>>>>>    |     commit requests for these partitions        |
> > > > > >>>>>>>    |                                                 |
> > > > > >>>>>>>    +-------------------------------------------------+
> > > > > >>>>>>>          commit requests for these partitions
> > > > > >>>>>>>
> > > > > >>>>>>> All the partitions that broker 0 is the leader of will be
> > > > > >>>> “represented”
> > > > > >>>>>> by
> > > > > >>>>>>> brokers 1 and 2 in their AZs.
> > > > > >>>>>>>
> > > > > >>>>>>> Of course, this relationship goes both ways between AZs
> (not
> > > > > >>>> necessarily
> > > > > >>>>>>> between the same brokers). It means that provided the
> cluster
> > > is
> > > > > >>>> balanced
> > > > > >>>>>>> by the number of brokers per AZ, each broker will represent
> > > > > >>>>>> (number_of_azs
> > > > > >>>>>>> - 1) other brokers. This will result in the situation that
> > for
> > > > the
> > > > > >>>>>> majority
> > > > > >>>>>>> of commits, each broker will do up to (number_of_azs - 1)
> > > network
> > > > > >>>> commit
> > > > > >>>>>>> requests (plus one local). Cloud regions tend to have 3
> AZs,
> > > very
> > > > > >>>> rarely
> > > > > >>>>>>> more. That means, brokers will be doing up to 2 network
> > commit
> > > > > >>>> requests
> > > > > >>>>>> per
> > > > > >>>>>>> WAL file.
> > > > > >>>>>>>
> > > > > >>>>>>> There are the following exceptions:
> > > > > >>>>>>> 1. Broker count imbalance between AZs. For example, when we
> > > have
> > > > 2
> > > > > >>>> AZs
> > > > > >>>>>> and
> > > > > >>>>>>> one has three brokers and another AZ has one. This one
> broker
> > > > will
> > > > > do
> > > > > >>>>>>> between 1 and 3 commit requests per WAL file. This is not
> an
> > > > > extreme
> > > > > >>>>>>> amplification. Such an imbalance is not healthy in most
> > > practical
> > > > > >>>> setups
> > > > > >>>>>>> and should be avoided anyway.
> > > > > >>>>>>> 2. Leadership changes and metadata propagation period. When
> > the
> > > > > >>>> partition
> > > > > >>>>>>> t3-0 is relocated from broker 0 to some broker 3, the
> > producers
> > > > > will
> > > > > >>>> not
> > > > > >>>>>>> know this immediately (unless we want to be strict and
> > respond
> > > > with
> > > > > >>>>>>> NOT_LEADER_OR_FOLLOWER). So if t1-0, t2-1, and t3-0 will
> come
> > > > > >>>> together
> > > > > >>>>>> in a
> > > > > >>>>>>> WAL buffer on broker 2, it will have to send two commit
> > > requests:
> > > > > to
> > > > > >>>>>> broker
> > > > > >>>>>>> 0 to commit t1-0 and t2-1, and to broker 3 to commit t3-0.
> > This
> > > > > >>>> situation
> > > > > >>>>>>> is not permanent and as producers update the cluster
> > metadata,
> > > it
> > > > > >>>> will be
> > > > > >>>>>>> resolved.
> > > > > >>>>>>>
> > > > > >>>>>>> This all could be built with the metadata crafting
> mechanism
> > > only
> > > > > >>>> (which
> > > > > >>>>>>> is anyway needed for Diskless in one way or another to
> direct
> > > > > >>>> producers
> > > > > >>>>>> and
> > > > > >>>>>>> consumers where we need to avoid inter-AZ traffic), just
> with
> > > the
> > > > > >>>> right
> > > > > >>>>>>> policy for it (for example, some deterministic hash-based
> > > > formula).
> > > > > >>>> I.e.
> > > > > >>>>>> no
> > > > > >>>>>>> explicit support for “produce representative” or anything
> > like
> > > > this
> > > > > >>>> is
> > > > > >>>>>>> needed on the cluster level, in KRaft, etc.
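> > > > > >>>>>>>
> > > > > >>>>>>> To illustrate the kind of deterministic formula meant here (purely a
> > > > > >>>>>>> sketch with made-up names; java.util imports omitted), every broker could
> > > > > >>>>>>> derive the same representative for a producer's rack without any
> > > > > >>>>>>> coordination:
> > > > > >>>>>>>
> > > > > >>>>>>> static int produceRepresentative(String topic, int partition, int leaderId,
> > > > > >>>>>>>                                  String producerRack,
> > > > > >>>>>>>                                  Map<String, List<Integer>> brokersByRack,
> > > > > >>>>>>>                                  Map<Integer, String> rackOfBroker) {
> > > > > >>>>>>>     // Same AZ as the leader (or rack unknown): keep pointing at the leader.
> > > > > >>>>>>>     if (producerRack == null || producerRack.equals(rackOfBroker.get(leaderId))) {
> > > > > >>>>>>>         return leaderId;
> > > > > >>>>>>>     }
> > > > > >>>>>>>     List<Integer> candidates = brokersByRack.getOrDefault(producerRack, List.of());
> > > > > >>>>>>>     if (candidates.isEmpty()) {
> > > > > >>>>>>>         return leaderId;  // no broker in the producer's AZ: fall back to the leader
> > > > > >>>>>>>     }
> > > > > >>>>>>>     // Deterministic hash: every broker crafts the same metadata for a given
> > > > > >>>>>>>     // (topic, partition, rack), so producers in one AZ converge on one broker.
> > > > > >>>>>>>     int h = Objects.hash(topic, partition, producerRack);
> > > > > >>>>>>>     return candidates.get(Math.floorMod(h, candidates.size()));
> > > > > >>>>>>> }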
> > > > > >>>>>>>
> > > > > >>>>>>>> The same WAL file metadata is now duplicated into two
> > places,
> > > > > >>>> partition
> > > > > >>>>>>>> leader and WAL File Manager. Which one is the source of
> > truth,
> > > > and
> > > > > >>>> how
> > > > > >>>>>> do
> > > > > >>>>>>>> we maintain consistency between the two places?
> > > > > >>>>>>>
> > > > > >>>>>>> We do only two operations on WAL files that span multiple
> > > > diskless
> > > > > >>>>>>> partitions: committing and deleting. Commits can be done
> > > > > >>>> independently as
> > > > > >>>>>>> described above. But deletes are different, because when a
> > file
> > > > is
> > > > > >>>>>> deleted,
> > > > > >>>>>>> this affects all the partitions that still have alive
> batches
> > > in
> > > > > this
> > > > > >>>>>> file
> > > > > >>>>>>> (if any).
> > > > > >>>>>>>
> > > > > >>>>>>> The WAL file manager is a necessary point of coordination
> to
> > > > delete
> > > > > >>>> WAL
> > > > > >>>>>>> files safely. We can say it is the source of truth about
> > files
> > > > > >>>>>> themselves,
> > > > > >>>>>>> while the partition leaders and their logs hold the truth
> > about
> > > > > >>>> whether a
> > > > > >>>>>>> particular file contains live batches of this particular
> > > > partition.
> > > > > >>>>>>>
> > > > > >>>>>>> The file manager will do this important task: be able to
> say
> > > for
> > > > > sure
> > > > > >>>>>> that
> > > > > >>>>>>> a file does not contain any live batch of any existing
> > > partition.
> > > > > For
> > > > > >>>>>> this,
> > > > > >>>>>>> it will have to periodically check against the partition
> > > leaders.
> > > > > >>>>>>> Considering that batch deletion is irreversible, when we
> > > declare
> > > > a
> > > > > >>>> file
> > > > > >>>>>>> “empty”, this is guaranteed to be and stay so.
> > > > > >>>>>>>
> > > > > >>>>>>> The file manager has to know about files being committed to
> > > start
> > > > > >>>> track
> > > > > >>>>>>> them and periodically check if they are empty. We can
> > consider
> > > > > >>>> various
> > > > > >>>>>> ways
> > > > > >>>>>>> to achieve this:
> > > > > >>>>>>> 1. As was proposed in my previous message: best effort
> commit
> > > by
> > > > > >>>> brokers
> > > > > >>>>>> +
> > > > > >>>>>>> periodic prefix scans of object storage to detect files
> that
> > > went
> > > > > >>>> below
> > > > > >>>>>> the
> > > > > >>>>>>> radar due to network issue or the file manager temporary
> > > > > >>>> unavailability.
> > > > > >>>>>>> We’re speaking about listing the file names only and
> opening
> > > only
> > > > > >>>>>>> previously unknown files in order to find the partitions
> > > involved
> > > > > >>>> with
> > > > > >>>>>> them.
> > > > > >>>>>>> 2. Only do scans without explicit commit, i.e. fill the
> list
> > of
> > > > > files
> > > > > >>>>>>> fully asynchronously and in the background. This may not be
> > > ideal
> > > > > >>>> due to
> > > > > >>>>>>> costs and performance of scanning tons of files. However,
> the
> > > > > number
> > > > > >>>> of
> > > > > >>>>>>> live WAL files should be limited due to tiered storage
> > > > offloading +
> > > > > >>>> we
> > > > > >>>>>> can
> > > > > >>>>>>> optimize this if we give files some global soft order in
> > their
> > > > > names.
> > > > > >>>>>>>
> > > > > >>>>>>>> I am not sure how this design simplifies the
> implementation.
> > > The
> > > > > >>>>>> existing
> > > > > >>>>>>>> producer/replication code can't be simply reused.
> Adjusting
> > > both
> > > > > >>>> the
> > > > > >>>>>>> write
> > > > > >>>>>>>> path in the leader and the replication path in the
> follower
> > to
> > > > > >>>>>> understand
> > > > > >>>>>>>> batch-header only data is quite intrusive to the existing
> > > logic.
> > > > > >>>>>>>
> > > > > >>>>>>> It is true that we’ll have to change LocalLog and
> UnifiedLog
> > in
> > > > > >>>> order to
> > > > > >>>>>>> support these changes. However, it seems that idempotence,
> > > > > >>>> transactions,
> > > > > >>>>>>> queues, tiered storage will have to be changed less than
> with
> > > the
> > > > > >>>>>> original
> > > > > >>>>>>> design. This is because the partition leader state would
> > remain
> > > > in
> > > > > >>>> the
> > > > > >>>>>> same
> > > > > >>>>>>> place (on brokers) and existing workflows that involve it
> > would
> > > > > have
> > > > > >>>> to
> > > > > >>>>>> be
> > > > > >>>>>>> changed less compared to the situation where we globalize
> the
> > > > > >>>> partition
> > > > > >>>>>>> leader state in the batch coordinator. I admit this is hard
> > to
> > > > make
> > > > > >>>>>>> convincing without both real implementations to hand :)
> > > > > >>>>>>>
> > > > > >>>>>>>> I am also
> > > > > >>>>>>>> not sure how this enables seamless switching the topic
> modes
> > > > > >>>> between
> > > > > >>>>>>>> diskless and classic. Could you provide more details on
> > those?
> > > > > >>>>>>>
> > > > > >>>>>>> Let’s consider the scenario of turning a classic topic into
> > > > > >>>> diskless. The
> > > > > >>>>>>> user sets diskless.enabled=true, the leader receives this
> > > > metadata
> > > > > >>>> update
> > > > > >>>>>>> and does the following:
> > > > > >>>>>>> 1. Stop accepting normal append writes.
> > > > > >>>>>>> 2. Close the current active segment.
> > > > > >>>>>>> 3. Start a new segment that will be written in the diskless
> > > > format
> > > > > >>>> (i.e.
> > > > > >>>>>>> without data).
> > > > > >>>>>>> 4. Start accepting diskless commits.
> > > > > >>>>>>>
> > > > > >>>>>>> Since it’s the same log, the followers will know about that
> > > > switch
> > > > > >>>>>>> consistently. They will finish replicating the classic
> > segments
> > > > and
> > > > > >>>> start
> > > > > >>>>>>> replicating the diskless ones. They will always know where
> > each
> > > > > >>>> batch is
> > > > > >>>>>>> located (either inside a classic segment or referenced by a
> > > > > diskless
> > > > > >>>>>> one).
> > > > > >>>>>>> Switching back should be similar.
> > > > > >>>>>>>
> > > > > >>>>>>> Doing this with the coordinator is possible, but has some
> > > > caveats.
> > > > > >>>> The
> > > > > >>>>>>> leader must do the following:
> > > > > >>>>>>> 1. Stop accepting normal append writes.
> > > > > >>>>>>> 2. Close the current active segment.
> > > > > >>>>>>> 3. Write a special control segment to persist and replicate
> > the
> > > > > fact
> > > > > >>>> that
> > > > > >>>>>>> from offset N the partition is now in the diskless mode.
> > > > > >>>>>>> 4. Inform the coordinator about the first offset N of the
> > > > “diskless
> > > > > >>>> era”.
> > > > > >>>>>>> 5. Inform the controller quorum that the transition has
> > > finished
> > > > > and
> > > > > >>>> that
> > > > > >>>>>>> brokers now can process diskless writes for this partition.
> > > > > >>>>>>> This could fail at some points, so this will probably
> require
> > > > some
> > > > > >>>>>>> explicit state machine with replication either in the
> > partition
> > > > log
> > > > > >>>> or in
> > > > > >>>>>>> KRaft.
> > > > > >>>>>>>
> > > > > >>>>>>> It seems that the coordinator-less approach makes this
> > simpler
> > > > > >>>> because
> > > > > >>>>>> the
> > > > > >>>>>>> “coordinator” for the partition and the partition leader
> are
> > > the
> > > > > >>>> same and
> > > > > >>>>>>> they store the partition metadata in the same log, too.
> While
> > > in
> > > > > the
> > > > > >>>>>>> coordinator approach we have to perform some kind of a
> > > > distributed
> > > > > >>>> commit
> > > > > >>>>>>> to handover metadata management from the classic partition
> > > leader
> > > > > to
> > > > > >>>> the
> > > > > >>>>>>> batch coordinator.
> > > > > >>>>>>>
> > > > > >>>>>>> I hope these explanations help to clarify the idea. Please
> > let
> > > me
> > > > > >>>> know if
> > > > > >>>>>>> I should go deeper anywhere.
> > > > > >>>>>>>
> > > > > >>>>>>> Best,
> > > > > >>>>>>> Ivan and the Diskless team
> > > > > >>>>>>>
> > > > > >>>>>>> On Tue, Oct 7, 2025, at 01:44, Jun Rao wrote:
> > > > > >>>>>>>> Hi, Ivan,
> > > > > >>>>>>>>
> > > > > >>>>>>>> Thanks for the update.
> > > > > >>>>>>>>
> > > > > >>>>>>>> I am not sure that I fully understand the new design, but
> it
> > > > seems
> > > > > >>>> less
> > > > > >>>>>>>> clean than before.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Does each partition now have a metadata partition and a
> > > separate
> > > > > >>>> data
> > > > > >>>>>>>> partition? If so, I am concerned that it essentially
> doubles
> > > the
> > > > > >>>> number
> > > > > >>>>>>> of
> > > > > >>>>>>>> partitions, which impacts the number of open file
> > descriptors
> > > > and
> > > > > >>>> the
> > > > > >>>>>>>> required IOPS, and so on. It also seems wasteful to have a
> > > > > separate
> > > > > >>>>>>>> partition just to store the metadata. It's as if we are
> > > creating
> > > > > an
> > > > > >>>>>>>> internal topic with an unbounded number of partitions. Are
> > the
> > > > > >>>> metadata
> > > > > >>>>>>> and
> > > > > >>>>>>>> the data for the same partition always collocated on the
> > same
> > > > > >>>> broker?
> > > > > >>>>>> If
> > > > > >>>>>>>> so, how do we enforce that when replicas are reassigned?
> > > > > >>>>>>>>
> > > > > >>>>>>>> The number of RPCs in the produce path is significantly
> > > higher.
> > > > > For
> > > > > >>>>>>>> example, if a produce request has 100 partitions, in a
> > cluster
> > > > > >>>> with 100
> > > > > >>>>>>>> brokers, each produce request could generate 100 more RPC
> > > > > requests.
> > > > > >>>>>> This
> > > > > >>>>>>>> will significantly increase the request rate.
> > > > > >>>>>>>>
> > > > > >>>>>>>> The same WAL file metadata is now duplicated into two
> > places,
> > > > > >>>> partition
> > > > > >>>>>>>> leader and WAL File Manager. Which one is the source of
> > truth,
> > > > and
> > > > > >>>> how
> > > > > >>>>>> do
> > > > > >>>>>>>> we maintain consistency between the two places?
> > > > > >>>>>>>>
> > > > > >>>>>>>> I am not sure how this design simplifies the
> implementation.
> > > The
> > > > > >>>>>> existing
> > > > > >>>>>>>> producer/replication code can't be simply reused.
> Adjusting
> > > both
> > > > > >>>> the
> > > > > >>>>>>> write
> > > > > >>>>>>>> path in the leader and the replication path in the
> follower
> > to
> > > > > >>>>>> understand
> > > > > >>>>>>>> batch-header only data is quite intrusive to the existing
> > > > logic. I
> > > > > >>>> am
> > > > > >>>>>>> also
> > > > > >>>>>>>> not sure how this enables seamlessly switching topic modes
> > > > > >>>>>>>> between diskless and classic. Could you provide more details
> > > > > >>>>>>>> on those?
> > > > > >>>>>>>>
> > > > > >>>>>>>> Jun
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Thu, Oct 2, 2025 at 5:08 AM Ivan Yurchenko <
> > [email protected]
> > > >
> > > > > >>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Hi dear Kafka community,
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> In the initial Diskless proposal, we proposed to have a
> > > > separate
> > > > > >>>>>>>>> component, batch/diskless coordinator, whose role would
> be
> > to
> > > > > >>>>>> centrally
> > > > > >>>>>>>>> manage the batch and WAL file metadata for diskless
> topics.
> > > > This
> > > > > >>>>>>> component
> > > > > >>>>>>>>> drew many reasonable comments from the community about
> how
> > it
> > > > > >>>> would
> > > > > >>>>>>> support
> > > > > >>>>>>>>> various Kafka features (transactions, queues) and its
> > > > > >>>> scalability.
> > > > > >>>>>>> While we
> > > > > >>>>>>>>> believe we have good answers to all the expressed
> concerns,
> > > we
> > > > > >>>> took a
> > > > > >>>>>>> step
> > > > > >>>>>>>>> back and looked at the problem from a different
> > perspective.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> We would like to propose an alternative Diskless design
> > > > *without
> > > > > >>>> a
> > > > > >>>>>>>>> centralized coordinator*. We believe this approach has
> > > > potential
> > > > > >>>> and
> > > > > >>>>>>>>> propose to discuss it as it may be more appealing to the
> > > > > >>>> community.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Let us explain the idea. Most of the complications with
> the
> > > > > >>>> original
> > > > > >>>>>>>>> Diskless approach come from one necessary architecture
> > > change:
> > > > > >>>>>>> globalizing
> > > > > >>>>>>>>> the local state of partition leader in the batch
> > coordinator.
> > > > > >>>> This
> > > > > >>>>>>> causes
> > > > > >>>>>>>>> deviations to the established workflows in various
> features
> > > > like
> > > > > >>>>>>> produce
> > > > > >>>>>>>>> idempotence and transactions, queues, retention, etc.
> These
> > > > > >>>>>> deviations
> > > > > >>>>>>> need
> > > > > >>>>>>>>> to be carefully considered, designed, and later
> implemented
> > > and
> > > > > >>>>>>> tested. In
> > > > > >>>>>>>>> the new approach we want to avoid this by making
> partition
> > > > > >>>> leaders
> > > > > >>>>>>> again
> > > > > >>>>>>>>> responsible for managing their partitions, even in
> diskless
> > > > > >>>> topics.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> In classic Kafka topics, batch data and metadata are
> > blended
> > > > > >>>> together
> > > > > >>>>>>> in
> > > > > >>>>>>>>> the one partition log. The crux of the Diskless idea is
> to
> > > > > >>>> decouple
> > > > > >>>>>>> them
> > > > > >>>>>>>>> and move data to the remote storage, while keeping
> metadata
> > > > > >>>> somewhere
> > > > > >>>>>>> else.
> > > > > >>>>>>>>> Using the central batch coordinator for managing batch
> > > > > >>>>>>>>> metadata is one way, but not the only one.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Let’s now think about managing metadata for each user
> > > partition
> > > > > >>>>>>>>> independently. Generally partitions are independent and
> > don’t
> > > > > >>>> share
> > > > > >>>>>>>>> anything apart from the fact that their data are mixed in
> > > > > >>>>>>>>> WAL files.
> > > If
> > > > we
> > > > > >>>>>>> figure
> > > > > >>>>>>>>> out how to commit and later delete WAL files safely, we
> > will
> > > > > >>>> achieve
> > > > > >>>>>>> the
> > > > > >>>>>>>>> necessary autonomy that allows us to get rid of the
> central
> > > > batch
> > > > > >>>>>>>>> coordinator. Instead, *each diskless user partition will
> be
> > > > > >>>> managed
> > > > > >>>>>> by
> > > > > >>>>>>> its
> > > > > >>>>>>>>> leader*, as in classic Kafka topics. Also like in classic
> > > > > >>>> topics, the
> > > > > >>>>>>>>> leader uses the partition log as the way to persist batch
> > > > > >>>> metadata,
> > > > > >>>>>>> i.e.
> > > > > >>>>>>>>> the regular batch header + the information about how to
> > find
> > > > this
> > > > > >>>>>>> batch on
> > > > > >>>>>>>>> remote storage. In contrast to classic topics, batch data
> > is
> > > in
> > > > > >>>>>> remote
> > > > > >>>>>>>>> storage.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> For clarity, let’s compare the three designs:
> > > > > >>>>>>>>> • Classic topics:
> > > > > >>>>>>>>>  • Data and metadata are co-located in the partition log.
> > > > > >>>>>>>>>  • The partition log content: [Batch header
> > (metadata)|Batch
> > > > > >>>> data].
> > > > > >>>>>>>>>  • The partition log is replicated to the followers.
> > > > > >>>>>>>>>  • The replicas and leader have local state built from
> > > > > >>>> metadata.
> > > > > >>>>>>>>> • Original Diskless:
> > > > > >>>>>>>>>  • Metadata is in the batch coordinator, data is on
> remote
> > > > > >>>> storage.
> > > > > >>>>>>>>>  • The partition state is global in the batch
> coordinator.
> > > > > >>>>>>>>> • New Diskless:
> > > > > >>>>>>>>>  • Metadata is in the partition log, data is on remote
> > > storage.
> > > > > >>>>>>>>>  • Partition log content: [Batch header (metadata)|Batch
> > > > > >>>>>> coordinates
> > > > > >>>>>>> on
> > > > > >>>>>>>>> remote storage].
> > > > > >>>>>>>>>  • The partition log is replicated to the followers.
> > > > > >>>>>>>>>  • The replicas and leader have local state built from
> > > > > >>>> metadata.
> > > > > >>>>>>>>>
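To make the “New Diskless” log content above concrete, a committed entry in the partition log could carry roughly the following information; the record and field names are illustrative assumptions, not a proposed wire format.

    // Illustrative layout only: the familiar batch header plus the coordinates of the
    // batch data inside a shared WAL file on remote storage.
    record BatchHeader(long baseOffset, int recordCount, long producerId, short producerEpoch) {}

    record RemoteCoordinates(String walFileName, long byteOffset, int sizeInBytes) {}

    record DisklessLogEntry(BatchHeader header, RemoteCoordinates coordinates) {
        // A classic-topic entry would carry the batch data itself where the coordinates are.
    }

The replication path is then the same as for classic topics; only the batch payload is replaced by a pointer into a shared WAL file.
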
> > > > > >>>>>>>>> Let’s consider the produce path. Here’s the reminder of
> the
> > > > > >>>> original
> > > > > >>>>>>>>> Diskless design:
> > > > > >>>>>>>>> [diagram of the original Diskless produce path; the image is
> > > > > >>>>>>>>> not rendered in the plain-text archive]
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> The new approach could be depicted as the following:
> > > > > >>>>>>>>> [diagram of the new produce path; the image is not rendered
> > > > > >>>>>>>>> in the plain-text archive]
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> As you can see, the main difference is that now instead
> of
> > a
> > > > > >>>> single
> > > > > >>>>>>> commit
> > > > > >>>>>>>>> request to the batch coordinator, we send multiple
> parallel
> > > > > >>>> commit
> > > > > >>>>>>> requests
> > > > > >>>>>>>>> to all the leaders of each partition involved in the WAL
> > > file.
> > > > > >>>> Each
> > > > > >>>>>> of
> > > > > >>>>>>> them
> > > > > >>>>>>>>> will commit its batches independently, without
> coordinating
> > > > with
> > > > > >>>>>> other
> > > > > >>>>>>>>> leaders or any other components. Batch data is addressed by
> > > > > >>>>>>>>> the WAL file name, the byte offset, and the size, so a
> > > > > >>>>>>>>> partition needs to know nothing about other partitions in
> > > > > >>>>>>>>> order to access its own data in shared WAL files.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> The number of partitions involved in a single WAL file
> may
> > be
> > > > > >>>> quite
> > > > > >>>>>>> large,
> > > > > >>>>>>>>> e.g. a hundred. A hundred network requests to commit one
> > WAL
> > > > > >>>> file is
> > > > > >>>>>>> very
> > > > > >>>>>>>>> impractical. However, there are ways to reduce this
> number:
> > > > > >>>>>>>>> 1. Partition leaders are located on brokers. Requests to
> > > > > >>>> leaders on
> > > > > >>>>>>> one
> > > > > >>>>>>>>> broker could be grouped together into a single physical
> > > network
> > > > > >>>>>> request
> > > > > >>>>>>>>> (resembling the normal Produce request that may carry
> > batches
> > > > for
> > > > > >>>>>> many
> > > > > >>>>>>>>> partitions inside). This will cap the number of network
> > > > requests
> > > > > >>>> to
> > > > > >>>>>> the
> > > > > >>>>>>>>> number of brokers in the cluster.
> > > > > >>>>>>>>> 2. If we craft the cluster metadata to make producers send
> > > > > >>>>>>>>> their requests to the right brokers (with respect to AZs), we
> > > > > >>>>>>>>> may achieve a higher concentration of logical commit requests
> > > > > >>>>>>>>> in physical network requests, reducing the number of the
> > > > > >>>>>>>>> latter even further, ideally to one.
> > > > > >>>>>>>>>
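To make point 1 above concrete, a minimal, hypothetical grouping of the logical commit requests could look like this; the real request shapes would be defined in the KIP itself.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    final class CommitGroupingSketch {
        record PartitionCommit(String topic, int partition, long byteOffset, int sizeInBytes) {}

        // leaderOf resolves the leader broker id of a partition from cluster metadata.
        static Map<Integer, List<PartitionCommit>> groupByLeaderBroker(
                List<PartitionCommit> commits, Function<PartitionCommit, Integer> leaderOf) {
            Map<Integer, List<PartitionCommit>> byBroker = new HashMap<>();
            for (PartitionCommit commit : commits) {
                byBroker.computeIfAbsent(leaderOf.apply(commit), id -> new ArrayList<>()).add(commit);
            }
            // One physical request per map entry: bounded by the number of brokers, not partitions.
            return byBroker;
        }
    }

With the AZ-aware routing from point 2, the resulting map would ideally contain a single entry.
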
> > > > > >>>>>>>>> Obviously, out of multiple commit requests some may fail
> or
> > > > time
> > > > > >>>> out
> > > > > >>>>>>> for a
> > > > > >>>>>>>>> variety of reasons. This is fine. Some producers will
> > receive
> > > > > >>>> totally
> > > > > >>>>>>> or
> > > > > >>>>>>>>> partially failed responses to their Produce requests,
> > similar
> > > > to
> > > > > >>>> what
> > > > > >>>>>>> they
> > > > > >>>>>>>>> would have received when appending to a classic topic
> fails
> > > or
> > > > > >>>> times
> > > > > >>>>>>> out.
> > > > > >>>>>>>>> If a partition experiences problems, other partitions
> will
> > > not
> > > > be
> > > > > >>>>>>> affected
> > > > > >>>>>>>>> (again, like in classic topics). Of course, the
> uncommitted
> > > > data
> > > > > >>>> will
> > > > > >>>>>>> be
> > > > > >>>>>>>>> garbage in WAL files. But WAL files are short-lived
> > (batches
> > > > are
> > > > > >>>>>>> constantly
> > > > > >>>>>>>>> assembled into segments and offloaded to tiered storage),
> > so
> > > > this
> > > > > >>>>>>> garbage
> > > > > >>>>>>>>> will be eventually deleted.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> For safely deleting WAL files we now need to centrally
> > manage
> > > > > >>>> them,
> > > > > >>>>>> as
> > > > > >>>>>>>>> this is the only state and logic that spans multiple
> > > > partitions.
> > > > > >>>> On
> > > > > >>>>>> the
> > > > > >>>>>>>>> diagram, you can see another commit request called
> “Commit
> > > file
> > > > > >>>> (best
> > > > > >>>>>>>>> effort)” going to the WAL File Manager. This manager will
> > be
> > > > > >>>>>>> responsible
> > > > > >>>>>>>>> for the following:
> > > > > >>>>>>>>> 1. Collecting (by requests from brokers) and persisting
> > > > > >>>> information
> > > > > >>>>>>> about
> > > > > >>>>>>>>> committed WAL files.
> > > > > >>>>>>>>> 2. To handle potential failures in file information delivery,
> > > > > >>>>>>>>> it will periodically do a prefix scan on the remote storage to
> > > > > >>>>>>>>> find and register unknown files. The period of this scan will
> > > > > >>>>>>>>> be configurable and ideally should be quite long.
> > > > > >>>>>>>>> 3. Checking with the relevant partition leaders (after a
> > > grace
> > > > > >>>>>>> period) if
> > > > > >>>>>>>>> they still have batches in a particular file.
> > > > > >>>>>>>>> 4. Physically deleting files when they are no longer referred
> > > > > >>>>>>>>> to by any partition.
> > > > > >>>>>>>>>
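Putting the four duties above together, a hypothetical cleanup loop could look roughly like the following; the interfaces, the grace period value and the method names are all assumptions for discussion.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    final class WalFileManagerSketch {
        interface RemoteStorage { Set<String> listWalFiles(String prefix); void delete(String walFile); }
        interface PartitionLeaders { boolean anyLeaderStillReferences(String walFile); }

        private final Map<String, Instant> knownFiles = new ConcurrentHashMap<>();
        private final Duration gracePeriod = Duration.ofMinutes(30); // illustrative value only

        // Duty 1: brokers report committed WAL files (best effort).
        void register(String walFile) { knownFiles.putIfAbsent(walFile, Instant.now()); }

        // Duty 2: a periodic prefix scan catches files whose registration was lost.
        void scan(RemoteStorage storage, String prefix) {
            storage.listWalFiles(prefix).forEach(this::register);
        }

        // Duties 3 and 4: after the grace period, ask the leaders and delete unreferenced files.
        void cleanUp(RemoteStorage storage, PartitionLeaders leaders) {
            Instant cutoff = Instant.now().minus(gracePeriod);
            knownFiles.forEach((walFile, registeredAt) -> {
                if (registeredAt.isBefore(cutoff) && !leaders.anyLeaderStillReferences(walFile)) {
                    storage.delete(walFile);
                    knownFiles.remove(walFile);
                }
            });
        }
    }
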
> > > > > >>>>>>>>> This new design offers the following advantages:
> > > > > >>>>>>>>> 1. It simplifies the implementation of many Kafka
> features
> > > such
> > > > > >>>> as
> > > > > >>>>>>>>> idempotence, transactions, queues, tiered storage,
> > retention.
> > > > > >>>> Now we
> > > > > >>>>>>> don’t
> > > > > >>>>>>>>> need to abstract away and reuse the code from partition
> > > leaders
> > > > > >>>> in
> > > > > >>>>>> the
> > > > > >>>>>>>>> batch coordinator. Instead, we will literally use the
> same
> > > code
> > > > > >>>> paths
> > > > > >>>>>>> in
> > > > > >>>>>>>>> leaders, with little adaptation. Workflows from classic
> > > topics
> > > > > >>>> mostly
> > > > > >>>>>>>>> remain unchanged.
> > > > > >>>>>>>>> For example, it seems that
> > > > > >>>>>>>>>
> ReplicaManager.maybeSendPartitionsToTransactionCoordinator
> > > and
> > > > > >>>>>>>>> KafkaApis.handleWriteTxnMarkersRequest used for
> transaction
> > > > > >>>> support
> > > > > >>>>>> on
> > > > > >>>>>>> the
> > > > > >>>>>>>>> partition leader side could be used for diskless topics
> > with
> > > > > >>>> little
> > > > > >>>>>>>>> adaptation. ProducerStateManager, needed for both
> > idempotent
> > > > > >>>> produce
> > > > > >>>>>>> and
> > > > > >>>>>>>>> transactions, would be reused.
> > > > > >>>>>>>>> Another example is share groups support, where the share
> > > > > >>>> partition
> > > > > >>>>>>> leader,
> > > > > >>>>>>>>> being co-located with the partition leader, would execute
> > the
> > > > > >>>> same
> > > > > >>>>>>> logic
> > > > > >>>>>>>>> for both diskless and classic topics.
> > > > > >>>>>>>>> 2. It returns to the familiar partition-based scaling
> > model,
> > > > > >>>> where
> > > > > >>>>>>>>> partitions are independent.
> > > > > >>>>>>>>> 3. It makes the operation and failure patterns closer to
> > the
> > > > > >>>>>> familiar
> > > > > >>>>>>>>> ones from classic topics.
> > > > > >>>>>>>>> 4. It opens a straightforward path to seamlessly switching
> > > > > >>>>>>>>> topic modes between diskless and classic.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Everything else remains unchanged compared to the previous
> > > > > >>>>>>>>> Diskless design (after all previous discussions): local
> > > > > >>>>>>>>> segment materialization by replicas, the consume path, tiered
> > > > > >>>>>>>>> storage integration, etc.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> If the community finds this design more suitable, we will
> > > > update
> > > > > >>>> the
> > > > > >>>>>>>>> KIP(s) accordingly and continue working on it. Please let
> > us
> > > > know
> > > > > >>>>>> what
> > > > > >>>>>>> you
> > > > > >>>>>>>>> think.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Best regards,
> > > > > >>>>>>>>> Ivan and Diskless team
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Mon, Sep 29, 2025, at 15:06, Ivan Yurchenko wrote:
> > > > > >>>>>>>>>> Hi Justine,
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Yes, you're right. We need to track the aborted transactions
> > > > > >>>>>>>>>> in the diskless coordinator for as long as the corresponding
> > > > > >>>>>>>>>> offsets are there. With the tiered storage unification Greg
> > > > > >>>>>>>>>> mentioned earlier, this will be a finite time even for
> > > > > >>>>>>>>>> infinite data retention.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Best,
> > > > > >>>>>>>>>> Ivan
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Wed, Sep 17, 2025, at 19:41, Justine Olshan wrote:
> > > > > >>>>>>>>>>> Hey Ivan,
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Thanks for the response. I think most of what you said
> > made
> > > > > >>>>>> sense,
> > > > > >>>>>>> but
> > > > > >>>>>>>>> I
> > > > > >>>>>>>>>>> did have some questions about this part:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> As we understand this, the partition leader in classic
> > > > > >>>>>>>>>>>> topics forgets about a transaction once it’s replicated
> > > > > >>>>>>>>>>>> (the HWM passes it). The transaction coordinator acts like
> > > > > >>>>>>>>>>>> the main guardian, allowing partition leaders to do this
> > > > > >>>>>>>>>>>> safely. Please correct me if this is wrong. We are thinking
> > > > > >>>>>>>>>>>> about relying on this with the batch coordinator and
> > > > > >>>>>>>>>>>> deleting the information about a transaction once it’s
> > > > > >>>>>>>>>>>> finished (as there’s no replication and the HWM advances
> > > > > >>>>>>>>>>>> immediately).
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> I didn't quite understand this. In classic topics, we
> > have
> > > > > >>>> maps
> > > > > >>>>>> for
> > > > > >>>>>>>>> ongoing
> > > > > >>>>>>>>>>> transactions which remove state when the transaction is
> > > > > >>>> completed
> > > > > >>>>>>> and
> > > > > >>>>>>>>> an
> > > > > >>>>>>>>>>> aborted transactions index which is retained for much
> > > longer.
> > > > > >>>>>> Once
> > > > > >>>>>>> the
> > > > > >>>>>>>>>>> transaction is completed, the coordinator is no longer
> > > > > >>>> involved
> > > > > >>>>>> in
> > > > > >>>>>>>>>>> maintaining this partition side state, and it is
> subject
> > to
> > > > > >>>>>>> compaction
> > > > > >>>>>>>>> etc.
> > > > > >>>>>>>>>>> Looking back at the outline provided above, I didn't
> see
> > > much
> > > > > >>>>>>> about the
> > > > > >>>>>>>>>>> fetch path, so maybe that could be expanded a bit
> > further.
> > > I
> > > > > >>>> saw
> > > > > >>>>>>> the
> > > > > >>>>>>>>>>> following in a response:
> > > > > >>>>>>>>>>>> When the broker constructs a fully valid local
> segment,
> > > > > >>>> all the
> > > > > >>>>>>>>> necessary
> > > > > >>>>>>>>>>> control batches will be inserted and indices, including
> > the
> > > > > >>>>>>> transaction
> > > > > >>>>>>>>>>> index will be built to serve FetchRequests exactly as
> > they
> > > > > >>>> are
> > > > > >>>>>>> today.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Based on this, it seems like we need to retain the
> > > > > >>>> information
> > > > > >>>>>>> about
> > > > > >>>>>>>>>>> aborted txns for longer.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>> Justine
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> On Mon, Sep 15, 2025 at 9:43 AM Ivan Yurchenko <
> > > > > >>>> [email protected]>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>> Hi Justine and all,
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Thank you for your questions!
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> JO 1. >Since a transaction could be uniquely
> identified
> > > > > >>>> with
> > > > > >>>>>>>>> producer ID
> > > > > >>>>>>>>>>>>> and epoch, the positive result of this check could be
> > > > > >>>> cached
> > > > > >>>>>>>>> locally
> > > > > >>>>>>>>>>>>> Are we saying that only new transaction version 2
> > > > > >>>>>> transactions
> > > > > >>>>>>> can
> > > > > >>>>>>>>> be
> > > > > >>>>>>>>>>>> used
> > > > > >>>>>>>>>>>>> here? If not, we can't uniquely identify transactions
> > > > > >>>> with
> > > > > >>>>>>>>> producer id +
> > > > > >>>>>>>>>>>>> epoch
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> You’re right that we (probably unintentionally)
> focused
> > > > > >>>> only on
> > > > > >>>>>>>>> version 2.
> > > > > >>>>>>>>>>>> We can either limit the support to version 2 or
> consider
> > > > > >>>> using
> > > > > >>>>>>> some
> > > > > >>>>>>>>>>>> surrogates to support version 1.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> JO 2. >The batch coordinator does the final
> > transactional
> > > > > >>>>>>> checks
> > > > > >>>>>>>>> of the
> > > > > >>>>>>>>>>>>> batches. This procedure would output the same errors
> > > > > >>>> like the
> > > > > >>>>>>>>> partition
> > > > > >>>>>>>>>>>>> leader in classic topics would do.
> > > > > >>>>>>>>>>>>> Can you expand on what these checks are? Would you be
> > > > > >>>>>> checking
> > > > > >>>>>>> if
> > > > > >>>>>>>>> the
> > > > > >>>>>>>>>>>>> transaction was still ongoing for example?* *
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Yes, the producer epoch, that the transaction is
> > ongoing,
> > > > > >>>> and
> > > > > >>>>>> of
> > > > > >>>>>>>>> course
> > > > > >>>>>>>>>>>> the normal idempotence checks. What the partition
> leader
> > > > > >>>> in the
> > > > > >>>>>>>>> classic
> > > > > >>>>>>>>>>>> topics does before appending a batch to the local log
> > > > > >>>> (e.g. in
> > > > > >>>>>>>>>>>> UnifiedLog.maybeStartTransactionVerification and
> > > > > >>>>>>>>>>>> UnifiedLog.analyzeAndValidateProducerState). In
> > Diskless,
> > > > > >>>> we
> > > > > >>>>>>>>> unfortunately
> > > > > >>>>>>>>>>>> cannot do these checks before appending the data to
> the
> > > WAL
> > > > > >>>>>>> segment
> > > > > >>>>>>>>> and
> > > > > >>>>>>>>>>>> uploading it, but we can “tombstone” these batches in
> > the
> > > > > >>>> batch
> > > > > >>>>>>>>> coordinator
> > > > > >>>>>>>>>>>> during the final commit.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Is there state about ongoing
> > > > > >>>>>>>>>>>>> transactions in the batch coordinator? I see some
> other
> > > > > >>>> state
> > > > > >>>>>>>>> mentioned
> > > > > >>>>>>>>>>>> in
> > > > > >>>>>>>>>>>>> the End transaction section, but it's not super clear
> > > > > >>>> what
> > > > > >>>>>>> state is
> > > > > >>>>>>>>>>>> stored
> > > > > >>>>>>>>>>>>> and when it is stored.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Right, this should have been more explicit. As the
> > > > > >>>> partition
> > > > > >>>>>>> leader
> > > > > >>>>>>>>> tracks
> > > > > >>>>>>>>>>>> ongoing transactions for classic topics, the batch
> > > > > >>>> coordinator
> > > > > >>>>>>> has
> > > > > >>>>>>>>> to as
> > > > > >>>>>>>>>>>> well. So when a transaction starts and ends, the
> > > > > >>>> transaction
> > > > > >>>>>>>>> coordinator
> > > > > >>>>>>>>>>>> must inform the batch coordinator about this.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> JO 3. I didn't see anything about maintaining LSO --
> > > > > >>>> perhaps
> > > > > >>>>>>> that
> > > > > >>>>>>>>> would
> > > > > >>>>>>>>>>>> be
> > > > > >>>>>>>>>>>>> stored in the batch coordinator?
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Yes. This could be deduced from the committed batches
> > and
> > > > > >>>> other
> > > > > >>>>>>>>>>>> information, but for the sake of performance we’d
> better
> > > > > >>>> store
> > > > > >>>>>> it
> > > > > >>>>>>>>>>>> explicitly.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> JO 4. Are there any thoughts about how long
> > transactional
> > > > > >>>>>>> state is
> > > > > >>>>>>>>>>>>> maintained in the batch coordinator and how it will
> be
> > > > > >>>>>> cleaned
> > > > > >>>>>>> up?
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> As we understand this, the partition leader in classic
> > > > > >>>>>>>>>>>> topics forgets about a transaction once it’s replicated
> > > > > >>>>>>>>>>>> (the HWM passes it). The transaction coordinator acts like
> > > > > >>>>>>>>>>>> the main guardian, allowing partition leaders to do this
> > > > > >>>>>>>>>>>> safely. Please correct me if this is wrong. We are thinking
> > > > > >>>>>>>>>>>> about relying on this with the batch coordinator and
> > > > > >>>>>>>>>>>> deleting the information about a transaction once it’s
> > > > > >>>>>>>>>>>> finished (as there’s no replication and the HWM advances
> > > > > >>>>>>>>>>>> immediately).
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Best,
> > > > > >>>>>>>>>>>> Ivan
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Tue, Sep 9, 2025, at 00:38, Justine Olshan wrote:
> > > > > >>>>>>>>>>>>> Hey folks,
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Excited to see some updates related to transactions!
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> I had a few questions.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> JO 1. >Since a transaction could be uniquely
> identified
> > > > > >>>> with
> > > > > >>>>>>>>> producer ID
> > > > > >>>>>>>>>>>>> and epoch, the positive result of this check could be
> > > > > >>>> cached
> > > > > >>>>>>>>> locally
> > > > > >>>>>>>>>>>>> Are we saying that only new transaction version 2
> > > > > >>>>>> transactions
> > > > > >>>>>>> can
> > > > > >>>>>>>>> be
> > > > > >>>>>>>>>>>> used
> > > > > >>>>>>>>>>>>> here? If not, we can't uniquely identify transactions
> > > > > >>>> with
> > > > > >>>>>>>>> producer id +
> > > > > >>>>>>>>>>>>> epoch
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> JO 2. >The batch coordinator does the final
> > transactional
> > > > > >>>>>>> checks
> > > > > >>>>>>>>> of the
> > > > > >>>>>>>>>>>>> batches. This procedure would output the same errors
> > > > > >>>> like the
> > > > > >>>>>>>>> partition
> > > > > >>>>>>>>>>>>> leader in classic topics would do.
> > > > > >>>>>>>>>>>>> Can you expand on what these checks are? Would you be
> > > > > >>>>>> checking
> > > > > >>>>>>> if
> > > > > >>>>>>>>> the
> > > > > >>>>>>>>>>>>> transaction was still ongoing for example? Is there
> > state
> > > > > >>>>>> about
> > > > > >>>>>>>>> ongoing
> > > > > >>>>>>>>>>>>> transactions in the batch coordinator? I see some
> other
> > > > > >>>> state
> > > > > >>>>>>>>> mentioned
> > > > > >>>>>>>>>>>> in
> > > > > >>>>>>>>>>>>> the End transaction section, but it's not super clear
> > > > > >>>> what
> > > > > >>>>>>> state is
> > > > > >>>>>>>>>>>> stored
> > > > > >>>>>>>>>>>>> and when it is stored.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> JO 3. I didn't see anything about maintaining LSO --
> > > > > >>>> perhaps
> > > > > >>>>>>> that
> > > > > >>>>>>>>> would
> > > > > >>>>>>>>>>>> be
> > > > > >>>>>>>>>>>>> stored in the batch coordinator?
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> JO 4. Are there any thoughts about how long
> > transactional
> > > > > >>>>>>> state is
> > > > > >>>>>>>>>>>>> maintained in the batch coordinator and how it will
> be
> > > > > >>>>>> cleaned
> > > > > >>>>>>> up?
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On Mon, Sep 8, 2025 at 10:38 AM Jun Rao
> > > > > >>>>>>> <[email protected]>
> > > > > >>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Hi, Greg and Ivan,
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Thanks for the update. A few comments.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> JR 10. "Consumer fetches are now served from local
> > > > > >>>>>> segments,
> > > > > >>>>>>>>> making
> > > > > >>>>>>>>>>>> use of
> > > > > >>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>> indexes, page cache, request purgatory, and
> zero-copy
> > > > > >>>>>>>>> functionality
> > > > > >>>>>>>>>>>> already
> > > > > >>>>>>>>>>>>>> built into classic topics."
> > > > > >>>>>>>>>>>>>> JR 10.1 Does the broker build the producer state for
> > > > > >>>> each
> > > > > >>>>>>>>> partition in
> > > > > >>>>>>>>>>>>>> diskless topics?
> > > > > >>>>>>>>>>>>>> JR 10.2 For transactional data, the consumer fetches
> > > > > >>>> need
> > > > > >>>>>> to
> > > > > >>>>>>> know
> > > > > >>>>>>>>>>>> aborted
> > > > > >>>>>>>>>>>>>> records. How is that achieved?
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> JR 11. "The batch coordinator saves that the
> > > > > >>>> transaction is
> > > > > >>>>>>>>> finished
> > > > > >>>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>> also inserts the control batches in the
> corresponding
> > > > > >>>> logs
> > > > > >>>>>>> of the
> > > > > >>>>>>>>>>>> involved
> > > > > >>>>>>>>>>>>>> Diskless topics. This happens only on the metadata
> > > > > >>>> level,
> > > > > >>>>>> no
> > > > > >>>>>>>>> actual
> > > > > >>>>>>>>>>>> control
> > > > > >>>>>>>>>>>>>> batches are written to any file. "
> > > > > >>>>>>>>>>>>>> A fetch response could include multiple
> transactional
> > > > > >>>>>>> batches.
> > > > > >>>>>>>>> How
> > > > > >>>>>>>>>>>> does the
> > > > > >>>>>>>>>>>>>> broker obtain the information about the ending
> control
> > > > > >>>>>> batch
> > > > > >>>>>>> for
> > > > > >>>>>>>>> each
> > > > > >>>>>>>>>>>>>> batch? Does that mean that a fetch response needs to
> > be
> > > > > >>>>>>> built by
> > > > > >>>>>>>>>>>>>> stitching record batches and generated control
> batches
> > > > > >>>>>>> together?
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> JR 12. Queues: Is there still a share partition
> leader
> > > > > >>>> that
> > > > > >>>>>>> all
> > > > > >>>>>>>>>>>> consumers
> > > > > >>>>>>>>>>>>>> are routed to?
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> JR 13. "Should the KIPs be modified to include this
> or
> > > > > >>>> it's
> > > > > >>>>>>> too
> > > > > >>>>>>>>>>>>>> implementation-focused?" It would be useful to
> include
> > > > > >>>>>> enough
> > > > > >>>>>>>>> details
> > > > > >>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>> understand correctness and performance impact.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> HC5. Henry has a valid point. Requests from a given
> > > > > >>>>>> producer
> > > > > >>>>>>>>> contain a
> > > > > >>>>>>>>>>>>>> sequence number, which is ordered. If a producer
> sends
> > > > > >>>>>> every
> > > > > >>>>>>>>> Produce
> > > > > >>>>>>>>>>>>>> request to an arbitrary broker, those requests could
> > > > > >>>> reach
> > > > > >>>>>>> the
> > > > > >>>>>>>>> batch
> > > > > >>>>>>>>>>>>>> coordinator in different order and lead to rejection
> > > > > >>>> of the
> > > > > >>>>>>>>> produce
> > > > > >>>>>>>>>>>>>> requests.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Jun
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> On Thu, Sep 4, 2025 at 12:00 AM Ivan Yurchenko <
> > > > > >>>>>>> [email protected]>
> > > > > >>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Hi all,
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> We have also thought in a bit more details about
> > > > > >>>>>>> transactions
> > > > > >>>>>>>>> and
> > > > > >>>>>>>>>>>> queues,
> > > > > >>>>>>>>>>>>>>> here's the plan.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> *Transactions*
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> The support for transactions in *classic topics* is
> > > > > >>>> based
> > > > > >>>>>>> on
> > > > > >>>>>>>>> precise
> > > > > >>>>>>>>>>>>>>> interactions between three actors: clients (mostly
> > > > > >>>>>>> producers,
> > > > > >>>>>>>>> but
> > > > > >>>>>>>>>>>> also
> > > > > >>>>>>>>>>>>>>> consumers), brokers (ReplicaManager and other
> > > > > >>>> classes),
> > > > > >>>>>> and
> > > > > >>>>>>>>>>>> transaction
> > > > > >>>>>>>>>>>>>>> coordinators. Brokers also run partition leaders
> with
> > > > > >>>>>> their
> > > > > >>>>>>>>> local
> > > > > >>>>>>>>>>>> state
> > > > > >>>>>>>>>>>>>>> (ProducerStateManager and others).
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> The high level (some details skipped) workflow is
> the
> > > > > >>>>>>>>> following.
> > > > > >>>>>>>>>>>> When a
> > > > > >>>>>>>>>>>>>>> transactional Produce request is received by the
> > > > > >>>> broker:
> > > > > >>>>>>>>>>>>>>> 1. For each partition, the partition leader checks
> > > > > >>>> if a
> > > > > >>>>>>>>> non-empty
> > > > > >>>>>>>>>>>>>>> transaction is running for this partition. This is
> > > > > >>>> done
> > > > > >>>>>>> using
> > > > > >>>>>>>>> its
> > > > > >>>>>>>>>>>> local
> > > > > >>>>>>>>>>>>>>> state derived from the log metadata
> > > > > >>>>>> (ProducerStateManager,
> > > > > >>>>>>>>>>>>>>> VerificationStateEntry, VerificationGuard).
> > > > > >>>>>>>>>>>>>>> 2. The transaction coordinator is informed about
> all
> > > > > >>>> the
> > > > > >>>>>>>>> partitions
> > > > > >>>>>>>>>>>> that
> > > > > >>>>>>>>>>>>>>> aren’t part of the transaction to include them.
> > > > > >>>>>>>>>>>>>>> 3. The partition leaders do additional
> transactional
> > > > > >>>>>>> checks.
> > > > > >>>>>>>>>>>>>>> 4. The partition leaders append the transactional
> > > > > >>>> data to
> > > > > >>>>>>>>> their logs
> > > > > >>>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>>> update some of their state (for example, log the
> fact
> > > > > >>>>>> that
> > > > > >>>>>>> the
> > > > > >>>>>>>>>>>>>> transaction
> > > > > >>>>>>>>>>>>>>> is running for the partition and its first offset).
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> When the transaction is committed or aborted:
> > > > > >>>>>>>>>>>>>>> 1. The producer contacts the transaction
> coordinator
> > > > > >>>>>>> directly
> > > > > >>>>>>>>> with
> > > > > >>>>>>>>>>>>>>> EndTxnRequest.
> > > > > >>>>>>>>>>>>>>> 2. The transaction coordinator writes
> PREPARE_COMMIT
> > > > > >>>> or
> > > > > >>>>>>>>>>>> PREPARE_ABORT to
> > > > > >>>>>>>>>>>>>>> its log and responds to the producer.
> > > > > >>>>>>>>>>>>>>> 3. The transaction coordinator sends
> > > > > >>>>>>> WriteTxnMarkersRequest to
> > > > > >>>>>>>>> the
> > > > > >>>>>>>>>>>>>> leaders
> > > > > >>>>>>>>>>>>>>> of the involved partitions.
> > > > > >>>>>>>>>>>>>>> 4. The partition leaders write the transaction
> > > > > >>>> markers to
> > > > > >>>>>>>>> their logs
> > > > > >>>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>>> respond to the coordinator.
> > > > > >>>>>>>>>>>>>>> 5. The coordinator writes the final transaction
> state
> > > > > >>>>>>>>>>>> COMPLETE_COMMIT or
> > > > > >>>>>>>>>>>>>>> COMPLETE_ABORT.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> In classic topics, partitions have leaders and lots
> > > > > >>>> of
> > > > > >>>>>>>>> important
> > > > > >>>>>>>>>>>> state
> > > > > >>>>>>>>>>>>>>> necessary for supporting this workflow is local.
> The
> > > > > >>>> main
> > > > > >>>>>>>>> challenge
> > > > > >>>>>>>>>>>> in
> > > > > >>>>>>>>>>>>>>> mapping this to Diskless comes from the fact there
> > > > > >>>> are no
> > > > > >>>>>>>>> partition
> > > > > >>>>>>>>>>>>>>> leaders, so the corresponding pieces of state need
> > > > > >>>> to be
> > > > > >>>>>>>>> globalized
> > > > > >>>>>>>>>>>> in
> > > > > >>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>> batch coordinator. We are already doing this to
> > > > > >>>> support
> > > > > >>>>>>>>> idempotent
> > > > > >>>>>>>>>>>>>> produce.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> The high level workflow for *diskless topics* would
> > > > > >>>> look
> > > > > >>>>>>> very
> > > > > >>>>>>>>>>>> similar:
> > > > > >>>>>>>>>>>>>>> 1. For each partition, the broker checks if a
> > > > > >>>> non-empty
> > > > > >>>>>>>>> transaction
> > > > > >>>>>>>>>>>> is
> > > > > >>>>>>>>>>>>>>> running for this partition. In contrast to classic
> > > > > >>>>>> topics,
> > > > > >>>>>>>>> this is
> > > > > >>>>>>>>>>>>>> checked
> > > > > >>>>>>>>>>>>>>> against the batch coordinator with a single RPC.
> > > > > >>>> Since a
> > > > > >>>>>>>>> transaction
> > > > > >>>>>>>>>>>>>> could
> > > > > >>>>>>>>>>>>>>> be uniquely identified with producer ID and epoch,
> > > > > >>>> the
> > > > > >>>>>>> positive
> > > > > >>>>>>>>>>>> result of
> > > > > >>>>>>>>>>>>>>> this check could be cached locally (for double the
> > > > > >>>>>>>>>>>>>>> configured transaction duration, for example).
> > > > > >>>>>>>>>>>>>>> 2. The same: The transaction coordinator is
> informed
> > > > > >>>>>> about
> > > > > >>>>>>> all
> > > > > >>>>>>>>> the
> > > > > >>>>>>>>>>>>>>> partitions that aren’t part of the transaction to
> > > > > >>>> include
> > > > > >>>>>>> them.
> > > > > >>>>>>>>>>>>>>> 3. No transactional checks are done on the broker
> > > > > >>>> side.
> > > > > >>>>>>>>>>>>>>> 4. The broker appends the transactional data to the
> > > > > >>>>>> current
> > > > > >>>>>>>>> shared
> > > > > >>>>>>>>>>>> WAL
> > > > > >>>>>>>>>>>>>>> segment. It doesn’t update any transaction-related
> > > > > >>>> state
> > > > > >>>>>>> for
> > > > > >>>>>>>>> Diskless
> > > > > >>>>>>>>>>>>>>> topics, because it doesn’t have any.
> > > > > >>>>>>>>>>>>>>> 5. The WAL segment is committed to the batch
> > > > > >>>> coordinator
> > > > > >>>>>>> like
> > > > > >>>>>>>>> in the
> > > > > >>>>>>>>>>>>>>> normal produce flow.
> > > > > >>>>>>>>>>>>>>> 6. The batch coordinator does the final transactional
> > > > > >>>>>>>>>>>>>>> checks of the batches. This procedure would output the
> > > > > >>>>>>>>>>>>>>> same errors as the partition leader in classic topics
> > > > > >>>>>>>>>>>>>>> would, i.e. some batches could be rejected. This means
> > > > > >>>>>>>>>>>>>>> there will potentially be garbage in the WAL segment
> > > > > >>>>>>>>>>>>>>> file in case of transactional errors. This is preferable
> > > > > >>>>>>>>>>>>>>> to doing more network round trips, especially
> > > > > >>>>>>>>>>>>>>> considering the WAL segments will be relatively
> > > > > >>>>>>>>>>>>>>> short-lived (see Greg's update above).
> > > > > >>>>>>>>>>>>>>>
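As an illustration of the caching mentioned in step 1 above, a broker-local cache keyed by producer ID and epoch could look roughly like this; it assumes transactions version 2, where that pair uniquely identifies a transaction, and all names are hypothetical.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class OngoingTransactionCacheSketch {
        record Key(long producerId, short producerEpoch) {}

        private final Map<Key, Instant> verifiedUntil = new ConcurrentHashMap<>();
        private final Duration ttl;

        OngoingTransactionCacheSketch(Duration configuredTransactionDuration) {
            this.ttl = configuredTransactionDuration.multipliedBy(2); // "double the configured duration"
        }

        // Called after the batch coordinator confirmed that the transaction is ongoing.
        void markVerified(long producerId, short producerEpoch) {
            verifiedUntil.put(new Key(producerId, producerEpoch), Instant.now().plus(ttl));
        }

        // If true, the broker can skip the batch coordinator RPC for this producer's batches.
        boolean isStillVerified(long producerId, short producerEpoch) {
            Instant deadline = verifiedUntil.get(new Key(producerId, producerEpoch));
            return deadline != null && Instant.now().isBefore(deadline);
        }
    }
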
> > > > > >>>>>>>>>>>>>>> When the transaction is committed or aborted:
> > > > > >>>>>>>>>>>>>>> 1. The producer contacts the transaction
> coordinator
> > > > > >>>>>>> directly
> > > > > >>>>>>>>> with
> > > > > >>>>>>>>>>>>>>> EndTxnRequest.
> > > > > >>>>>>>>>>>>>>> 2. The transaction coordinator writes
> PREPARE_COMMIT
> > > > > >>>> or
> > > > > >>>>>>>>>>>> PREPARE_ABORT to
> > > > > >>>>>>>>>>>>>>> its log and responds to the producer.
> > > > > >>>>>>>>>>>>>>> 3. *[NEW]* The transaction coordinator informs the
> > > > > >>>> batch
> > > > > >>>>>>>>> coordinator
> > > > > >>>>>>>>>>>> that
> > > > > >>>>>>>>>>>>>>> the transaction is finished.
> > > > > >>>>>>>>>>>>>>> 4. *[NEW]* The batch coordinator saves that the
> > > > > >>>>>>> transaction is
> > > > > >>>>>>>>>>>> finished
> > > > > >>>>>>>>>>>>>>> and also inserts the control batches in the
> > > > > >>>> corresponding
> > > > > >>>>>>> logs
> > > > > >>>>>>>>> of the
> > > > > >>>>>>>>>>>>>>> involved Diskless topics. This happens only on the
> > > > > >>>>>> metadata
> > > > > >>>>>>>>> level, no
> > > > > >>>>>>>>>>>>>>> actual control batches are written to any file.
> They
> > > > > >>>> will
> > > > > >>>>>>> be
> > > > > >>>>>>>>>>>> dynamically
> > > > > >>>>>>>>>>>>>>> created on Fetch and other read operations. We
> could
> > > > > >>>>>>>>> technically
> > > > > >>>>>>>>>>>> write
> > > > > >>>>>>>>>>>>>>> these control batches for real, but this would mean
> > > > > >>>> extra
> > > > > >>>>>>>>> produce
> > > > > >>>>>>>>>>>>>> latency,
> > > > > >>>>>>>>>>>>>>> so it's better just to mark them in the batch
> > > > > >>>> coordinator
> > > > > >>>>>>> and
> > > > > >>>>>>>>> save
> > > > > >>>>>>>>>>>> these
> > > > > >>>>>>>>>>>>>>> milliseconds.
> > > > > >>>>>>>>>>>>>>> 5. The transaction coordinator sends
> > > > > >>>>>>> WriteTxnMarkersRequest to
> > > > > >>>>>>>>> the
> > > > > >>>>>>>>>>>>>> leaders
> > > > > >>>>>>>>>>>>>>> of the involved partitions – now only for classic
> > > > > >>>>>>>>>>>>>>> topics.
> > > > > >>>>>>>>>>>>>>> 6. The partition leaders of classic topics write
> the
> > > > > >>>>>>>>> transaction
> > > > > >>>>>>>>>>>> markers
> > > > > >>>>>>>>>>>>>>> to their logs and respond to the coordinator.
> > > > > >>>>>>>>>>>>>>> 7. The coordinator writes the final transaction
> state
> > > > > >>>>>>>>>>>> COMPLETE_COMMIT or
> > > > > >>>>>>>>>>>>>>> COMPLETE_ABORT.
> > > > > >>>>>>>>>>>>>>>
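To illustrate steps 3 and 4 of the commit/abort flow above, the batch coordinator could persist one small marker entry per involved diskless partition and synthesize the corresponding control batch only when the log is read; all types below are hypothetical sketches, not proposed schemas.

    import java.util.List;

    final class MetadataOnlyControlBatchSketch {
        enum TxnResult { COMMIT, ABORT }

        // What the batch coordinator persists when the transaction coordinator reports the end.
        record TxnMarkerEntry(String topic, int partition, long producerId, short producerEpoch,
                              long markerOffset, TxnResult result) {}

        // What a reader stitches into the reconstructed log at markerOffset; no file is written.
        record SynthesizedControlBatch(long offset, long producerId, short producerEpoch, TxnResult result) {}

        static List<SynthesizedControlBatch> synthesize(List<TxnMarkerEntry> markers) {
            return markers.stream()
                    .map(m -> new SynthesizedControlBatch(m.markerOffset(), m.producerId(),
                            m.producerEpoch(), m.result()))
                    .toList();
        }
    }
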
> > > > > >>>>>>>>>>>>>>> Compared to the non-transactional produce flow, we
> > > > > >>>> get:
> > > > > >>>>>>>>>>>>>>> 1. An extra network round trip between brokers and
> > > > > >>>> the
> > > > > >>>>>>> batch
> > > > > >>>>>>>>>>>> coordinator
> > > > > >>>>>>>>>>>>>>> when a new partition appears in the transaction. To
> > > > > >>>>>>> mitigate the
> > > > > >>>>>>>>>>>> impact of
> > > > > >>>>>>>>>>>>>>> them:
> > > > > >>>>>>>>>>>>>>> - The results will be cached.
> > > > > >>>>>>>>>>>>>>> - The calls for multiple partitions in one Produce
> > > > > >>>>>>> request
> > > > > >>>>>>>>> will be
> > > > > >>>>>>>>>>>>>>> grouped.
> > > > > >>>>>>>>>>>>>>> - The batch coordinator should be optimized for
> > > > > >>>> fast
> > > > > >>>>>>>>> response to
> > > > > >>>>>>>>>>>> these
> > > > > >>>>>>>>>>>>>>> RPCs.
> > > > > >>>>>>>>>>>>>>> - The fact that a single producer normally will
> > > > > >>>>>>> communicate
> > > > > >>>>>>>>> with a
> > > > > >>>>>>>>>>>>>>> single broker for the duration of the transaction
> > > > > >>>> further
> > > > > >>>>>>>>> reduces the
> > > > > >>>>>>>>>>>>>>> expected number of round trips.
> > > > > >>>>>>>>>>>>>>> 2. An extra round trip between the transaction
> > > > > >>>>>> coordinator
> > > > > >>>>>>> and
> > > > > >>>>>>>>> batch
> > > > > >>>>>>>>>>>>>>> coordinator when a transaction is finished.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> With this proposal, transactions will also be able
> to
> > > > > >>>>>> span
> > > > > >>>>>>> both
> > > > > >>>>>>>>>>>> classic
> > > > > >>>>>>>>>>>>>>> and Diskless topics.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> *Queues*
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> The share group coordination and management is a
> > > > > >>>> side job
> > > > > >>>>>>> that
> > > > > >>>>>>>>>>>> doesn't
> > > > > >>>>>>>>>>>>>>> interfere with the topic itself (leadership,
> > > > > >>>> replicas,
> > > > > >>>>>>> physical
> > > > > >>>>>>>>>>>> storage
> > > > > >>>>>>>>>>>>>> of
> > > > > >>>>>>>>>>>>>>> records, etc.) and non-queue producers and
> consumers
> > > > > >>>>>>> (Fetch and
> > > > > >>>>>>>>>>>> Produce
> > > > > >>>>>>>>>>>>>>> RPCs, consumer group-related RPCs are not
> affected.)
> > > > > >>>> We
> > > > > >>>>>>> don't
> > > > > >>>>>>>>> see any
> > > > > >>>>>>>>>>>>>>> reason why we can't make Diskless topics compatible
> > > > > >>>> with
> > > > > >>>>>>> share
> > > > > >>>>>>>>>>>> groups the
> > > > > >>>>>>>>>>>>>>> same way as classic topics are. Even on the code
> > > > > >>>> level,
> > > > > >>>>>> we
> > > > > >>>>>>>>> don't
> > > > > >>>>>>>>>>>> expect
> > > > > >>>>>>>>>>>>>> any
> > > > > >>>>>>>>>>>>>>> serious refactoring: the same reading routines are
> > > > > >>>> used
> > > > > >>>>>>> that
> > > > > >>>>>>>>> are
> > > > > >>>>>>>>>>>> used for
> > > > > >>>>>>>>>>>>>>> fetching (e.g. ReplicaManager.readFromLog).
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Should the KIPs be modified to include this, or is it
> > > > > >>>>>>>>>>>>>>> too implementation-focused?
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Best regards,
> > > > > >>>>>>>>>>>>>>> Ivan
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> On Wed, Sep 3, 2025, at 21:59, Greg Harris wrote:
> > > > > >>>>>>>>>>>>>>>> Hi all,
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Thank you all for your questions and design input
> > > > > >>>> on
> > > > > >>>>>>>>> KIP-1150.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> We have just updated KIP-1150 and KIP-1163 with a
> > > > > >>>> new
> > > > > >>>>>>>>> design. To
> > > > > >>>>>>>>>>>>>>> summarize
> > > > > >>>>>>>>>>>>>>>> the changes:
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> 1. The design prioritizes integrating with the
> > > > > >>>> existing
> > > > > >>>>>>>>> KIP-405
> > > > > >>>>>>>>>>>> Tiered
> > > > > >>>>>>>>>>>>>>>> Storage interfaces, permitting data produced to a
> > > > > >>>>>>> Diskless
> > > > > >>>>>>>>> topic
> > > > > >>>>>>>>>>>> to be
> > > > > >>>>>>>>>>>>>>>> moved to tiered storage.
> > > > > >>>>>>>>>>>>>>>> This lowers the scalability requirements for the
> > > > > >>>> Batch
> > > > > >>>>>>>>> Coordinator
> > > > > >>>>>>>>>>>>>>>> component, and allows Diskless to compose with
> > > > > >>>> Tiered
> > > > > >>>>>>> Storage
> > > > > >>>>>>>>>>>> plugin
> > > > > >>>>>>>>>>>>>>>> features such as encryption and alternative data
> > > > > >>>>>> formats.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> 2. Consumer fetches are now served from local
> > > > > >>>> segments,
> > > > > >>>>>>>>> making use
> > > > > >>>>>>>>>>>> of
> > > > > >>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>> indexes, page cache, request purgatory, and
> > > > > >>>> zero-copy
> > > > > >>>>>>>>> functionality
> > > > > >>>>>>>>>>>>>>> already
> > > > > >>>>>>>>>>>>>>>> built into classic topics.
> > > > > >>>>>>>>>>>>>>>> However, local segments are now considered cache
> > > > > >>>>>>> elements,
> > > > > >>>>>>>>> do not
> > > > > >>>>>>>>>>>> need
> > > > > >>>>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>>>> be durably stored, and can be built without
> > > > > >>>> contacting
> > > > > >>>>>>> any
> > > > > >>>>>>>>> other
> > > > > >>>>>>>>>>>>>>> replicas.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> 3. The design has been simplified substantially,
> by
> > > > > >>>>>>> removing
> > > > > >>>>>>>>> the
> > > > > >>>>>>>>>>>>>> previous
> > > > > >>>>>>>>>>>>>>>> Diskless consume flow, distributed cache
> > > > > >>>> component, and
> > > > > >>>>>>>>> "object
> > > > > >>>>>>>>>>>>>>>> compaction/merging" step.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> The design maintains leaderless produces as
> > > > > >>>> enabled by
> > > > > >>>>>>> the
> > > > > >>>>>>>>> Batch
> > > > > >>>>>>>>>>>>>>>> Coordinator, and the same latency profiles as the
> > > > > >>>>>> earlier
> > > > > >>>>>>>>> design,
> > > > > >>>>>>>>>>>> while
> > > > > >>>>>>>>>>>>>>>> being simpler and integrating better into the
> > > > > >>>> existing
> > > > > >>>>>>>>> ecosystem.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Thanks, and we are eager to hear your feedback on
> > > > > >>>> the
> > > > > >>>>>> new
> > > > > >>>>>>>>> design.
> > > > > >>>>>>>>>>>>>>>> Greg Harris
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> On Mon, Jul 21, 2025 at 3:30 PM Jun Rao
> > > > > >>>>>>>>> <[email protected]>
> > > > > >>>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Hi, Jan,
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> For me, the main gap of KIP-1150 is the support
> > > > > >>>> of
> > > > > >>>>>> all
> > > > > >>>>>>>>> existing
> > > > > >>>>>>>>>>>>>> client
> > > > > >>>>>>>>>>>>>>>>> APIs. Currently, there is no design for
> > > > > >>>> supporting
> > > > > >>>>>> APIs
> > > > > >>>>>>>>> like
> > > > > >>>>>>>>>>>>>>> transactions
> > > > > >>>>>>>>>>>>>>>>> and queues.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Jun
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski
> > > > > >>>>>>>>>>>>>>>>> <[email protected]> wrote:
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Would it be a good time to ask for the current
> > > > > >>>>>>> status of
> > > > > >>>>>>>>> this
> > > > > >>>>>>>>>>>> KIP?
> > > > > >>>>>>>>>>>>>> I
> > > > > >>>>>>>>>>>>>>>>>> haven't seen much activity here for the past 2
> > > > > >>>>>>> months,
> > > > > >>>>>>>>> the
> > > > > >>>>>>>>>>>> vote got
> > > > > >>>>>>>>>>>>>>>>> vetoed
> > > > > >>>>>>>>>>>>>>>>>> but I think the pending questions have been
> > > > > >>>>>> answered
> > > > > >>>>>>>>> since
> > > > > >>>>>>>>>>>> then.
> > > > > >>>>>>>>>>>>>>> KIP-1183
> > > > > >>>>>>>>>>>>>>>>>> (AutoMQ's proposal) also didn't have any
> > > > > >>>> activity
> > > > > >>>>>>> since
> > > > > >>>>>>>>> May.
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> In my eyes KIP-1150 and KIP-1183 are two real
> > > > > >>>>>> choices
> > > > > >>>>>>>>> that can
> > > > > >>>>>>>>>>>> be
> > > > > >>>>>>>>>>>>>>>>>> made, with a coordinator-based approach being
> > > > > >>>> by
> > > > > >>>>>> far
> > > > > >>>>>>> the
> > > > > >>>>>>>>>>>> dominant
> > > > > >>>>>>>>>>>>>> one
> > > > > >>>>>>>>>>>>>>>>> when
> > > > > >>>>>>>>>>>>>>>>>> it comes to market adoption - but all these are
> > > > > >>>>>>>>> standalone
> > > > > >>>>>>>>>>>>>> products.
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> I'm a big fan of both approaches, but would
> > > > > >>>> hate to
> > > > > >>>>>>> see a
> > > > > >>>>>>>>>>>> stall. So
> > > > > >>>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>>> question is: can we get an update?
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Maybe it's time to start another vote? Colin
> > > > > >>>>>> McCabe -
> > > > > >>>>>>>>> have your
> > > > > >>>>>>>>>>>>>>> questions
> > > > > >>>>>>>>>>>>>>>>>> been answered? If not, is there anything I can
> > > > > >>>> do
> > > > > >>>>>> to
> > > > > >>>>>>>>> help? I'm
> > > > > >>>>>>>>>>>>>> deeply
> > > > > >>>>>>>>>>>>>>>>>> familiar with both architectures and have
> > > > > >>>> written
> > > > > >>>>>>> about
> > > > > >>>>>>>>> both?
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Kind regards,
> > > > > >>>>>>>>>>>>>>>>>> Jan
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> On Tue, Jun 24, 2025 at 10:42 AM Stanislav
> > > > > >>>>>> Kozlovski
> > > > > >>>>>>> <
> > > > > >>>>>>>>>>>>>>>>>> [email protected]> wrote:
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> I have some nits - it may be useful to
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> a) group all the KIP email threads in the
> > > > > >>>> main
> > > > > >>>>>> one
> > > > > >>>>>>>>> (just a
> > > > > >>>>>>>>>>>> bunch
> > > > > >>>>>>>>>>>>>> of
> > > > > >>>>>>>>>>>>>>>>> links
> > > > > >>>>>>>>>>>>>>>>>>> to everything)
> > > > > >>>>>>>>>>>>>>>>>>> b) create the email threads
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> It's a bit hard to track it all - for
> > > > > >>>> example, I
> > > > > >>>>>>> was
> > > > > >>>>>>>>>>>> searching
> > > > > >>>>>>>>>>>>>> for
> > > > > >>>>>>>>>>>>>>> a
> > > > > >>>>>>>>>>>>>>>>>>> discuss thread for KIP-1165 for a while; As
> > > > > >>>> far
> > > > > >>>>>> as
> > > > > >>>>>>> I
> > > > > >>>>>>>>> can
> > > > > >>>>>>>>>>>> tell, it
> > > > > >>>>>>>>>>>>>>>>> doesn't
> > > > > >>>>>>>>>>>>>>>>>>> exist yet.
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> Since the KIPs are published (by virtue of
> > > > > >>>> having
> > > > > >>>>>>> the
> > > > > >>>>>>>>> root
> > > > > >>>>>>>>>>>> KIP be
> > > > > >>>>>>>>>>>>>>>>>>> published, having a DISCUSS thread and links
> > > > > >>>> to
> > > > > >>>>>>>>> sub-KIPs
> > > > > >>>>>>>>>>>> where
> > > > > >>>>>>>>>>>>>> were
> > > > > >>>>>>>>>>>>>>>>> aimed
> > > > > >>>>>>>>>>>>>>>>>>> to move the discussion towards), I think it
> > > > > >>>> would
> > > > > >>>>>>> be
> > > > > >>>>>>>>> good to
> > > > > >>>>>>>>>>>>>> create
> > > > > >>>>>>>>>>>>>>>>>> DISCUSS
> > > > > >>>>>>>>>>>>>>>>>>> threads for them all.
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> Best,
> > > > > >>>>>>>>>>>>>>>>>>> Stan
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>> On 2025/04/16 11:58:22 Josep Prat wrote:
> > > > > >>>>>>>>>>>>>>>>>>>> Hi Kafka Devs!
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> We want to start a new KIP discussion about
> > > > > >>>>>>>>> introducing a
> > > > > >>>>>>>>>>>> new
> > > > > >>>>>>>>>>>>>>> type of
> > > > > >>>>>>>>>>>>>>>>>>>> topics that would make use of Object
> > > > > >>>> Storage as
> > > > > >>>>>>> the
> > > > > >>>>>>>>> primary
> > > > > >>>>>>>>>>>>>>> source of
> > > > > >>>>>>>>>>>>>>>>>>>> storage. However, as this KIP is big we
> > > > > >>>> decided
> > > > > >>>>>>> to
> > > > > >>>>>>>>> split it
> > > > > >>>>>>>>>>>>>> into
> > > > > >>>>>>>>>>>>>>>>>> multiple
> > > > > >>>>>>>>>>>>>>>>>>>> related KIPs.
> > > > > >>>>>>>>>>>>>>>>>>>> We have the motivational KIP-1150 (
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > > > > >>>>>>>>>>>>>>>>>>> )
> > > > > >>>>>>>>>>>>>>>>>>>> that aims to discuss if Apache Kafka
> > > > > >>>> should aim
> > > > > >>>>>>> to
> > > > > >>>>>>>>> have
> > > > > >>>>>>>>>>>> this
> > > > > >>>>>>>>>>>>>>> type of
> > > > > >>>>>>>>>>>>>>>>>>>> feature at all. This KIP doesn't go onto
> > > > > >>>>>> details
> > > > > >>>>>>> on
> > > > > >>>>>>>>> how to
> > > > > >>>>>>>>>>>>>>> implement
> > > > > >>>>>>>>>>>>>>>>>> it.
> > > > > >>>>>>>>>>>>>>>>>>>> This follows the same approach used when we
> > > > > >>>>>>> discussed
> > > > > >>>>>>>>>>>> KRaft.
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> But as we know that it is sometimes really
> > > > > >>>> hard
> > > > > >>>>>>> to
> > > > > >>>>>>>>> discuss
> > > > > >>>>>>>>>>>> on
> > > > > >>>>>>>>>>>>>>> that
> > > > > >>>>>>>>>>>>>>>>> meta
> > > > > >>>>>>>>>>>>>>>>>>>> level, we also created several sub-kips
> > > > > >>>> (linked
> > > > > >>>>>>> in
> > > > > >>>>>>>>>>>> KIP-1150)
> > > > > >>>>>>>>>>>>>> that
> > > > > >>>>>>>>>>>>>>>>> offer
> > > > > >>>>>>>>>>>>>>>>>>> an
> > > > > >>>>>>>>>>>>>>>>>>>> implementation of this feature.
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> We kindly ask you to use the proper DISCUSS
> > > > > >>>>>>> threads
> > > > > >>>>>>>>> for
> > > > > >>>>>>>>>>>> each
> > > > > >>>>>>>>>>>>>>> type of
> > > > > >>>>>>>>>>>>>>>>>>>> concern and keep this one to discuss
> > > > > >>>> whether
> > > > > >>>>>>> Apache
> > > > > >>>>>>>>> Kafka
> > > > > >>>>>>>>>>>> wants
> > > > > >>>>>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>>>>> have
> > > > > >>>>>>>>>>>>>>>>>>>> this feature or not.
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> Thanks in advance on behalf of all the
> > > > > >>>> authors
> > > > > >>>>>> of
> > > > > >>>>>>>>> this KIP.
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>> ------------------
> > > > > >>>>>>>>>>>>>>>>>>>> Josep Prat
> > > > > >>>>>>>>>>>>>>>>>>>> Open Source Engineering Director, Aiven
> > > > > >>>>>>>>>>>>>>>>>>>> [email protected]   |   +491715557497 |
> > > > > >>>>>>> aiven.io
> > > > > >>>>>>>>>>>>>>>>>>>> Aiven Deutschland GmbH
> > > > > >>>>>>>>>>>>>>>>>>>> Alexanderufer 3-7, 10117 Berlin
> > > > > >>>>>>>>>>>>>>>>>>>> Geschäftsführer: Oskari Saarenmaa, Hannu
> > > > > >>>>>>> Valtonen,
> > > > > >>>>>>>>>>>>>>>>>>>> Anna Richardson, Kenneth Chen
> > > > > >>>>>>>>>>>>>>>>>>>> Amtsgericht Charlottenburg, HRB 209739 B
> > > > > >>>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> Anatolii Popov
> Senior Software Developer, *Aiven OY*
> m: +358505126242
> w: aiven.io  e: [email protected]
> <https://www.facebook.com/aivencloud>
> <https://www.linkedin.com/company/aiven/>   <https://twitter.com/aiven_io>
>
