Hi all,

Thank you for the extensive feedback.

We have substantially updated the KIP to address the points raised. Given
the scope of these changes, we have started a new VOTE thread to restart
the voting process cleanly.

You can find the new thread here:
https://lists.apache.org/thread/m42nj2qm5z4w2x8kt7x4kgghfzrdwl7q

Best,
Anatolii

On Mon, Dec 15, 2025 at 9:47 PM Thomas Thornton via dev <
[email protected]> wrote:

> Hi all,
>
> We (the team at Slack) have been following the recent discussion regarding
> the scope and timeline of KIP-1150. We agree with the community sentiment
> that Diskless Topics represents the right long-term architecture for Kafka
> in the cloud, but we also recognize the valid concerns raised regarding the
> engineering resources required to deliver such an ambitious change.
>
> To help address these concerns and accelerate the timeline, we are happy to
> announce that we are partnering with the KIP-1150 authors to co-develop
> this feature.
>
> We previously proposed KIP-1176: Tiered Storage for Active Log Segments to
> solve similar problems. However, rather than fragmenting the community's
> efforts across competing designs, we have decided to withdraw KIP-1176 and
> consolidate our engineering resources behind KIP-1150.
>
> To start, we plan to take ownership of Compaction for Tiered Storage. This
> has long been a missing feature in KIP-405, and it becomes critical in a
> Diskless architecture where long-lived data must be tiered. By driving this
> prerequisite feature, we hope to enable faster delivery of KIP-1150.
>
> We are excited to collaborate on this to ensure a robust and timely
> delivery for the community.
>
> Best,
> Tom & Henry
>
> On Fri, Nov 14, 2025 at 8:35 AM Luke Chen <[email protected]> wrote:
>
> > Hi Greg,
> >
> > Thanks for sharing the meeting notes.
> > I agree we should keep polishing the contents of 1150 & high level design
> > in 1163 to prepare for a vote.
> >
> >
> > Thanks.
> > Luke
> >
> > On Fri, Nov 14, 2025 at 3:54 AM Greg Harris <[email protected]
> >
> > wrote:
> >
> > > Hi all,
> > >
> > > There was a video call between myself, Ivan Yurchenko, Jun Rao, and
> > Andrew
> > > Schofield pertaining to KIP-1150. Here are the notes from that meeting:
> > >
> > > Ivan: What is the future state of Kafka in this area, in 5 years?
> > > Jun: Do we want something more cloud native? Yes, started with Tiered
> > > Storage. If there’s a better way, we should explore it. In the long
> term
> > > this will be useful
> > > Because Kafka is used so widely, we need to make sure everything we add
> > is
> > > for the long term and for everyone, not just for a single company.
> > > When we add TS, it doesn’t just solve Uber’s use-case. We want
> something
> > > that’s high quality/lasts/maintainable, and can work with all existing
> > > capabilities.
> > > If both 1150 and 1176 proceed at the same time, it’s confusing. They
> > > overlap, but Diskless is more ambitious.
> > > If both KIPs are being seriously worked on, then we don’t really need
> > both,
> > > because Diskless clearly is better. Having multiple will confuse
> people.
> > It
> > > will duplicate some of the effort.
> > > If we want diskless ultimately, what is the short term strategy, to get
> > > some early wins first?
> > > Ivan: Andrew, do you want a more revolutionary approach?
> > > Andrew: Eventually the architecture will change substantially, it may
> not
> > > be necessary to put all of that bill onto Diskless at once.
> > > Greg: We all agree on having a high quality feature merged upstream,
> and
> > > supporting all APIs
> > > Jun: We should try and keep things simple, but there is some minimum
> > > complexity needed.
> > > Doing the short-term changes (1176) doesn't really make progress toward a
> > > more modern architecture.
> > > Greg: Was TS+Compaction the only feature miss we’ve had so far?
> > > Jun: The danger of only applying changes to some part of the API is that
> > > you set the precedent that you only have to implement part of the API.
> > > Supporting the full API set should be a minimum requirement.
> > > Andrew: When we started KRaft, how much of the design did we know?
> > > Jun: For KRaft we didn’t really know much about the migration, but the
> > > high-level design was clear.
> > > Greg: Is 1150 votable in its current state?
> > > Jun: 1150 should promise to support all APIs. It doesn’t have to have
> all
> > > the details/apis/etc. KIP-500 didn’t have it.
> > > We do need some high-level design enough to give confidence that the
> > > promise is able to be fulfilled.
> > > Greg: Is the draft version in 1163 enough detail or is more needed?
> > > Jun: We need to agree on the core design, such as leaderless etc. And
> how
> > > the APIs will be supported.
> > > Greg: Okay we can include these things, and provide a sketch of how the
> > > other leader-based features operate.
> > > Jun: Yeah if at a high level the sketch appears to work, we can approve
> > > that functionality.
> > > Are you committed to doing the more involved and big project?
> > > Greg: Yes, we’re committed to the 1163 design and can’t really accept
> > 1176.
> > > Jun: TS was slow because of Uber resourcing problems
> > > Greg: We’ll push internally for resources, and use the community
> > sentiment
> > > to motivate Aiven.
> > > How far into the future should we look? What sort of scale?
> > > Jun: As long as there’s a path forward, and we’re not closing off
> future
> > > improvements, we can figure out how to handle a larger scale when it
> > > arises.
> > > Greg: Random replica placement is very harmful, can we recommend users
> to
> > > use an external tool like CruiseControl?
> > > Jun: Not everyone uses CruiseControl, we would probably need some
> > solution
> > > for this out of the box
> > > Ivan: Should the Batch Coordinator be pluggable?
> > > Jun: Out-of-box experience should be good, good to allow other
> > > implementations
> > > Greg: But it could hurt Kafka feature/upgrade velocity when we wait for
> > > plugin providers to implement it
> > > Ivan: We imagined that maybe cloud hyperscalers could implement it with
> > > e.g. dynamodb
> > > Greg: Could we bake more details of the different providers into Kafka,
> > or
> > > does it still make sense for it to be pluggable?
> > > Jun: Make it whatever is easiest to roll out and add new clients
> > > Andrew: What happens next? Do you want to get KIP-1150 voted?
> > > Ivan: The vote is already open, we’re not too pressed for time. We’ll
> go
> > > improve the 1163 design and communication.
> > > Is 1176 a competing design? Someone will ask.
> > > Jun: If we are seriously working on something more ambitious, yeah we
> > > shouldn’t do the stop-gap solution.
> > > It’s diverting review resources. If we can get the short term thing in
> > 1yr
> > > but Diskless solution is 2y it makes sense to go for Diskless. If it’s
> > 5yr,
> > > that’s different and maybe the stop-gap solution is needed.
> > > Greg: I’m biased but I believe we’re in the 1yr/2yr case. Should we
> > > explicitly exclude 1176?
> > > Andrew: Put your arms around the feature set you actually want, and use
> > > that to rule out 1176.
> > > Probably don’t need -1 votes, most likely KIPs just don’t receive
> votes.
> > > Ivan: Should we have sync meetings like tiered storage did?
> > > Jun: Satish posted meeting notes regularly, we should do the same.
> > >
> > > To summarize, we will be polishing the contents of 1150 & high level
> > design
> > > in 1163 to prepare for a vote.
> > > We believe that the community should select the feature set of 1150 to
> > > fully eliminate producer cross-zone costs, and make the investment in a
> > > high quality Diskless Topics implementation rather than in stop-gap
> > > solutions.
> > >
> > > Thanks,
> > > Greg
> > >
> > > On Fri, Nov 7, 2025 at 9:19 PM Max fortun <[email protected]> wrote:
> > >
> > > > This may be a tangent, but we needed to offload storage from Kafka into
> > > > S3. We are keeping Kafka not as a source of truth, but as a mostly
> > > > ephemeral broker that can come and go as it pleases, be that scaling or
> > > > an outage. Disks can be destroyed and recreated at will; we still retain
> > > > data and use the broker for just that, brokering messages. Not only
> > > > that, we reduced the requirement on the actual Kafka resources by
> > > > reducing the size of a payload via a claim check pattern. Maybe this is
> > > > an anti-pattern, but it is super fast and highly cost-efficient. We
> > > > reworked ProducerRequest to allow plugins. We added a custom HTTP plugin
> > > > that submits every request via a persistent connection to a
> > > > microservice. The microservice stores the payload and returns a tiny
> > > > JSON metadata object, a claim check, that can be used to find the actual
> > > > data. Think of it as zipping the payload. This claim check metadata
> > > > traverses the pipelines, with consumers using the URLs in the metadata
> > > > to pull what they need. Think unzipping. This also allowed us to pull
> > > > ONLY the data that we need, in a GraphQL-like manner. So if you have a
> > > > 100K JSON payload and you need only a subsection, you can pull that by
> > > > JMESPath. When you have multiple consumer groups yanking down huge
> > > > payloads, it is cumbersome on the broker. When you have the same
> > > > consumer groups yanking down a claim check, and then going out of band
> > > > directly to the source of truth, the broker has some breathing room.
> > > > Obviously our microservice does not go directly to the cloud storage, as
> > > > that would be too slow. It stores the payload in a high-speed memory
> > > > cache and returns ASAP. That memory is eventually persisted into S3.
> > > > Retrieval goes against the cache first, then against S3. Overall a
> > > > rather cheap and zippy solution. I tried proposing a KIP for this, but
> > > > there was no excitement. Check this out:
> > > >
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=318606528
> > > >
> > > >
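> > > > For illustration, a rough client-side sketch of this claim-check idea
> > > > using Kafka's standard ProducerInterceptor API rather than the
> > > > ProducerRequest plugin described above; ClaimCheckClient is a
> > > > hypothetical stand-in for the storage microservice:
> > > >
> > > > import java.util.Map;
> > > > import org.apache.kafka.clients.producer.ProducerInterceptor;
> > > > import org.apache.kafka.clients.producer.ProducerRecord;
> > > > import org.apache.kafka.clients.producer.RecordMetadata;
> > > >
> > > > public class ClaimCheckInterceptor implements ProducerInterceptor<String, byte[]> {
> > > >     // Hypothetical client for the microservice that stores the payload and
> > > >     // returns a tiny JSON claim check pointing at it.
> > > >     private final ClaimCheckClient store = new ClaimCheckClient();
> > > >
> > > >     @Override
> > > >     public ProducerRecord<String, byte[]> onSend(ProducerRecord<String, byte[]> record) {
> > > >         byte[] claimCheck = store.put(record.value());  // offload the large payload
> > > >         return new ProducerRecord<>(record.topic(), record.partition(),
> > > >                 record.timestamp(), record.key(), claimCheck, record.headers());
> > > >     }
> > > >
> > > >     @Override public void onAcknowledgement(RecordMetadata metadata, Exception exception) {}
> > > >     @Override public void close() {}
> > > >     @Override public void configure(Map<String, ?> configs) {}
> > > >
> > > >     static class ClaimCheckClient {  // stub for the sketch
> > > >         byte[] put(byte[] payload) { /* store payload, return claim-check JSON */ return new byte[0]; }
> > > >     }
> > > > }
> > > >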
> > > > > On Nov 7, 2025, at 5:49 PM, Jun Rao <[email protected]>
> > wrote:
> > > > >
> > > > > Hi, Andrew,
> > > > >
> > > > > If we want to focus only on reducing cross-zone replication costs,
> > > there
> > > > is
> > > > > an alternative design in the KIP-1176 discussion thread that seems
> > > > simpler
> > > > > than the proposal here. I am copying the outline of that approach
> > > below.
> > > > >
> > > > > 1. A new leader is elected.
> > > > > 2. Leader maintains a first tiered offset, which is initialized to
> > log
> > > > end
> > > > > offset.
> > > > > 3. Leader writes produced data from the client to local log.
> > > > > 4. Leader uploads produced data from all local logs as a combined
> > > object
> > > > > 5. Leader stores the metadata for the combined object in memory.
> > > > > 6. If a follower fetch request has an offset >= first tiered
> offset,
> > > the
> > > > > metadata for the corresponding combined object is returned.
> > Otherwise,
> > > > the
> > > > > local data is returned.
> > > > > 7. Leader periodically advances first tiered offset.
> > > > >
> > > > > It's still a bit unnatural, but it could work.
> > > > >
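> > > > > For concreteness, a minimal sketch of step 6 (routing a follower fetch
> > > > > by the first tiered offset); TieredFetchRouter and CombinedObjectRef
> > > > > are hypothetical names, not existing Kafka classes:
> > > > >
> > > > > import java.util.Map;
> > > > > import java.util.NavigableMap;
> > > > > import java.util.Optional;
> > > > > import java.util.concurrent.ConcurrentSkipListMap;
> > > > >
> > > > > class TieredFetchRouter {
> > > > >     record CombinedObjectRef(String objectKey, long baseOffset, int sizeInBytes) {}
> > > > >
> > > > >     // Step 2: initialized to the log end offset; step 7: advanced periodically.
> > > > >     private volatile long firstTieredOffset;
> > > > >     // Step 5: in-memory metadata of uploaded combined objects, keyed by base offset.
> > > > >     private final NavigableMap<Long, CombinedObjectRef> tieredIndex = new ConcurrentSkipListMap<>();
> > > > >
> > > > >     // Step 6: follower fetches at or above the first tiered offset get the
> > > > >     // combined-object metadata back; older offsets are served from the local log.
> > > > >     Optional<CombinedObjectRef> routeFollowerFetch(long fetchOffset) {
> > > > >         if (fetchOffset < firstTieredOffset) {
> > > > >             return Optional.empty();   // caller returns local log data instead
> > > > >         }
> > > > >         return Optional.ofNullable(tieredIndex.floorEntry(fetchOffset))
> > > > >                        .map(Map.Entry::getValue);
> > > > >     }
> > > > > }
> > > > >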
> > > > > Hi, Ivan,
> > > > >
> > > > > Are you still committed to proceeding with the original design of
> > > > KIP-1150?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Sun, Nov 2, 2025 at 6:00 AM Andrew Schofield <
> > > > [email protected]>
> > > > > wrote:
> > > > >
> > > > >> Hi,
> > > > >> I’ve been following KIP-1150 and friends for a while. I’m going to
> > > jump
> > > > >> into the discussions too.
> > > > >>
> > > > >> Looking back at Jack Vanlightly’s message, I am not quite so
> > convinced
> > > > >> that it’s a kind of fork in the road. The primary aim of the
> effort
> > is
> > > > to
> > > > >> reduce cross-zone replication costs so Apache Kafka is not
> > > prohibitively
> > > > >> expensive to use on cloud storage. I think it would be entirely
> > viable
> > > > to
> > > > >> prioritise code reuse for an initial implementation of diskless
> > > topics,
> > > > and
> > > > >> we could still have a more cloud-native design in the future. It’s
> > > hard
> > > > to
> > > > >> predict what the community will prioritise in the future.
> > > > >>
> > > > >> Of the three major revisions, I’m in the rev3 camp. We can support
> > > > >> leaderless produce requests, first writing WAL segments into
> object
> > > > >> storage, and then using the regular partition leaders to sequence
> > the
> > > > >> records. The active log segment for a diskless topic will
> initially
> > > > contain
> > > > >> batch coordinates rather than record batches. The batch
> coordinates
> > > can
> > > > be
> > > > >> resolved from WAL segments for consumers, and also in order to
> > prepare
> > > > log
> > > > >> segments for uploading to tiered storage. Jun is probably correct
> > that
> > > > we
> > > > >> need a more frequent object merging process than tiered storage
> > > > provides.
> > > > >> This is just the transition from write-optimised WAL segments to
> > > > >> read-optimised tiered segments, and all of the object
> storage-based
> > > > >> implementations of Kafka that I’m aware of do this rearrangement.
> > But
> > > > >> perhaps this more frequent object merging is a pre-GA improvement,
> > > > rather
> > > > >> than a strict requirement for an initial implementation for early
> > > access
> > > > >> use.
> > > > >>
> > > > >> For zone-aligned share consumers, the share group assignor is
> > intended
> > > > to
> > > > >> be rack-aware. Consumers should be assigned to partitions with
> > leaders
> > > > in
> > > > >> their zone. The simple assignor is not rack-aware, but it easily
> > could
> > > > be
> > > > >> or we could have a rack-aware assignor.
> > > > >>
> > > > >> Thanks,
> > > > >> Andrew
> > > > >>
> > > > >>
> > > > >>> On 24 Oct 2025, at 23:14, Jun Rao <[email protected]>
> > wrote:
> > > > >>>
> > > > >>> Hi, Ivan,
> > > > >>>
> > > > >>> Thanks for the reply.
> > > > >>>
> > > > >>> "As I understand, you’re speaking about locally materialized
> > > segments.
> > > > >> They
> > > > >>> will indeed consume some IOPS. See them as a cache that could
> > always
> > > be
> > > > >>> restored from the remote storage. While it’s not ideal, it's
> still
> > OK
> > > > to
> > > > >>> lose data in them due to a machine crash, for example. Because of
> > > this,
> > > > >> we
> > > > >>> can avoid explicit flushing on local materialized segments at all
> > and
> > > > let
> > > > >>> the file system and page cache figure out when to flush
> optimally.
> > > This
> > > > >>> would not eliminate the extra IOPS, but should reduce it
> > > dramatically,
> > > > >>> depending on throughput for each partition. We, of course,
> continue
> > > > >>> flushing the metadata segments as before."
> > > > >>>
> > > > >>> If we have a mix of classic and diskless topics on the same
> broker,
> > > > it's
> > > > >>> important that the classic topics' data is flushed to disk as
> > quickly
> > > > as
> > > > >>> possible. To achieve this, users typically set
> > dirty_expire_centisecs
> > > > in
> > > > >>> the kernel based on the number of available disk IOPS. Once you
> set
> > > > this
> > > > >>> number, it applies to all dirty files, including the cached data
> in
> > > > >>> diskless topics. So, if there are more files actively accumulating
> > > > >>> data, the flush frequency is reduced and therefore the RPO gets worse
> > > > >>> for classic topics.
> > > > >>>
> > > > >>> "We should have mentioned this explicitly, but this step, in
> fact,
> > > > >> remains
> > > > >>> in the form of segments offloading to tiered storage. When we
> > > assemble
> > > > a
> > > > >>> segment and hand it over to RemoteLogManager, we’re effectively
> > doing
> > > > >>> metadata compaction: replacing a big number of pieces of metadata
> > > about
> > > > >>> individual batches with a single record in
> __remote_log_metadata."
> > > > >>>
> > > > >>> The object merging in tier storage typically only kicks in after
> a
> > > few
> > > > >>> hours. The impact is (1) the amount of accumulated metadata is
> > still
> > > > >> quite
> > > > >>> large; (2) there are many small objects, leading to poor read
> > > > >> performance.
> > > > >>> I think we need a more frequent object merging process than tier
> > > > storage
> > > > >>> provides.
> > > > >>>
> > > > >>> Jun
> > > > >>>
> > > > >>>
> > > > >>> On Thu, Oct 23, 2025 at 10:12 AM Ivan Yurchenko <[email protected]>
> > > > wrote:
> > > > >>>
> > > > >>>> Hello Jack, Jun, Luke, and all!
> > > > >>>>
> > > > >>>> Thank you for your messages.
> > > > >>>>
> > > > >>>> Let me first address some of Jun’s comments.
> > > > >>>>
> > > > >>>>> First, it degrades the durability.
> > > > >>>>> For each partition, now there are two files being actively
> > written
> > > > at a
> > > > >>>>> given point of time, one for the data and another for the
> > metadata.
> > > > >>>>> Flushing each file requires a separate IO. If the disk has 1K
> > IOPS
> > > > and
> > > > >> we
> > > > >>>>> have 5K partitions in a broker, currently we can afford to
> flush
> > > each
> > > > >>>>> partition every 5 seconds, achieving an RPO of 5 seconds. If we
> > > > double
> > > > >>>> the
> > > > >>>>> number of files per partition, we can only flush each partition
> > > every
> > > > >> 10
> > > > >>>>> seconds, which makes RPO twice as bad.
> > > > >>>>
> > > > >>>> As I understand, you’re speaking about locally materialized
> > > segments.
> > > > >> They
> > > > >>>> will indeed consume some IOPS. See them as a cache that could
> > always
> > > > be
> > > > >>>> restored from the remote storage. While it’s not ideal, it's
> still
> > > OK
> > > > to
> > > > >>>> lose data in them due to a machine crash, for example. Because
> of
> > > > this,
> > > > >> we
> > > > >>>> can avoid explicit flushing on local materialized segments at
> all
> > > and
> > > > >> let
> > > > >>>> the file system and page cache figure out when to flush
> optimally.
> > > > This
> > > > >>>> would not eliminate the extra IOPS, but should reduce it
> > > dramatically,
> > > > >>>> depending on throughput for each partition. We, of course,
> > continue
> > > > >>>> flushing the metadata segments as before.
> > > > >>>>
> > > > >>>> It’s worth making a note on caching. I think nobody will disagree
> > > > >>>> that doing direct reads from remote storage every time a batch is
> > > > >>>> requested by a consumer will not be practical from either a
> > > > >>>> performance or an economic point of view. We need a way to keep the
> > > > >>>> number of GET requests down. There are multiple options, for example:
> > > > >>>> 1. Rack-aware distributed in-memory caching.
> > > > >>>> 2. Local in-memory caching. Comes with less network chattiness
> and
> > > > >> works
> > > > >>>> well if we have more or less stable brokers to consume from.
> > > > >>>> 3. Materialization of diskless logs on local disk. Way lower
> > impact
> > > on
> > > > >>>> RAM and also requires stable brokers for consumption (using just
> > > > >> assigned
> > > > >>>> replicas will probably work well).
> > > > >>>>
> > > > >>>> Materialization is one of possible options, but we can choose
> > > another
> > > > >> one.
> > > > >>>> However, we will have this dilemma regardless of whether we have
> > an
> > > > >>>> explicit coordinator or we go “coordinator-less”.
> > > > >>>>
> > > > >>>>> Second, if we ever need this
> > > > >>>>> metadata somewhere else, say in the WAL file manager, the
> > consumer
> > > > >> needs
> > > > >>>> to
> > > > >>>>> subscribe to every partition in the cluster, which is
> > inefficient.
> > > > The
> > > > >>>>> actual benefit of this approach is also questionable. On the
> > > surface,
> > > > >> it
> > > > >>>>> might seem that we could reduce the number of lines that need
> to
> > be
> > > > >>>> changed
> > > > >>>>> for this KIP. However, the changes are quite intrusive to the
> > > classic
> > > > >>>>> partition's code path and will probably make the code base
> harder
> > > to
> > > > >>>>> maintain in the long run. I like the original approach based on
> > the
> > > > >> batch
> > > > >>>>> coordinator much better than this one. We could probably
> refactor
> > > the
> > > > >>>>> producer state code so that it could be reused in the batch
> > > > >> coordinator.
> > > > >>>>
> > > > >>>> It’s hard to disagree with this. The explicit coordinator is
> more
> > a
> > > > side
> > > > >>>> thing, while coordinator-less approach is more about extending
> > > > >>>> ReplicaManager, UnifiedLog and others substantially.
> > > > >>>>
> > > > >>>>> Thanks for addressing the concerns on the number of RPCs in the
> > > > produce
> > > > >>>>> path. I agree that with the metadata crafting mechanism, we
> could
> > > > >>>> mitigate
> > > > >>>>> the RPC problem. However, since we now require the metadata to
> be
> > > > >>>>> collocated with the data on the same set of brokers, it's weird
> > > that
> > > > >> they
> > > > >>>>> are now managed by different mechanisms. The data assignment
> now
> > > uses
> > > > >> the
> > > > >>>>> metadata crafting mechanism, but the metadata is stored in the
> > > > classic
> > > > >>>>> partition using its own assignment strategy. It will be
> > complicated
> > > > to
> > > > >>>> keep
> > > > >>>>> them collocated.
> > > > >>>>
> > > > >>>> I would like to note that the metadata crafting is needed only
> to
> > > tell
> > > > >>>> producers which brokers they should send Produce requests to,
> but
> > > data
> > > > >> (as
> > > > >>>> in “locally materialized log”) is located on partition replicas,
> > > i.e.
> > > > >>>> automatically co-located with metadata.
> > > > >>>>
> > > > >>>> As a side note, it would probably be better that instead of
> > > implicitly
> > > > >>>> crafting partition metadata, we extend the metadata protocol so
> > that
> > > > for
> > > > >>>> diskless partitions we return not only the leader and replicas,
> > but
> > > > also
> > > > >>>> some “recommended produce brokers”, selected for optimal
> > performance
> > > > and
> > > > >>>> costs. Producers will pick ones in their racks.
> > > > >>>>
> > > > >>>>> I am also concerned about the removal of the object
> > > > compaction/merging
> > > > >>>>> step.
> > > > >>>>
> > > > >>>> We should have mentioned this explicitly, but this step, in
> fact,
> > > > >> remains
> > > > >>>> in the form of segments offloading to tiered storage. When we
> > > > assemble a
> > > > >>>> segment and hand it over to RemoteLogManager, we’re effectively
> > > doing
> > > > >>>> metadata compaction: replacing a big number of pieces of
> metadata
> > > > about
> > > > >>>> individual batches with a single record in
> __remote_log_metadata.
> > > > >>>>
> > > > >>>> We could create a Diskless-specific merging mechanism instead if
> > > > needed.
> > > > >>>> It’s rather easy with the explicit coordinator approach. With
> the
> > > > >>>> coordinator-less approach, this would probably be a bit more
> > tricky
> > > > >>>> (rewriting the tail of the log by the leader + replicating this
> > > change
> > > > >>>> reliably).
> > > > >>>>
> > > > >>>>> I see a tendency toward primarily optimizing for the fewest
> code
> > > > >> changes
> > > > >>>> in
> > > > >>>>> the KIP. Instead, our primary goal should be a clean design
> that
> > > can
> > > > >> last
> > > > >>>>> for the long term.
> > > > >>>>
> > > > >>>> Yes, totally agree.
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> Luke,
> > > > >>>>> I'm wondering if the complexity of designing txn and queue is
> > > because
> > > > >> of
> > > > >>>>> leaderless cluster, do you think it will be simpler if we only
> > > focus
> > > > on
> > > > >>>> the
> > > > >>>>> "diskless" design to handle object compaction/merging to/from
> the
> > > > >> remote
> > > > >>>>> storage to save the cross-AZ cost?
> > > > >>>>
> > > > >>>> After some evolution of the original proposal, leaderless is now
> > > > >> limited.
> > > > >>>> We only need to be able to accept Produce requests on more than
> > one
> > > > >> broker
> > > > >>>> to eliminate the cross-AZ costs for producers. Do I get it right
> > > that
> > > > >> you
> > > > >>>> propose to get rid of this? Or do I misunderstand?
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> Let’s now look at this problem from a higher level, as Jack
> > > proposed.
> > > > As
> > > > >>>> it was said, the big choice we need to make is whether we 1)
> > create
> > > an
> > > > >>>> explicit batch coordinator; or 2) go for the coordinator-less
> > > > approach,
> > > > >>>> where each diskless partition is managed by its leader as in
> > classic
> > > > >> topics.
> > > > >>>>
> > > > >>>> If we try to compare the two approaches:
> > > > >>>>
> > > > >>>> Pluggability:
> > > > >>>> - Explicit coordinator: Possible. For example, some setups may
> > > benefit
> > > > >>>> from batch metadata being stored in a cloud database (such as
> AWS
> > > > >> DynamoDB
> > > > >>>> or GCP Spanner).
> > > > >>>> - Coordinator-less: Impossible.
> > > > >>>>
> > > > >>>> Scalability and fault tolerance:
> > > > >>>> - Explicit coordinator: Depends on the implementation and it’s
> > also
> > > > >>>> necessary to actively work for it.
> > > > >>>> - Coordinator-less: Closer to classic Kafka topics. Scaling is
> > done
> > > by
> > > > >>>> partition placement, partitions could fail independently.
> > > > >>>>
> > > > >>>> Separation of concerns:
> > > > >>>> - Explicit coordinator: Very good. Diskless remains more
> > independent
> > > > >> from
> > > > >>>> classic topics in terms of code and workflows. For example, the
> > > > >>>> above-mentioned non-tiered storage metadata compaction mechanism
> > > could
> > > > >> be
> > > > >>>> relatively simply implemented with it. As a flip side of this,
> > some
> > > > >>>> workflows (e.g. transactions) will have to be adapted.
> > > > >>>> - Coordinator-less: Less so. It leans to the opposite: bringing
> > > > diskless
> > > > >>>> closer to classic topics. Some code paths and workflows could be
> > > more
> > > > >>>> straightforwardly reused, but they will inevitably have to be
> > > adapted
> > > > to
> > > > >>>> accommodate both topic types as also discussed.
> > > > >>>>
> > > > >>>> Cloud-nativeness. This is a vague concept, also related to the
> > > > previous,
> > > > >>>> but let’s try:
> > > > >>>> - Explicit coordinator: Storing and processing metadata
> separately
> > > > makes
> > > > >>>> it easier for brokers to take different roles, be purely
> stateless
> > > if
> > > > >>>> needed, etc.
> > > > >>>> - Coordinator-less: Less so. Something could be achieved with
> > > creative
> > > > >>>> partition placement, but not much.
> > > > >>>>
> > > > >>>> Both seem to have their pros and cons. However, answering Jack’s
> > > > >> question,
> > > > >>>> the explicit coordinator approach may indeed lead to a more
> > flexible
> > > > >> design.
> > > > >>>>
> > > > >>>>
> > > > >>>> The purpose of this deviation in the discussion was to receive a
> > > > >>>> preliminary community evaluation of the coordinator-less
> approach
> > > > >> without
> > > > >>>> taking on the task of writing a separate KIP and fitting it in
> the
> > > > >> system
> > > > >>>> of KIP-1150 and its children. We’re open to stopping it and
> > getting
> > > > >> back to
> > > > >>>> working out the coordinator design if the community doesn’t
> favor
> > > the
> > > > >>>> proposed approach.
> > > > >>>>
> > > > >>>> Best,
> > > > >>>> Ivan and Diskless team
> > > > >>>>
> > > > >>>> On Mon, Oct 20, 2025, at 05:58, Luke Chen wrote:
> > > > >>>>> Hi Ivan,
> > > > >>>>>
> > > > >>>>> As Jun pointed out, the updated design seems to have some
> > > > shortcomings
> > > > >>>>> although it simplifies the implementation.
> > > > >>>>>
> > > > >>>>> I'm wondering if the complexity of designing txn and queue is
> > > because
> > > > >> of
> > > > >>>>> leaderless cluster, do you think it will be simpler if we only
> > > focus
> > > > on
> > > > >>>> the
> > > > >>>>> "diskless" design to handle object compaction/merging to/from
> the
> > > > >> remote
> > > > >>>>> storage to save the cross-AZ cost?
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> Thank you,
> > > > >>>>> Luke
> > > > >>>>>
> > > > >>>>> On Sat, Oct 18, 2025 at 5:22 AM Jun Rao
> <[email protected]
> > >
> > > > >>>> wrote:
> > > > >>>>>
> > > > >>>>>> Hi, Ivan,
> > > > >>>>>>
> > > > >>>>>> Thanks for the explanation.
> > > > >>>>>>
> > > > >>>>>> "we write the reference to the WAL file with the batch data"
> > > > >>>>>>
> > > > >>>>>> I understand the approach now, but I think it is a hacky one.
> > > There
> > > > >> are
> > > > >>>>>> multiple shortcomings with this design. First, it degrades
> the
> > > > >>>> durability.
> > > > >>>>>> For each partition, now there are two files being actively
> > written
> > > > at
> > > > >> a
> > > > >>>>>> given point of time, one for the data and another for the
> > > metadata.
> > > > >>>>>> Flushing each file requires a separate IO. If the disk has 1K
> > IOPS
> > > > and
> > > > >>>> we
> > > > >>>>>> have 5K partitions in a broker, currently we can afford to
> flush
> > > > each
> > > > >>>>>> partition every 5 seconds, achieving an RPO of 5 seconds. If
> we
> > > > double
> > > > >>>> the
> > > > >>>>>> number of files per partition, we can only flush each
> partition
> > > > every
> > > > >>>> 10
> > > > >>>>>> seconds, which makes RPO twice as bad. Second, if we ever need
> > > this
> > > > >>>>>> metadata somewhere else, say in the WAL file manager, the
> > consumer
> > > > >>>> needs to
> > > > >>>>>> subscribe to every partition in the cluster, which is
> > inefficient.
> > > > The
> > > > >>>>>> actual benefit of this approach is also questionable. On the
> > > > surface,
> > > > >>>> it
> > > > >>>>>> might seem that we could reduce the number of lines that need
> to
> > > be
> > > > >>>> changed
> > > > >>>>>> for this KIP. However, the changes are quite intrusive to the
> > > > classic
> > > > >>>>>> partition's code path and will probably make the code base
> > harder
> > > to
> > > > >>>>>> maintain in the long run. I like the original approach based
> on
> > > the
> > > > >>>> batch
> > > > >>>>>> coordinator much better than this one. We could probably
> > refactor
> > > > the
> > > > >>>>>> producer state code so that it could be reused in the batch
> > > > >>>> coordinator.
> > > > >>>>>>
> > > > >>>>>> Thanks for addressing the concerns on the number of RPCs in
> the
> > > > >> produce
> > > > >>>>>> path. I agree that with the metadata crafting mechanism, we
> > could
> > > > >>>> mitigate
> > > > >>>>>> the RPC problem. However, since we now require the metadata to
> > be
> > > > >>>>>> collocated with the data on the same set of brokers, it's
> weird
> > > that
> > > > >>>> they
> > > > >>>>>> are now managed by different mechanisms. The data assignment
> now
> > > > uses
> > > > >>>> the
> > > > >>>>>> metadata crafting mechanism, but the metadata is stored in the
> > > > classic
> > > > >>>>>> partition using its own assignment strategy. It will be
> > > complicated
> > > > to
> > > > >>>> keep
> > > > >>>>>> them collocated.
> > > > >>>>>>
> > > > >>>>>> I am also concerned about the removal of the object
> > > > >>>>>> compaction/merging step. My first concern is the amount of metadata
> > > > >>>>>> that needs to be kept. Without object compaction, the metadata
> > > > >>>>>> generated in the produce path can only be deleted after remote
> > > > >>>>>> tiering kicks in. Let's say every 250 ms we produce 100 bytes of
> > > > >>>>>> metadata per partition, and remote tiering kicks in after 5 hours.
> > > > >>>>>> In a cluster with 100K partitions, we need to keep about
> > > > >>>>>> 100 * (1 / 0.25) * 5 * 3600 * 100K = 720 GB of metadata, quite
> > > > >>>>>> significant. A second concern is on performance. Every time we
> > need
> > > > to
> > > > >>>>>> rebuild the caching data, we need to read a bunch of small
> > objects
> > > > >>>> from S3,
> > > > >>>>>> slowing down the building process. If a consumer happens to
> need
> > > > such
> > > > >>>> data,
> > > > >>>>>> it could slow down the application.
> > > > >>>>>>
> > > > >>>>>> I see a tendency toward primarily optimizing for the fewest
> code
> > > > >>>> changes in
> > > > >>>>>> the KIP. Instead, our primary goal should be a clean design
> that
> > > can
> > > > >>>> last
> > > > >>>>>> for the long term.
> > > > >>>>>>
> > > > >>>>>> Thanks,
> > > > >>>>>>
> > > > >>>>>> Jun
> > > > >>>>>>
> > > > >>>>>> On Tue, Oct 14, 2025 at 11:02 AM Ivan Yurchenko <
> [email protected]
> > >
> > > > >>>> wrote:
> > > > >>>>>>
> > > > >>>>>>> Hi Jun,
> > > > >>>>>>>
> > > > >>>>>>> Thank you for your message. I’m sorry that I failed to
> clearly
> > > > >>>> explain
> > > > >>>>>> the
> > > > >>>>>>> idea. Let me try to fix this.
> > > > >>>>>>>
> > > > >>>>>>>> Does each partition now have a metadata partition and a
> > separate
> > > > >>>> data
> > > > >>>>>>>> partition? If so, I am concerned that it essentially doubles
> > the
> > > > >>>> number
> > > > >>>>>>> of
> > > > >>>>>>>> partitions, which impacts the number of open file
> descriptors
> > > and
> > > > >>>> the
> > > > >>>>>>>> required IOPS, and so on. It also seems wasteful to have a
> > > > separate
> > > > >>>>>>>> partition just to store the metadata. It's as if we are
> > creating
> > > > an
> > > > >>>>>>>> internal topic with an unbounded number of partitions.
> > > > >>>>>>>
> > > > >>>>>>> No. There will be only one physical partition per diskless
> > > > >>>> partition. Let
> > > > >>>>>>> me explain this with an example. Let’s say we have a diskless
> > > > >>>> partition
> > > > >>>>>>> topic-0. It has three replicas 0, 1, 2; 0 is the leader. We
> > > produce
> > > > >>>> some
> > > > >>>>>>> batches to this partition. The content of the segment file
> will
> > > be
> > > > >>>>>>> something like this (for each batch):
> > > > >>>>>>>
> > > > >>>>>>> BaseOffset: 00000000000000000000 (like in classic)
> > > > >>>>>>> Length: 123456 (like in classic)
> > > > >>>>>>> PartitionLeaderEpoch: like in classic
> > > > >>>>>>> Magic: like in classic
> > > > >>>>>>> CRC: like in classic
> > > > >>>>>>> Attributes: like in classic
> > > > >>>>>>> LastOffsetDelta: like in classic
> > > > >>>>>>> BaseTimestamp: like in classic
> > > > >>>>>>> MaxTimestamp: like in classic
> > > > >>>>>>> ProducerId: like in classic
> > > > >>>>>>> ProducerEpoch: like in classic
> > > > >>>>>>> BaseSequence: like in classic
> > > > >>>>>>> RecordsCount: like in classic
> > > > >>>>>>> Records:
> > > > >>>>>>> path/to/wal/files/5b55c4bb-f52a-4204-aea6-81226895158a; byte
> > > offset
> > > > >>>>>>> 123456
> > > > >>>>>>>
> > > > >>>>>>> It looks very much like classic log entries. The only
> > difference
> > > is
> > > > >>>> that
> > > > >>>>>>> instead of writing real Records, we write the reference to
> the
> > > WAL
> > > > >>>> file
> > > > >>>>>>> with the batch data (I guess we need only the name and the
> byte
> > > > >>>> offset,
> > > > >>>>>>> because the byte length is the standard field above).
> > Otherwise,
> > > > >>>> it’s a
> > > > >>>>>>> normal Kafka log with the leader and replicas.
> > > > >>>>>>>
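> > > > >>>>>>> To make the Records field above concrete, here is a rough sketch of
> > > > >>>>>>> the per-batch payload a diskless segment entry would carry instead of
> > > > >>>>>>> the record data (WalBatchCoordinates is a hypothetical name, not an
> > > > >>>>>>> existing Kafka class):
> > > > >>>>>>>
> > > > >>>>>>> // What replaces the Records field of a batch in a diskless segment.
> > > > >>>>>>> // The byte length comes from the standard Length field of the header.
> > > > >>>>>>> record WalBatchCoordinates(String walObjectKey, long byteOffset) {
> > > > >>>>>>>     // The example entry above would be stored as:
> > > > >>>>>>>     static WalBatchCoordinates example() {
> > > > >>>>>>>         return new WalBatchCoordinates(
> > > > >>>>>>>             "path/to/wal/files/5b55c4bb-f52a-4204-aea6-81226895158a", 123456L);
> > > > >>>>>>>     }
> > > > >>>>>>> }
> > > > >>>>>>>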
> > > > >>>>>>> So we have as many partitions for diskless as for classic. As for
> > > > >>>>>>> open file descriptors, let’s proceed to the following:
> > > > >>>>>>>
> > > > >>>>>>>> Are the metadata and
> > > > >>>>>>>> the data for the same partition always collocated on the
> same
> > > > >>>> broker?
> > > > >>>>>> If
> > > > >>>>>>>> so, how do we enforce that when replicas are reassigned?
> > > > >>>>>>>
> > > > >>>>>>> The source of truth for the data is still in WAL files on
> > object
> > > > >>>> storage.
> > > > >>>>>>> The source of truth for the metadata is in segment files on
> the
> > > > >>>> brokers
> > > > >>>>>> in
> > > > >>>>>>> the replica set. Two new mechanisms are planned, both
> > independent
> > > > of
> > > > >>>> this
> > > > >>>>>>> new proposal, but I want to present them to give the idea
> that
> > > > only a
> > > > >>>>>>> limited amount of data files will be operated locally:
> > > > >>>>>>> - We want to assemble batches into segment files and offload
> > them
> > > > to
> > > > >>>>>>> tiered storage in order to prevent the unbounded growth of
> > batch
> > > > >>>>>> metadata.
> > > > >>>>>>> For this, we need to open only a few file descriptors (for the
> the
> > > > >>>> segment
> > > > >>>>>>> file itself + the necessary indexes) before the segment is
> > fully
> > > > >>>> written
> > > > >>>>>>> and handed over to RemoteLogManager.
> > > > >>>>>>> - We want to assemble local segment files for caching
> purposes
> > as
> > > > >>>> well,
> > > > >>>>>>> i.e. to speed up fetching. This will not materialize the full
> > > > >>>> content of
> > > > >>>>>>> the log, but only the hot set according to some policy (or
> > > > >>>> configurable
> > > > >>>>>>> policies), i.e. the number of segments and file descriptors
> > will
> > > > >>>> also be
> > > > >>>>>>> limited.
> > > > >>>>>>>
> > > > >>>>>>>> The number of RPCs in the produce path is significantly
> > higher.
> > > > For
> > > > >>>>>>>> example, if a produce request has 100 partitions, in a
> cluster
> > > > >>>> with 100
> > > > >>>>>>>> brokers, each produce request could generate 100 more RPC
> > > > requests.
> > > > >>>>>> This
> > > > >>>>>>>> will significantly increase the request rate.
> > > > >>>>>>>
> > > > >>>>>>> This is a valid concern that we considered, but this issue
> can
> > be
> > > > >>>>>>> mitigated. I’ll try to explain the approach.
> > > > >>>>>>> The situation with a single broker is trivial: all the commit
> > > > >>>> requests go
> > > > >>>>>>> from the broker to itself.
> > > > >>>>>>> Let’s scale this to a multi-broker cluster, but located in
> the
> > > > single
> > > > >>>>>> rack
> > > > >>>>>>> (AZ). Any broker can accept Produce requests for diskless
> > > > >>>> partitions, but
> > > > >>>>>>> we can tell producers (through metadata) to always send
> Produce
> > > > >>>> requests
> > > > >>>>>> to
> > > > >>>>>>> leaders. For example, broker 0 hosts the leader replicas for
> > > > diskless
> > > > >>>>>>> partitions t1-0, t2-1, t3-0. It will receive diskless Produce
> > > > >>>> requests
> > > > >>>>>> for
> > > > >>>>>>> these partitions in various combinations, but only for them.
> > > > >>>>>>>
> > > > >>>>>>>                   Broker 0
> > > > >>>>>>>             +-----------------+
> > > > >>>>>>>             |    t1-0         |
> > > > >>>>>>>             |    t2-1 <--------------------+
> > > > >>>>>>>             |    t3-0         |            |
> > > > >>>>>>> produce      | +-------------+ |            |
> > > > >>>>>>> requests     | |  diskless   | |            |
> > > > >>>>>>> --------------->|   produce   +--------------+
> > > > >>>>>>> for these    | | WAL buffer  | |    commit requests
> > > > >>>>>>> partitions   | +-------------+ |    for these partitions
> > > > >>>>>>>             |                 |
> > > > >>>>>>>             +-----------------+
> > > > >>>>>>>
> > > > >>>>>>> The same applies to the other brokers in this cluster. Effectively,
> > > > >>>>>>> each broker will commit only to itself, which means 1 commit request
> > > > >>>>>>> per WAL buffer (this may be 0 physical network calls, if we wish,
> > > > >>>>>>> just a local function call).
> > > > >>>>>>>
> > > > >>>>>>> Now let’s scale this to multiple racks (AZs). Obviously, we
> > > cannot
> > > > >>>> always
> > > > >>>>>>> send Produce requests to the designated leaders of diskless
> > > > >>>> partitions:
> > > > >>>>>>> this would mean inter-AZ network traffic, which we would like
> > to
> > > > >>>> avoid.
> > > > >>>>>> To
> > > > >>>>>>> avoid it, we say that every broker has a “diskless produce
> > > > >>>>>> representative”
> > > > >>>>>>> in every AZ. If we continue our example: when a Produce
> request
> > > for
> > > > >>>> t1-0,
> > > > >>>>>>> t2-1, or t3-0 comes from a producer in AZ 0, it lands on
> > broker 0
> > > > >>>> (in the
> > > > >>>>>>> broker’s AZ the representative is the broker itself).
> However,
> > if
> > > > it
> > > > >>>>>> comes
> > > > >>>>>>> from AZ 1, it lands on broker 1; in AZ 2, it’s broker 2.
> > > > >>>>>>>
> > > > >>>>>>> |produce requests         |produce requests        |produce
> > > > >>>> requests
> > > > >>>>>>> |for t1-0, t2-1, t3-0     |for t1-0, t2-1, t3-0    |for t1-0,
> > > t2-1,
> > > > >>>>>> t3-0
> > > > >>>>>>> |from AZ 0                |from AZ 1               |from AZ 2
> > > > >>>>>>> v                         v                        v
> > > > >>>>>>> Broker 0 (AZ 0)        Broker 1 (AZ 1)        Broker 2 (AZ 2)
> > > > >>>>>>> +---------------+      +---------------+
> +---------------+
> > > > >>>>>>> |     t1-0      |      |               |      |
>  |
> > > > >>>>>>> |     t2-1      |      |               |      |
>  |
> > > > >>>>>>> |     t3-0      |      |               |      |
>  |
> > > > >>>>>>> +---------------+      +--------+------+
> +--------+------+
> > > > >>>>>>>    ^     ^                    |                      |
> > > > >>>>>>>    |     +--------------------+                      |
> > > > >>>>>>>    |     commit requests for these partitions        |
> > > > >>>>>>>    |                                                 |
> > > > >>>>>>>    +-------------------------------------------------+
> > > > >>>>>>>          commit requests for these partitions
> > > > >>>>>>>
> > > > >>>>>>> All the partitions that broker 0 is the leader of will be
> > > > >>>> “represented”
> > > > >>>>>> by
> > > > >>>>>>> brokers 1 and 2 in their AZs.
> > > > >>>>>>>
> > > > >>>>>>> Of course, this relationship goes both ways between AZs (not
> > > > >>>> necessarily
> > > > >>>>>>> between the same brokers). It means that provided the cluster
> > is
> > > > >>>> balanced
> > > > >>>>>>> by the number of brokers per AZ, each broker will represent
> > > > >>>>>> (number_of_azs
> > > > >>>>>>> - 1) other brokers. This will result in the situation that
> for
> > > the
> > > > >>>>>> majority
> > > > >>>>>>> of commits, each broker will do up to (number_of_azs - 1)
> > network
> > > > >>>> commit
> > > > >>>>>>> requests (plus one local). Cloud regions tend to have 3 AZs,
> > very
> > > > >>>> rarely
> > > > >>>>>>> more. That means, brokers will be doing up to 2 network
> commit
> > > > >>>> requests
> > > > >>>>>> per
> > > > >>>>>>> WAL file.
> > > > >>>>>>>
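> > > > >>>>>>> As a rough sketch of this fan-out (all types and helpers below are
> > > > >>>>>>> hypothetical stand-ins for the metadata cache lookup, the local call,
> > > > >>>>>>> and the inter-broker commit request):
> > > > >>>>>>>
> > > > >>>>>>> import java.util.List;
> > > > >>>>>>> import java.util.Map;
> > > > >>>>>>> import java.util.stream.Collectors;
> > > > >>>>>>>
> > > > >>>>>>> class WalBufferCommitter {
> > > > >>>>>>>     record BatchInfo(String topicPartition) {}
> > > > >>>>>>>
> > > > >>>>>>>     private final int localBrokerId;
> > > > >>>>>>>     WalBufferCommitter(int localBrokerId) { this.localBrokerId = localBrokerId; }
> > > > >>>>>>>
> > > > >>>>>>>     // One commit per partition leader of the flushed WAL buffer; the local
> > > > >>>>>>>     // leader's share is a plain method call, not a network request.
> > > > >>>>>>>     void commit(String walObjectKey, List<BatchInfo> batches) {
> > > > >>>>>>>         Map<Integer, List<BatchInfo>> byLeader = batches.stream()
> > > > >>>>>>>             .collect(Collectors.groupingBy(b -> leaderOf(b.topicPartition())));
> > > > >>>>>>>         byLeader.forEach((leaderId, group) -> {
> > > > >>>>>>>             if (leaderId == localBrokerId) commitLocally(walObjectKey, group);
> > > > >>>>>>>             else sendCommitRequest(leaderId, walObjectKey, group);  // up to (AZs - 1)
> > > > >>>>>>>         });
> > > > >>>>>>>     }
> > > > >>>>>>>
> > > > >>>>>>>     int leaderOf(String topicPartition) { return localBrokerId; }               // placeholder
> > > > >>>>>>>     void commitLocally(String key, List<BatchInfo> group) {}                    // placeholder
> > > > >>>>>>>     void sendCommitRequest(int brokerId, String key, List<BatchInfo> group) {}  // placeholder
> > > > >>>>>>> }
> > > > >>>>>>>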
> > > > >>>>>>> There are the following exceptions:
> > > > >>>>>>> 1. Broker count imbalance between AZs. For example, when we
> > have
> > > 2
> > > > >>>> AZs
> > > > >>>>>> and
> > > > >>>>>>> one has three brokers and another AZ has one. This one broker
> > > will
> > > > do
> > > > >>>>>>> between 1 and 3 commit requests per WAL file. This is not an
> > > > extreme
> > > > >>>>>>> amplification. Such an imbalance is not healthy in most
> > practical
> > > > >>>> setups
> > > > >>>>>>> and should be avoided anyway.
> > > > >>>>>>> 2. Leadership changes and metadata propagation period. When
> the
> > > > >>>> partition
> > > > >>>>>>> t3-0 is relocated from broker 0 to some broker 3, the
> producers
> > > > will
> > > > >>>> not
> > > > >>>>>>> know this immediately (unless we want to be strict and
> respond
> > > with
> > > > >>>>>>> NOT_LEADER_OR_FOLLOWER). So if t1-0, t2-1, and t3-0 will come
> > > > >>>> together
> > > > >>>>>> in a
> > > > >>>>>>> WAL buffer on broker 2, it will have to send two commit
> > requests:
> > > > to
> > > > >>>>>> broker
> > > > >>>>>>> 0 to commit t1-0 and t2-1, and to broker 3 to commit t3-0.
> This
> > > > >>>> situation
> > > > >>>>>>> is not permanent and as producers update the cluster
> metadata,
> > it
> > > > >>>> will be
> > > > >>>>>>> resolved.
> > > > >>>>>>>
> > > > >>>>>>> This all could be built with the metadata crafting mechanism
> > only
> > > > >>>> (which
> > > > >>>>>>> is anyway needed for Diskless in one way or another to direct
> > > > >>>> producers
> > > > >>>>>> and
> > > > >>>>>>> consumers where we need to avoid inter-AZ traffic), just with
> > the
> > > > >>>> right
> > > > >>>>>>> policy for it (for example, some deterministic hash-based
> > > formula).
> > > > >>>> I.e.
> > > > >>>>>> no
> > > > >>>>>>> explicit support for “produce representative” or anything
> like
> > > this
> > > > >>>> is
> > > > >>>>>>> needed on the cluster level, in KRaft, etc.
> > > > >>>>>>>
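> > > > >>>>>>> For illustration, one possible shape of such a deterministic formula
> > > > >>>>>>> (purely hypothetical; the real policy would live in whatever crafts
> > > > >>>>>>> the metadata returned to producers):
> > > > >>>>>>>
> > > > >>>>>>> import java.util.ArrayList;
> > > > >>>>>>> import java.util.Collections;
> > > > >>>>>>> import java.util.List;
> > > > >>>>>>>
> > > > >>>>>>> class ProduceRepresentativePolicy {
> > > > >>>>>>>     // leaderBrokerId: the diskless partition's leader.
> > > > >>>>>>>     // brokersInProducerAz: brokers in the producer's AZ (assumed non-empty).
> > > > >>>>>>>     static int representativeFor(int leaderBrokerId, List<Integer> brokersInProducerAz) {
> > > > >>>>>>>         List<Integer> sorted = new ArrayList<>(brokersInProducerAz);
> > > > >>>>>>>         Collections.sort(sorted);
> > > > >>>>>>>         // The same leader always maps to the same broker within an AZ, so
> > > > >>>>>>>         // leaders are spread roughly evenly across that AZ's brokers.
> > > > >>>>>>>         return sorted.get(Math.floorMod(leaderBrokerId, sorted.size()));
> > > > >>>>>>>     }
> > > > >>>>>>> }
> > > > >>>>>>>
> > > > >>>>>>> In the leader's own AZ the policy would simply pick the leader itself,
> > > > >>>>>>> as in the example above.
> > > > >>>>>>>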
> > > > >>>>>>>> The same WAL file metadata is now duplicated into two
> places,
> > > > >>>> partition
> > > > >>>>>>>> leader and WAL File Manager. Which one is the source of
> truth,
> > > and
> > > > >>>> how
> > > > >>>>>> do
> > > > >>>>>>>> we maintain consistency between the two places?
> > > > >>>>>>>
> > > > >>>>>>> We do only two operations on WAL files that span multiple
> > > diskless
> > > > >>>>>>> partitions: committing and deleting. Commits can be done
> > > > >>>> independently as
> > > > >>>>>>> described above. But deletes are different, because when a
> file
> > > is
> > > > >>>>>> deleted,
> > > > >>>>>>> this affects all the partitions that still have alive batches
> > in
> > > > this
> > > > >>>>>> file
> > > > >>>>>>> (if any).
> > > > >>>>>>>
> > > > >>>>>>> The WAL file manager is a necessary point of coordination to
> > > delete
> > > > >>>> WAL
> > > > >>>>>>> files safely. We can say it is the source of truth about
> files
> > > > >>>>>> themselves,
> > > > >>>>>>> while the partition leaders and their logs hold the truth
> about
> > > > >>>> whether a
> > > > >>>>>>> particular file contains live batches of this particular
> > > partition.
> > > > >>>>>>>
> > > > >>>>>>> The file manager will do this important task: be able to say
> > for
> > > > sure
> > > > >>>>>> that
> > > > >>>>>>> a file does not contain any live batch of any existing
> > partition.
> > > > For
> > > > >>>>>> this,
> > > > >>>>>>> it will have to periodically check against the partition
> > leaders.
> > > > >>>>>>> Considering that batch deletion is irreversible, when we
> > declare
> > > a
> > > > >>>> file
> > > > >>>>>>> “empty”, this is guaranteed to be and stay so.
> > > > >>>>>>>
> > > > >>>>>>> The file manager has to know about files being committed in order
> > > > >>>>>>> to start tracking them and periodically check if they are empty. We
> > > > >>>>>>> can consider various ways to achieve this:
> > > > >>>>>>> 1. As was proposed in my previous message: best-effort commit by
> > > > >>>>>>> brokers + periodic prefix scans of object storage to detect files
> > > > >>>>>>> that went under the radar due to a network issue or temporary
> > > > >>>>>>> unavailability of the file manager. We’re speaking about listing the
> > > > >>>>>>> file names only and opening only previously unknown files in order
> > > > >>>>>>> to find the partitions involved with them.
> > > > >>>>>>> 2. Only do scans without explicit commit, i.e. fill the list of
> > > > >>>>>>> files fully asynchronously and in the background. This may not be
> > > > >>>>>>> ideal due to the costs and performance of scanning tons of files.
> > > > >>>>>>> However, the number of live WAL files should be limited due to
> > > > >>>>>>> tiered storage offloading + we can optimize this if we give files
> > > > >>>>>>> some global soft order in their names.
> > > > >>>>>>>
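> > > > >>>>>>> To make the safety check described above concrete, a rough sketch
> > > > >>>>>>> with hypothetical interfaces (none of these types exist in Kafka
> > > > >>>>>>> today):
> > > > >>>>>>>
> > > > >>>>>>> import java.util.HashSet;
> > > > >>>>>>> import java.util.Set;
> > > > >>>>>>>
> > > > >>>>>>> class WalFileManager {
> > > > >>>>>>>     interface LeaderClient { boolean hasLiveBatchesIn(String topicPartition, String objectKey); }
> > > > >>>>>>>     interface ObjectStore  { void delete(String objectKey); }
> > > > >>>>>>>     record TrackedWalFile(String objectKey, Set<String> partitions) {}
> > > > >>>>>>>
> > > > >>>>>>>     private final LeaderClient leaders;
> > > > >>>>>>>     private final ObjectStore store;
> > > > >>>>>>>     private final Set<TrackedWalFile> tracked = new HashSet<>();
> > > > >>>>>>>
> > > > >>>>>>>     WalFileManager(LeaderClient leaders, ObjectStore store) {
> > > > >>>>>>>         this.leaders = leaders;
> > > > >>>>>>>         this.store = store;
> > > > >>>>>>>     }
> > > > >>>>>>>
> > > > >>>>>>>     // Periodic task: a WAL file may be deleted only once no partition leader
> > > > >>>>>>>     // reports a live batch in it; since batch deletion is irreversible, an
> > > > >>>>>>>     // "empty" verdict can never flip back to "live".
> > > > >>>>>>>     void maybeDeleteWalFiles() {
> > > > >>>>>>>         tracked.removeIf(file -> {
> > > > >>>>>>>             boolean anyLive = file.partitions().stream()
> > > > >>>>>>>                 .anyMatch(tp -> leaders.hasLiveBatchesIn(tp, file.objectKey()));
> > > > >>>>>>>             if (!anyLive) store.delete(file.objectKey());
> > > > >>>>>>>             return !anyLive;
> > > > >>>>>>>         });
> > > > >>>>>>>     }
> > > > >>>>>>> }
> > > > >>>>>>>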
> > > > >>>>>>>> I am not sure how this design simplifies the implementation.
> > The
> > > > >>>>>> existing
> > > > >>>>>>>> producer/replication code can't be simply reused. Adjusting
> > both
> > > > >>>> the
> > > > >>>>>>> write
> > > > >>>>>>>> path in the leader and the replication path in the follower
> to
> > > > >>>>>> understand
> > > > >>>>>>>> batch-header only data is quite intrusive to the existing
> > logic.
> > > > >>>>>>>
> > > > >>>>>>> It is true that we’ll have to change LocalLog and UnifiedLog
> in
> > > > >>>> order to
> > > > >>>>>>> support these changes. However, it seems that idempotence,
> > > > >>>> transactions,
> > > > >>>>>>> queues, tiered storage will have to be changed less than with
> > the
> > > > >>>>>> original
> > > > >>>>>>> design. This is because the partition leader state would
> remain
> > > in
> > > > >>>> the
> > > > >>>>>> same
> > > > >>>>>>> place (on brokers) and existing workflows that involve it
> would
> > > > have
> > > > >>>> to
> > > > >>>>>> be
> > > > >>>>>>> changed less compared to the situation where we globalize the
> > > > >>>> partition
> > > > >>>>>>> leader state in the batch coordinator. I admit this is hard
> to
> > > make
> > > > >>>>>>> convincing without both real implementations to hand :)
> > > > >>>>>>>
> > > > >>>>>>>> I am also
> > > > >>>>>>>> not sure how this enables seamless switching the topic modes
> > > > >>>> between
> > > > >>>>>>>> diskless and classic. Could you provide more details on
> those?
> > > > >>>>>>>
> > > > >>>>>>> Let’s consider the scenario of turning a classic topic into
> > > > >>>> diskless. The
> > > > >>>>>>> user sets diskless.enabled=true, the leader receives this
> > > metadata
> > > > >>>> update
> > > > >>>>>>> and does the following:
> > > > >>>>>>> 1. Stop accepting normal append writes.
> > > > >>>>>>> 2. Close the current active segment.
> > > > >>>>>>> 3. Start a new segment that will be written in the diskless
> > > format
> > > > >>>> (i.e.
> > > > >>>>>>> without data).
> > > > >>>>>>> 4. Start accepting diskless commits.
> > > > >>>>>>>
> > > > >>>>>>> Since it’s the same log, the followers will know about that
> > > switch
> > > > >>>>>>> consistently. They will finish replicating the classic
> segments
> > > and
> > > > >>>> start
> > > > >>>>>>> replicating the diskless ones. They will always know where
> each
> > > > >>>> batch is
> > > > >>>>>>> located (either inside a classic segment or referenced by a
> > > > diskless
> > > > >>>>>> one).
> > > > >>>>>>> Switching back should be similar.
> > > > >>>>>>>
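> > > > >>>>>>> A condensed sketch of the leader-side transition just listed (none of
> > > > >>>>>>> these names are real Kafka internals; the log interface is purely
> > > > >>>>>>> illustrative):
> > > > >>>>>>>
> > > > >>>>>>> class DisklessModeSwitch {
> > > > >>>>>>>     enum AppendMode { CLASSIC_APPENDS, NONE, DISKLESS_COMMITS }
> > > > >>>>>>>     interface PartitionLog {
> > > > >>>>>>>         void closeActiveSegment();
> > > > >>>>>>>         void startNewSegment(boolean disklessFormat);
> > > > >>>>>>>     }
> > > > >>>>>>>
> > > > >>>>>>>     private volatile AppendMode appendMode = AppendMode.CLASSIC_APPENDS;
> > > > >>>>>>>     private final PartitionLog log;
> > > > >>>>>>>     DisklessModeSwitch(PartitionLog log) { this.log = log; }
> > > > >>>>>>>
> > > > >>>>>>>     // Invoked on the leader when diskless.enabled=true arrives via metadata;
> > > > >>>>>>>     // followers learn about the switch by replicating the same log.
> > > > >>>>>>>     synchronized void switchToDiskless() {
> > > > >>>>>>>         appendMode = AppendMode.NONE;              // 1. stop accepting normal appends
> > > > >>>>>>>         log.closeActiveSegment();                  // 2. close the current active segment
> > > > >>>>>>>         log.startNewSegment(true);                 // 3. next segment uses the diskless format
> > > > >>>>>>>         appendMode = AppendMode.DISKLESS_COMMITS;  // 4. start accepting diskless commits
> > > > >>>>>>>     }
> > > > >>>>>>> }
> > > > >>>>>>>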
> > > > >>>>>>> Doing this with the coordinator is possible, but has some
> > > caveats.
> > > > >>>> The
> > > > >>>>>>> leader must do the following:
> > > > >>>>>>> 1. Stop accepting normal append writes.
> > > > >>>>>>> 2. Close the current active segment.
> > > > >>>>>>> 3. Write a special control segment to persist and replicate
> the
> > > > fact
> > > > >>>> that
> > > > >>>>>>> from offset N the partition is now in the diskless mode.
> > > > >>>>>>> 4. Inform the coordinator about the first offset N of the
> > > “diskless
> > > > >>>> era”.
> > > > >>>>>>> 5. Inform the controller quorum that the transition has
> > finished
> > > > and
> > > > >>>> that
> > > > >>>>>>> brokers now can process diskless writes for this partition.
> > > > >>>>>>> This could fail at some points, so this will probably require
> > > some
> > > > >>>>>>> explicit state machine with replication either in the
> partition
> > > log
> > > > >>>> or in
> > > > >>>>>>> KRaft.
> > > > >>>>>>>
> > > > >>>>>>> It seems that the coordinator-less approach makes this
> simpler
> > > > >>>> because
> > > > >>>>>> the
> > > > >>>>>>> “coordinator” for the partition and the partition leader are
> > the
> > > > >>>> same and
> > > > >>>>>>> they store the partition metadata in the same log, too. While
> > in
> > > > the
> > > > >>>>>>> coordinator approach we have to perform some kind of a
> > > distributed
> > > > >>>> commit
> > > > >>>>>>> to handover metadata management from the classic partition
> > leader
> > > > to
> > > > >>>> the
> > > > >>>>>>> batch coordinator.
> > > > >>>>>>>
> > > > >>>>>>> I hope these explanations help to clarify the idea. Please
> let
> > me
> > > > >>>> know if
> > > > >>>>>>> I should go deeper anywhere.
> > > > >>>>>>>
> > > > >>>>>>> Best,
> > > > >>>>>>> Ivan and the Diskless team
> > > > >>>>>>>
> > > > >>>>>>> On Tue, Oct 7, 2025, at 01:44, Jun Rao wrote:
> > > > >>>>>>>> Hi, Ivan,
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks for the update.
> > > > >>>>>>>>
> > > > >>>>>>>> I am not sure that I fully understand the new design, but it
> > > seems
> > > > >>>> less
> > > > >>>>>>>> clean than before.
> > > > >>>>>>>>
> > > > >>>>>>>> Does each partition now have a metadata partition and a
> > separate
> > > > >>>> data
> > > > >>>>>>>> partition? If so, I am concerned that it essentially doubles
> > the
> > > > >>>> number
> > > > >>>>>>> of
> > > > >>>>>>>> partitions, which impacts the number of open file
> descriptors
> > > and
> > > > >>>> the
> > > > >>>>>>>> required IOPS, and so on. It also seems wasteful to have a
> > > > separate
> > > > >>>>>>>> partition just to store the metadata. It's as if we are
> > creating
> > > > an
> > > > >>>>>>>> internal topic with an unbounded number of partitions. Are
> the
> > > > >>>> metadata
> > > > >>>>>>> and
> > > > >>>>>>>> the data for the same partition always collocated on the
> same
> > > > >>>> broker?
> > > > >>>>>> If
> > > > >>>>>>>> so, how do we enforce that when replicas are reassigned?
> > > > >>>>>>>>
> > > > >>>>>>>> The number of RPCs in the produce path is significantly
> > higher.
> > > > For
> > > > >>>>>>>> example, if a produce request has 100 partitions, in a
> cluster
> > > > >>>> with 100
> > > > >>>>>>>> brokers, each produce request could generate 100 more RPC
> > > > requests.
> > > > >>>>>> This
> > > > >>>>>>>> will significantly increase the request rate.
> > > > >>>>>>>>
> > > > >>>>>>>> The same WAL file metadata is now duplicated into two
> places,
> > > > >>>> partition
> > > > >>>>>>>> leader and WAL File Manager. Which one is the source of
> truth,
> > > and
> > > > >>>> how
> > > > >>>>>> do
> > > > >>>>>>>> we maintain consistency between the two places?
> > > > >>>>>>>>
> > > > >>>>>>>> I am not sure how this design simplifies the implementation.
> > The
> > > > >>>>>> existing
> > > > >>>>>>>> producer/replication code can't be simply reused. Adjusting
> > both
> > > > >>>> the
> > > > >>>>>>> write
> > > > >>>>>>>> path in the leader and the replication path in the follower
> to
> > > > >>>>>> understand
> > > > >>>>>>>> batch-header only data is quite intrusive to the existing
> > > logic. I
> > > > >>>> am
> > > > >>>>>>> also
> > > > >>>>>>>> not sure how this enables seamless switching the topic modes
> > > > >>>> between
> > > > >>>>>>>> diskless and classic. Could you provide more details on
> those?
> > > > >>>>>>>>
> > > > >>>>>>>> Jun
> > > > >>>>>>>>
> > > > >>>>>>>> On Thu, Oct 2, 2025 at 5:08 AM Ivan Yurchenko <
> [email protected]
> > >
> > > > >>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Hi dear Kafka community,
> > > > >>>>>>>>>
> > > > >>>>>>>>> In the initial Diskless proposal, we proposed to have a
> > > separate
> > > > >>>>>>>>> component, batch/diskless coordinator, whose role would be
> to
> > > > >>>>>> centrally
> > > > >>>>>>>>> manage the batch and WAL file metadata for diskless topics.
> > > This
> > > > >>>>>>> component
> > > > >>>>>>>>> drew many reasonable comments from the community about how
> it
> > > > >>>> would
> > > > >>>>>>> support
> > > > >>>>>>>>> various Kafka features (transactions, queues) and its
> > > > >>>> scalability.
> > > > >>>>>>> While we
> > > > >>>>>>>>> believe we have good answers to all the expressed concerns,
> > we
> > > > >>>> took a
> > > > >>>>>>> step
> > > > >>>>>>>>> back and looked at the problem from a different
> perspective.
> > > > >>>>>>>>>
> > > > >>>>>>>>> We would like to propose an alternative Diskless design
> > > *without
> > > > >>>> a
> > > > >>>>>>>>> centralized coordinator*. We believe this approach has
> > > potential
> > > > >>>> and
> > > > >>>>>>>>> propose to discuss it as it may be more appealing to the
> > > > >>>> community.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Let us explain the idea. Most of the complications with the
> > > > >>>>>>>>> original Diskless approach come from one necessary architecture
> > > > >>>>>>>>> change: globalizing the local state of the partition leader in
> > > > >>>>>>>>> the batch coordinator. This causes deviations from the
> > > > >>>>>>>>> established workflows in various features like produce
> > > > >>>>>>>>> idempotence and transactions, queues, retention, etc. These
> > > > >>>>>>>>> deviations need to be carefully considered, designed, and later
> > > > >>>>>>>>> implemented and tested. In the new approach we want to avoid
> > > > >>>>>>>>> this by making partition leaders responsible for managing their
> > > > >>>>>>>>> partitions again, even in diskless topics.
> > > > >>>>>>>>>
> > > > >>>>>>>>> In classic Kafka topics, batch data and metadata are blended
> > > > >>>>>>>>> together in the same partition log. The crux of the Diskless
> > > > >>>>>>>>> idea is to decouple them and move the data to remote storage,
> > > > >>>>>>>>> while keeping the metadata somewhere else. Using the central
> > > > >>>>>>>>> batch coordinator for managing batch metadata is one way, but
> > > > >>>>>>>>> not the only one.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Let’s now think about managing metadata for each user partition
> > > > >>>>>>>>> independently. Generally partitions are independent and don’t
> > > > >>>>>>>>> share anything apart from the fact that their data are mixed in
> > > > >>>>>>>>> WAL files. If we figure out how to commit and later delete WAL
> > > > >>>>>>>>> files safely, we will achieve the necessary autonomy that allows
> > > > >>>>>>>>> us to get rid of the central batch coordinator. Instead, *each
> > > > >>>>>>>>> diskless user partition will be managed by its leader*, as in
> > > > >>>>>>>>> classic Kafka topics. Also as in classic topics, the leader uses
> > > > >>>>>>>>> the partition log as the way to persist batch metadata, i.e. the
> > > > >>>>>>>>> regular batch header + the information about how to find this
> > > > >>>>>>>>> batch on remote storage. In contrast to classic topics, the
> > > > >>>>>>>>> batch data itself is in remote storage.
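> > > > >>>>>>>>>
> > > > >>>>>>>>> To make this concrete, here is a minimal sketch (Java, with
> > > > >>>>>>>>> invented names; nothing here is prescribed by the KIP) of what
> > > > >>>>>>>>> such a per-batch metadata entry could carry:
> > > > >>>>>>>>>
> > > > >>>>>>>>>   import java.util.Objects;
> > > > >>>>>>>>>
> > > > >>>>>>>>>   // Illustrative only: the "batch coordinates" a diskless
> > > > >>>>>>>>>   // partition leader could persist in its log next to the
> > > > >>>>>>>>>   // regular batch header.
> > > > >>>>>>>>>   public record DisklessBatchCoordinates(
> > > > >>>>>>>>>           long baseOffset,      // first offset of the batch in this partition
> > > > >>>>>>>>>           int recordCount,      // number of records in the batch
> > > > >>>>>>>>>           String walObjectKey,  // shared WAL file on remote storage
> > > > >>>>>>>>>           long byteOffset,      // where the batch starts inside the WAL file
> > > > >>>>>>>>>           int sizeInBytes) {    // how many bytes to read from that position
> > > > >>>>>>>>>
> > > > >>>>>>>>>       public DisklessBatchCoordinates {
> > > > >>>>>>>>>           Objects.requireNonNull(walObjectKey, "walObjectKey");
> > > > >>>>>>>>>       }
> > > > >>>>>>>>>
> > > > >>>>>>>>>       public long lastOffset() {
> > > > >>>>>>>>>           return baseOffset + recordCount - 1;
> > > > >>>>>>>>>       }
> > > > >>>>>>>>>   }
> > > > >>>>>>>>>
> > > > >>>>>>>>> These three coordinates (file, offset, size) are all a leader
> > > > >>>>>>>>> needs to fetch its own bytes, so it never has to know which
> > > > >>>>>>>>> other partitions share the same WAL file.
> > > > >>>>>>>>>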
> > > > >>>>>>>>> For clarity, let’s compare the three designs:
> > > > >>>>>>>>> • Classic topics:
> > > > >>>>>>>>>  • Data and metadata are co-located in the partition log.
> > > > >>>>>>>>>  • The partition log content: [Batch header
> (metadata)|Batch
> > > > >>>> data].
> > > > >>>>>>>>>  • The partition log is replicated to the followers.
> > > > >>>>>>>>>  • The replicas and leader have local state built from
> > > > >>>> metadata.
> > > > >>>>>>>>> • Original Diskless:
> > > > >>>>>>>>>  • Metadata is in the batch coordinator, data is on remote
> > > > >>>> storage.
> > > > >>>>>>>>>  • The partition state is global in the batch coordinator.
> > > > >>>>>>>>> • New Diskless:
> > > > >>>>>>>>>  • Metadata is in the partition log, data is on remote
> > storage.
> > > > >>>>>>>>>  • Partition log content: [Batch header (metadata)|Batch
> > > > >>>>>> coordinates
> > > > >>>>>>> on
> > > > >>>>>>>>> remote storage].
> > > > >>>>>>>>>  • The partition log is replicated to the followers.
> > > > >>>>>>>>>  • The replicas and leader have local state built from
> > > > >>>> metadata.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Let’s consider the produce path. Here’s a reminder of the
> > > > >>>>>>>>> original Diskless design:
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> The new approach can be depicted as follows:
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> As you can see, the main difference is that now, instead of a
> > > > >>>>>>>>> single commit request to the batch coordinator, we send multiple
> > > > >>>>>>>>> parallel commit requests to the leaders of all partitions
> > > > >>>>>>>>> involved in the WAL file. Each of them will commit its batches
> > > > >>>>>>>>> independently, without coordinating with other leaders or any
> > > > >>>>>>>>> other components. Batch data is addressed by the WAL file name,
> > > > >>>>>>>>> the byte offset, and the size, so a partition needs to know
> > > > >>>>>>>>> nothing about other partitions in order to access its own data
> > > > >>>>>>>>> in shared WAL files.
> > > > >>>>>>>>>
> > > > >>>>>>>>> The number of partitions involved in a single WAL file may be
> > > > >>>>>>>>> quite large, e.g. a hundred. A hundred network requests to
> > > > >>>>>>>>> commit one WAL file would be very impractical. However, there
> > > > >>>>>>>>> are ways to reduce this number:
> > > > >>>>>>>>> 1. Partition leaders are located on brokers. Requests to leaders
> > > > >>>>>>>>> on one broker could be grouped together into a single physical
> > > > >>>>>>>>> network request (resembling the normal Produce request that may
> > > > >>>>>>>>> carry batches for many partitions inside); see the sketch after
> > > > >>>>>>>>> this list. This will cap the number of network requests at the
> > > > >>>>>>>>> number of brokers in the cluster.
> > > > >>>>>>>>> 2. If we craft the cluster metadata to make producers send their
> > > > >>>>>>>>> requests to the right brokers (with respect to AZs), we may
> > > > >>>>>>>>> achieve a higher concentration of logical commit requests in
> > > > >>>>>>>>> physical network requests, reducing the number of the latter
> > > > >>>>>>>>> even further, ideally to one.
> > > > >>>>>>>>>
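> > > > >>>>>>>>> As a rough illustration of the per-broker grouping in point 1
> > > > >>>>>>>>> (all types and names here are hypothetical):
> > > > >>>>>>>>>
> > > > >>>>>>>>>   import java.util.ArrayList;
> > > > >>>>>>>>>   import java.util.HashMap;
> > > > >>>>>>>>>   import java.util.List;
> > > > >>>>>>>>>   import java.util.Map;
> > > > >>>>>>>>>
> > > > >>>>>>>>>   // Hypothetical commit entry: which partition, its leader, and
> > > > >>>>>>>>>   // where its batches sit inside the shared WAL file.
> > > > >>>>>>>>>   record CommitEntry(String topic, int partition, int leaderBrokerId,
> > > > >>>>>>>>>                      long byteOffset, int sizeInBytes) { }
> > > > >>>>>>>>>
> > > > >>>>>>>>>   final class CommitRequestGrouper {
> > > > >>>>>>>>>       // One physical request per leader broker, regardless of how
> > > > >>>>>>>>>       // many partitions the WAL file contains.
> > > > >>>>>>>>>       static Map<Integer, List<CommitEntry>> groupByLeader(List<CommitEntry> entries) {
> > > > >>>>>>>>>           Map<Integer, List<CommitEntry>> perBroker = new HashMap<>();
> > > > >>>>>>>>>           for (CommitEntry entry : entries) {
> > > > >>>>>>>>>               perBroker.computeIfAbsent(entry.leaderBrokerId(), id -> new ArrayList<>())
> > > > >>>>>>>>>                        .add(entry);
> > > > >>>>>>>>>           }
> > > > >>>>>>>>>           return perBroker;
> > > > >>>>>>>>>       }
> > > > >>>>>>>>>   }
> > > > >>>>>>>>>
> > > > >>>>>>>>> Whatever the final request schema looks like, the grouping
> > > > >>>>>>>>> itself is just a map from leader broker to the commit entries it
> > > > >>>>>>>>> owns.
> > > > >>>>>>>>>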
> > > > >>>>>>>>> Obviously, out of multiple commit requests some may fail or time
> > > > >>>>>>>>> out for a variety of reasons. This is fine. Some producers will
> > > > >>>>>>>>> receive totally or partially failed responses to their Produce
> > > > >>>>>>>>> requests, similar to what they would receive when an append to a
> > > > >>>>>>>>> classic topic fails or times out. If a partition experiences
> > > > >>>>>>>>> problems, other partitions will not be affected (again, as in
> > > > >>>>>>>>> classic topics). Of course, the uncommitted data will be garbage
> > > > >>>>>>>>> in WAL files. But WAL files are short-lived (batches are
> > > > >>>>>>>>> constantly assembled into segments and offloaded to tiered
> > > > >>>>>>>>> storage), so this garbage will eventually be deleted.
> > > > >>>>>>>>>
> > > > >>>>>>>>> To safely delete WAL files, we now need to manage them
> > > > >>>>>>>>> centrally, as this is the only state and logic that spans
> > > > >>>>>>>>> multiple partitions. On the diagram, you can see another commit
> > > > >>>>>>>>> request called “Commit file (best effort)” going to the WAL File
> > > > >>>>>>>>> Manager. This manager will be responsible for the following:
> > > > >>>>>>>>> 1. Collecting (by requests from brokers) and persisting
> > > > >>>>>>>>> information about committed WAL files.
> > > > >>>>>>>>> 2. To handle potential failures in file information delivery,
> > > > >>>>>>>>> periodically doing a prefix scan on the remote storage to find
> > > > >>>>>>>>> and register unknown files. The period of this scan will be
> > > > >>>>>>>>> configurable and ideally should be quite long.
> > > > >>>>>>>>> 3. Checking with the relevant partition leaders (after a grace
> > > > >>>>>>>>> period) whether they still have batches in a particular file.
> > > > >>>>>>>>> 4. Physically deleting files when they are no longer referred to
> > > > >>>>>>>>> by any partition (see the sketch below).
> > > > >>>>>>>>>
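> > > > >>>>>>>>> A minimal sketch of the deletion check in point 4, assuming
> > > > >>>>>>>>> made-up interfaces for the leader RPC and the file registry:
> > > > >>>>>>>>>
> > > > >>>>>>>>>   import java.time.Duration;
> > > > >>>>>>>>>   import java.time.Instant;
> > > > >>>>>>>>>   import java.util.Set;
> > > > >>>>>>>>>
> > > > >>>>>>>>>   // Hypothetical RPC: asks a broker whether any of its partition
> > > > >>>>>>>>>   // leaders still references batches inside the given WAL file.
> > > > >>>>>>>>>   interface LeaderClient {
> > > > >>>>>>>>>       boolean stillReferences(int brokerId, String walObjectKey);
> > > > >>>>>>>>>   }
> > > > >>>>>>>>>
> > > > >>>>>>>>>   final class WalFileDeletionCheck {
> > > > >>>>>>>>>       private final Duration gracePeriod;
> > > > >>>>>>>>>       private final LeaderClient leaders;
> > > > >>>>>>>>>
> > > > >>>>>>>>>       WalFileDeletionCheck(Duration gracePeriod, LeaderClient leaders) {
> > > > >>>>>>>>>           this.gracePeriod = gracePeriod;
> > > > >>>>>>>>>           this.leaders = leaders;
> > > > >>>>>>>>>       }
> > > > >>>>>>>>>
> > > > >>>>>>>>>       // True only when the grace period has passed and no leader
> > > > >>>>>>>>>       // refers to the file any more.
> > > > >>>>>>>>>       boolean safeToDelete(String walObjectKey, Instant committedAt,
> > > > >>>>>>>>>                            Set<Integer> leaderBrokerIds) {
> > > > >>>>>>>>>           if (Instant.now().isBefore(committedAt.plus(gracePeriod))) {
> > > > >>>>>>>>>               return false; // still inside the grace period
> > > > >>>>>>>>>           }
> > > > >>>>>>>>>           for (int brokerId : leaderBrokerIds) {
> > > > >>>>>>>>>               if (leaders.stillReferences(brokerId, walObjectKey)) {
> > > > >>>>>>>>>                   return false; // some partition still points into this file
> > > > >>>>>>>>>               }
> > > > >>>>>>>>>           }
> > > > >>>>>>>>>           return true; // point 4: the file can be physically deleted
> > > > >>>>>>>>>       }
> > > > >>>>>>>>>   }
> > > > >>>>>>>>>
> > > > >>>>>>>>> The grace period protects files whose commit requests are still
> > > > >>>>>>>>> in flight or whose information has not reached the manager yet.
> > > > >>>>>>>>>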
> > > > >>>>>>>>> This new design offers the following advantages:
> > > > >>>>>>>>> 1. It simplifies the implementation of many Kafka features
> > such
> > > > >>>> as
> > > > >>>>>>>>> idempotence, transactions, queues, tiered storage,
> retention.
> > > > >>>> Now we
> > > > >>>>>>> don’t
> > > > >>>>>>>>> need to abstract away and reuse the code from partition
> > leaders
> > > > >>>> in
> > > > >>>>>> the
> > > > >>>>>>>>> batch coordinator. Instead, we will literally use the same
> > code
> > > > >>>> paths
> > > > >>>>>>> in
> > > > >>>>>>>>> leaders, with little adaptation. Workflows from classic
> > topics
> > > > >>>> mostly
> > > > >>>>>>>>> remain unchanged.
> > > > >>>>>>>>> For example, it seems that
> > > > >>>>>>>>> ReplicaManager.maybeSendPartitionsToTransactionCoordinator
> > and
> > > > >>>>>>>>> KafkaApis.handleWriteTxnMarkersRequest used for transaction
> > > > >>>> support
> > > > >>>>>> on
> > > > >>>>>>> the
> > > > >>>>>>>>> partition leader side could be used for diskless topics
> with
> > > > >>>> little
> > > > >>>>>>>>> adaptation. ProducerStateManager, needed for both
> idempotent
> > > > >>>> produce
> > > > >>>>>>> and
> > > > >>>>>>>>> transactions, would be reused.
> > > > >>>>>>>>> Another example is share groups support, where the share
> > > > >>>> partition
> > > > >>>>>>> leader,
> > > > >>>>>>>>> being co-located with the partition leader, would execute
> the
> > > > >>>> same
> > > > >>>>>>> logic
> > > > >>>>>>>>> for both diskless and classic topics.
> > > > >>>>>>>>> 2. It returns to the familiar partition-based scaling model,
> > > > >>>>>>>>> where partitions are independent.
> > > > >>>>>>>>> 3. It makes the operational and failure patterns closer to the
> > > > >>>>>>>>> familiar ones from classic topics.
> > > > >>>>>>>>> 4. It opens a straightforward path to seamlessly switching topic
> > > > >>>>>>>>> modes between diskless and classic.
> > > > >>>>>>>>>
> > > > >>>>>>>>> The rest of the things remain unchanged compared to the
> > > previous
> > > > >>>>>>> Diskless
> > > > >>>>>>>>> design (after all previous discussions). Such things as
> local
> > > > >>>> segment
> > > > >>>>>>>>> materialization by replicas, the consume path, tiered
> storage
> > > > >>>>>>> integration,
> > > > >>>>>>>>> etc.
> > > > >>>>>>>>>
> > > > >>>>>>>>> If the community finds this design more suitable, we will
> > > update
> > > > >>>> the
> > > > >>>>>>>>> KIP(s) accordingly and continue working on it. Please let
> us
> > > know
> > > > >>>>>> what
> > > > >>>>>>> you
> > > > >>>>>>>>> think.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Best regards,
> > > > >>>>>>>>> Ivan and Diskless team
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Mon, Sep 29, 2025, at 15:06, Ivan Yurchenko wrote:
> > > > >>>>>>>>>> Hi Justine,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Yes, you're right. We need to track the aborted transactions in
> > > > >>>>>>>>>> the diskless coordinator for as long as the corresponding
> > > > >>>>>>>>>> offsets are there. With the tiered storage unification Greg
> > > > >>>>>>>>>> mentioned earlier, this will be a finite time even for infinite
> > > > >>>>>>>>>> data retention.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Best,
> > > > >>>>>>>>>> Ivan
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Wed, Sep 17, 2025, at 19:41, Justine Olshan wrote:
> > > > >>>>>>>>>>> Hey Ivan,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks for the response. I think most of what you said
> made
> > > > >>>>>> sense,
> > > > >>>>>>> but
> > > > >>>>>>>>> I
> > > > >>>>>>>>>>> did have some questions about this part:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> As we understand this, the partition leader in classic topics
> > > > >>>>>>>>>>>> forgets about a transaction once it’s replicated (the HWM
> > > > >>>>>>>>>>>> passes it). The transaction coordinator acts like the main
> > > > >>>>>>>>>>>> guardian, allowing partition leaders to do this safely. Please
> > > > >>>>>>>>>>>> correct me if this is wrong. We are thinking about relying on
> > > > >>>>>>>>>>>> this with the batch coordinator and deleting the information
> > > > >>>>>>>>>>>> about a transaction once it’s finished (as there’s no
> > > > >>>>>>>>>>>> replication and the HWM advances immediately).
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> I didn't quite understand this. In classic topics, we
> have
> > > > >>>> maps
> > > > >>>>>> for
> > > > >>>>>>>>> ongoing
> > > > >>>>>>>>>>> transactions which remove state when the transaction is
> > > > >>>> completed
> > > > >>>>>>> and
> > > > >>>>>>>>> an
> > > > >>>>>>>>>>> aborted transactions index which is retained for much
> > longer.
> > > > >>>>>> Once
> > > > >>>>>>> the
> > > > >>>>>>>>>>> transaction is completed, the coordinator is no longer
> > > > >>>> involved
> > > > >>>>>> in
> > > > >>>>>>>>>>> maintaining this partition side state, and it is subject
> to
> > > > >>>>>>> compaction
> > > > >>>>>>>>> etc.
> > > > >>>>>>>>>>> Looking back at the outline provided above, I didn't see
> > much
> > > > >>>>>>> about the
> > > > >>>>>>>>>>> fetch path, so maybe that could be expanded a bit
> further.
> > I
> > > > >>>> saw
> > > > >>>>>>> the
> > > > >>>>>>>>>>> following in a response:
> > > > >>>>>>>>>>>> When the broker constructs a fully valid local segment,
> > > > >>>> all the
> > > > >>>>>>>>> necessary
> > > > >>>>>>>>>>> control batches will be inserted and indices, including
> the
> > > > >>>>>>> transaction
> > > > >>>>>>>>>>> index will be built to serve FetchRequests exactly as
> they
> > > > >>>> are
> > > > >>>>>>> today.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Based on this, it seems like we need to retain the
> > > > >>>> information
> > > > >>>>>>> about
> > > > >>>>>>>>>>> aborted txns for longer.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>> Justine
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Mon, Sep 15, 2025 at 9:43 AM Ivan Yurchenko <
> > > > >>>> [email protected]>
> > > > >>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Hi Justine and all,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Thank you for your questions!
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> JO 1. >Since a transaction could be uniquely identified
> > > > >>>> with
> > > > >>>>>>>>> producer ID
> > > > >>>>>>>>>>>>> and epoch, the positive result of this check could be
> > > > >>>> cached
> > > > >>>>>>>>> locally
> > > > >>>>>>>>>>>>> Are we saying that only new transaction version 2
> > > > >>>>>> transactions
> > > > >>>>>>> can
> > > > >>>>>>>>> be
> > > > >>>>>>>>>>>> used
> > > > >>>>>>>>>>>>> here? If not, we can't uniquely identify transactions
> > > > >>>> with
> > > > >>>>>>>>> producer id +
> > > > >>>>>>>>>>>>> epoch
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> You’re right that we (probably unintentionally) focused
> > > > >>>> only on
> > > > >>>>>>>>> version 2.
> > > > >>>>>>>>>>>> We can either limit the support to version 2 or consider
> > > > >>>> using
> > > > >>>>>>> some
> > > > >>>>>>>>>>>> surrogates to support version 1.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> JO 2. >The batch coordinator does the final transactional
> > > > >>>>>>>>>>>>> checks of the batches. This procedure would output the same
> > > > >>>>>>>>>>>>> errors as the partition leader in classic topics would.
> > > > >>>>>>>>>>>>> Can you expand on what these checks are? Would you be
> > > > >>>>>> checking
> > > > >>>>>>> if
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>>> transaction was still ongoing for example?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Yes: the producer epoch, whether the transaction is ongoing,
> > > > >>>>>>>>>>>> and of course the normal idempotence checks. These are what
> > > > >>>>>>>>>>>> the partition leader in classic topics does before appending a
> > > > >>>>>>>>>>>> batch to the local log (e.g. in
> > > > >>>>>>>>>>>> UnifiedLog.maybeStartTransactionVerification and
> > > > >>>>>>>>>>>> UnifiedLog.analyzeAndValidateProducerState). In Diskless, we
> > > > >>>>>>>>>>>> unfortunately cannot do these checks before appending the data
> > > > >>>>>>>>>>>> to the WAL segment and uploading it, but we can "tombstone"
> > > > >>>>>>>>>>>> these batches in the batch coordinator during the final
> > > > >>>>>>>>>>>> commit.
> > > > >>>>>>>>>>>>
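> > > > >>>>>>>>>>>> For illustration only, a very simplified version of such a
> > > > >>>>>>>>>>>> commit-time check could look like this (the classes and fields
> > > > >>>>>>>>>>>> are invented, not the actual Kafka ones):
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>   // Invented per-producer state kept by the batch coordinator
> > > > >>>>>>>>>>>>   // for one diskless partition.
> > > > >>>>>>>>>>>>   final class ProducerEntry {
> > > > >>>>>>>>>>>>       short epoch;
> > > > >>>>>>>>>>>>       int lastSequence;
> > > > >>>>>>>>>>>>   }
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>   final class TxnBatchValidator {
> > > > >>>>>>>>>>>>       // Returns true if the batch may be committed; false means it
> > > > >>>>>>>>>>>>       // is "tombstoned" and its bytes stay as garbage in the WAL file.
> > > > >>>>>>>>>>>>       boolean validate(ProducerEntry state, short batchEpoch,
> > > > >>>>>>>>>>>>                        int baseSequence, boolean txnOngoing) {
> > > > >>>>>>>>>>>>           if (!txnOngoing) {
> > > > >>>>>>>>>>>>               return false; // no open transaction for this producer
> > > > >>>>>>>>>>>>           }
> > > > >>>>>>>>>>>>           if (batchEpoch < state.epoch) {
> > > > >>>>>>>>>>>>               return false; // fenced producer epoch
> > > > >>>>>>>>>>>>           }
> > > > >>>>>>>>>>>>           if (batchEpoch == state.epoch && baseSequence != state.lastSequence + 1) {
> > > > >>>>>>>>>>>>               return false; // out-of-order sequence (idempotence check)
> > > > >>>>>>>>>>>>           }
> > > > >>>>>>>>>>>>           // Simplified: real code would track the last sequence of the
> > > > >>>>>>>>>>>>           // whole batch and handle epoch bumps more carefully.
> > > > >>>>>>>>>>>>           state.epoch = batchEpoch;
> > > > >>>>>>>>>>>>           state.lastSequence = baseSequence;
> > > > >>>>>>>>>>>>           return true;
> > > > >>>>>>>>>>>>       }
> > > > >>>>>>>>>>>>   }
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> A batch that fails the check is only marked as rejected in the
> > > > >>>>>>>>>>>> coordinator metadata; the uploaded bytes simply remain as
> > > > >>>>>>>>>>>> garbage until the WAL file is deleted.
> > > > >>>>>>>>>>>>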
> > > > >>>>>>>>>>>>> Is there state about ongoing
> > > > >>>>>>>>>>>>> transactions in the batch coordinator? I see some other
> > > > >>>> state
> > > > >>>>>>>>> mentioned
> > > > >>>>>>>>>>>> in
> > > > >>>>>>>>>>>>> the End transaction section, but it's not super clear
> > > > >>>> what
> > > > >>>>>>> state is
> > > > >>>>>>>>>>>> stored
> > > > >>>>>>>>>>>>> and when it is stored.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Right, this should have been more explicit. As the
> > > > >>>> partition
> > > > >>>>>>> leader
> > > > >>>>>>>>> tracks
> > > > >>>>>>>>>>>> ongoing transactions for classic topics, the batch
> > > > >>>> coordinator
> > > > >>>>>>> has
> > > > >>>>>>>>> to as
> > > > >>>>>>>>>>>> well. So when a transaction starts and ends, the
> > > > >>>> transaction
> > > > >>>>>>>>> coordinator
> > > > >>>>>>>>>>>> must inform the batch coordinator about this.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> JO 3. I didn't see anything about maintaining LSO --
> > > > >>>> perhaps
> > > > >>>>>>> that
> > > > >>>>>>>>> would
> > > > >>>>>>>>>>>> be
> > > > >>>>>>>>>>>>> stored in the batch coordinator?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Yes. This could be deduced from the committed batches
> and
> > > > >>>> other
> > > > >>>>>>>>>>>> information, but for the sake of performance we’d better
> > > > >>>> store
> > > > >>>>>> it
> > > > >>>>>>>>>>>> explicitly.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> JO 4. Are there any thoughts about how long
> transactional
> > > > >>>>>>> state is
> > > > >>>>>>>>>>>>> maintained in the batch coordinator and how it will be
> > > > >>>>>> cleaned
> > > > >>>>>>> up?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> As we understand this, the partition leader in classic topics
> > > > >>>>>>>>>>>> forgets about a transaction once it’s replicated (the HWM
> > > > >>>>>>>>>>>> passes it). The transaction coordinator acts like the main
> > > > >>>>>>>>>>>> guardian, allowing partition leaders to do this safely. Please
> > > > >>>>>>>>>>>> correct me if this is wrong. We are thinking about relying on
> > > > >>>>>>>>>>>> this with the batch coordinator and deleting the information
> > > > >>>>>>>>>>>> about a transaction once it’s finished (as there’s no
> > > > >>>>>>>>>>>> replication and the HWM advances immediately).
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>> Ivan
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Tue, Sep 9, 2025, at 00:38, Justine Olshan wrote:
> > > > >>>>>>>>>>>>> Hey folks,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Excited to see some updates related to transactions!
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I had a few questions.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> JO 1. >Since a transaction could be uniquely identified
> > > > >>>> with
> > > > >>>>>>>>> producer ID
> > > > >>>>>>>>>>>>> and epoch, the positive result of this check could be
> > > > >>>> cached
> > > > >>>>>>>>> locally
> > > > >>>>>>>>>>>>> Are we saying that only new transaction version 2
> > > > >>>>>> transactions
> > > > >>>>>>> can
> > > > >>>>>>>>> be
> > > > >>>>>>>>>>>> used
> > > > >>>>>>>>>>>>> here? If not, we can't uniquely identify transactions
> > > > >>>> with
> > > > >>>>>>>>> producer id +
> > > > >>>>>>>>>>>>> epoch
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> JO 2. >The batch coordinator does the final transactional
> > > > >>>>>>>>>>>>> checks of the batches. This procedure would output the same
> > > > >>>>>>>>>>>>> errors as the partition leader in classic topics would.
> > > > >>>>>>>>>>>>> Can you expand on what these checks are? Would you be
> > > > >>>>>> checking
> > > > >>>>>>> if
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>>> transaction was still ongoing for example? Is there
> state
> > > > >>>>>> about
> > > > >>>>>>>>> ongoing
> > > > >>>>>>>>>>>>> transactions in the batch coordinator? I see some other
> > > > >>>> state
> > > > >>>>>>>>> mentioned
> > > > >>>>>>>>>>>> in
> > > > >>>>>>>>>>>>> the End transaction section, but it's not super clear
> > > > >>>> what
> > > > >>>>>>> state is
> > > > >>>>>>>>>>>> stored
> > > > >>>>>>>>>>>>> and when it is stored.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> JO 3. I didn't see anything about maintaining LSO --
> > > > >>>> perhaps
> > > > >>>>>>> that
> > > > >>>>>>>>> would
> > > > >>>>>>>>>>>> be
> > > > >>>>>>>>>>>>> stored in the batch coordinator?
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> JO 4. Are there any thoughts about how long
> transactional
> > > > >>>>>>> state is
> > > > >>>>>>>>>>>>> maintained in the batch coordinator and how it will be
> > > > >>>>>> cleaned
> > > > >>>>>>> up?
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Mon, Sep 8, 2025 at 10:38 AM Jun Rao
> > > > >>>>>>> <[email protected]>
> > > > >>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Hi, Greg and Ivan,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Thanks for the update. A few comments.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> JR 10. "Consumer fetches are now served from local
> > > > >>>>>> segments,
> > > > >>>>>>>>> making
> > > > >>>>>>>>>>>> use of
> > > > >>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>> indexes, page cache, request purgatory, and zero-copy
> > > > >>>>>>>>> functionality
> > > > >>>>>>>>>>>> already
> > > > >>>>>>>>>>>>>> built into classic topics."
> > > > >>>>>>>>>>>>>> JR 10.1 Does the broker build the producer state for
> > > > >>>> each
> > > > >>>>>>>>> partition in
> > > > >>>>>>>>>>>>>> diskless topics?
> > > > >>>>>>>>>>>>>> JR 10.2 For transactional data, the consumer fetches
> > > > >>>> need
> > > > >>>>>> to
> > > > >>>>>>> know
> > > > >>>>>>>>>>>> aborted
> > > > >>>>>>>>>>>>>> records. How is that achieved?
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> JR 11. "The batch coordinator saves that the
> > > > >>>> transaction is
> > > > >>>>>>>>> finished
> > > > >>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>> also inserts the control batches in the corresponding
> > > > >>>> logs
> > > > >>>>>>> of the
> > > > >>>>>>>>>>>> involved
> > > > >>>>>>>>>>>>>> Diskless topics. This happens only on the metadata
> > > > >>>> level,
> > > > >>>>>> no
> > > > >>>>>>>>> actual
> > > > >>>>>>>>>>>> control
> > > > >>>>>>>>>>>>>> batches are written to any file. "
> > > > >>>>>>>>>>>>>> A fetch response could include multiple transactional
> > > > >>>>>>> batches.
> > > > >>>>>>>>> How
> > > > >>>>>>>>>>>> does the
> > > > >>>>>>>>>>>>>> broker obtain the information about the ending control
> > > > >>>>>> batch
> > > > >>>>>>> for
> > > > >>>>>>>>> each
> > > > >>>>>>>>>>>>>> batch? Does that mean that a fetch response needs to
> be
> > > > >>>>>>> built by
> > > > >>>>>>>>>>>>>> stitching record batches and generated control batches
> > > > >>>>>>> together?
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> JR 12. Queues: Is there still a share partition leader
> > > > >>>> that
> > > > >>>>>>> all
> > > > >>>>>>>>>>>> consumers
> > > > >>>>>>>>>>>>>> are routed to?
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> JR 13. "Should the KIPs be modified to include this, or is
> > > > >>>>>>>>>>>>>> it too implementation-focused?" It would be useful to
> > > > >>>>>>>>>>>>>> include enough details to understand correctness and
> > > > >>>>>>>>>>>>>> performance impact.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> HC5. Henry has a valid point. Requests from a given producer
> > > > >>>>>>>>>>>>>> contain a sequence number, which is ordered. If a producer
> > > > >>>>>>>>>>>>>> sends every Produce request to an arbitrary broker, those
> > > > >>>>>>>>>>>>>> requests could reach the batch coordinator in a different
> > > > >>>>>>>>>>>>>> order and lead to rejection of the produce requests.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Jun
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On Thu, Sep 4, 2025 at 12:00 AM Ivan Yurchenko <
> > > > >>>>>>> [email protected]>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Hi all,
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> We have also thought in a bit more detail about transactions
> > > > >>>>>>>>>>>>>>> and queues; here's the plan.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> *Transactions*
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> The support for transactions in *classic topics* is
> > > > >>>> based
> > > > >>>>>>> on
> > > > >>>>>>>>> precise
> > > > >>>>>>>>>>>>>>> interactions between three actors: clients (mostly
> > > > >>>>>>> producers,
> > > > >>>>>>>>> but
> > > > >>>>>>>>>>>> also
> > > > >>>>>>>>>>>>>>> consumers), brokers (ReplicaManager and other
> > > > >>>> classes),
> > > > >>>>>> and
> > > > >>>>>>>>>>>> transaction
> > > > >>>>>>>>>>>>>>> coordinators. Brokers also run partition leaders with
> > > > >>>>>> their
> > > > >>>>>>>>> local
> > > > >>>>>>>>>>>> state
> > > > >>>>>>>>>>>>>>> (ProducerStateManager and others).
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> The high level (some details skipped) workflow is the
> > > > >>>>>>>>> following.
> > > > >>>>>>>>>>>> When a
> > > > >>>>>>>>>>>>>>> transactional Produce request is received by the
> > > > >>>> broker:
> > > > >>>>>>>>>>>>>>> 1. For each partition, the partition leader checks
> > > > >>>> if a
> > > > >>>>>>>>> non-empty
> > > > >>>>>>>>>>>>>>> transaction is running for this partition. This is
> > > > >>>> done
> > > > >>>>>>> using
> > > > >>>>>>>>> its
> > > > >>>>>>>>>>>> local
> > > > >>>>>>>>>>>>>>> state derived from the log metadata
> > > > >>>>>> (ProducerStateManager,
> > > > >>>>>>>>>>>>>>> VerificationStateEntry, VerificationGuard).
> > > > >>>>>>>>>>>>>>> 2. The transaction coordinator is informed about all the
> > > > >>>>>>>>>>>>>>> partitions that aren't yet part of the transaction, so that
> > > > >>>>>>>>>>>>>>> it can include them.
> > > > >>>>>>>>>>>>>>> 3. The partition leaders do additional transactional
> > > > >>>>>>> checks.
> > > > >>>>>>>>>>>>>>> 4. The partition leaders append the transactional
> > > > >>>> data to
> > > > >>>>>>>>> their logs
> > > > >>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>> update some of their state (for example, log the fact
> > > > >>>>>> that
> > > > >>>>>>> the
> > > > >>>>>>>>>>>>>> transaction
> > > > >>>>>>>>>>>>>>> is running for the partition and its first offset).
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> When the transaction is committed or aborted:
> > > > >>>>>>>>>>>>>>> 1. The producer contacts the transaction coordinator
> > > > >>>>>>> directly
> > > > >>>>>>>>> with
> > > > >>>>>>>>>>>>>>> EndTxnRequest.
> > > > >>>>>>>>>>>>>>> 2. The transaction coordinator writes PREPARE_COMMIT
> > > > >>>> or
> > > > >>>>>>>>>>>> PREPARE_ABORT to
> > > > >>>>>>>>>>>>>>> its log and responds to the producer.
> > > > >>>>>>>>>>>>>>> 3. The transaction coordinator sends
> > > > >>>>>>> WriteTxnMarkersRequest to
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>>>> leaders
> > > > >>>>>>>>>>>>>>> of the involved partitions.
> > > > >>>>>>>>>>>>>>> 4. The partition leaders write the transaction
> > > > >>>> markers to
> > > > >>>>>>>>> their logs
> > > > >>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>>> respond to the coordinator.
> > > > >>>>>>>>>>>>>>> 5. The coordinator writes the final transaction state
> > > > >>>>>>>>>>>> COMPLETE_COMMIT or
> > > > >>>>>>>>>>>>>>> COMPLETE_ABORT.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> In classic topics, partitions have leaders and lots
> > > > >>>> of
> > > > >>>>>>>>> important
> > > > >>>>>>>>>>>> state
> > > > >>>>>>>>>>>>>>> necessary for supporting this workflow is local. The
> > > > >>>> main
> > > > >>>>>>>>> challenge
> > > > >>>>>>>>>>>> in
> > > > >>>>>>>>>>>>>>> mapping this to Diskless comes from the fact there
> > > > >>>> are no
> > > > >>>>>>>>> partition
> > > > >>>>>>>>>>>>>>> leaders, so the corresponding pieces of state need
> > > > >>>> to be
> > > > >>>>>>>>> globalized
> > > > >>>>>>>>>>>> in
> > > > >>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>> batch coordinator. We are already doing this to
> > > > >>>> support
> > > > >>>>>>>>> idempotent
> > > > >>>>>>>>>>>>>> produce.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> The high level workflow for *diskless topics* would
> > > > >>>> look
> > > > >>>>>>> very
> > > > >>>>>>>>>>>> similar:
> > > > >>>>>>>>>>>>>>> 1. For each partition, the broker checks if a
> > > > >>>> non-empty
> > > > >>>>>>>>> transaction
> > > > >>>>>>>>>>>> is
> > > > >>>>>>>>>>>>>>> running for this partition. In contrast to classic
> > > > >>>>>> topics,
> > > > >>>>>>>>> this is
> > > > >>>>>>>>>>>>>> checked
> > > > >>>>>>>>>>>>>>> against the batch coordinator with a single RPC.
> > > > >>>> Since a
> > > > >>>>>>>>> transaction
> > > > >>>>>>>>>>>>>> could
> > > > >>>>>>>>>>>>>>> be uniquely identified with producer ID and epoch,
> > > > >>>> the
> > > > >>>>>>> positive
> > > > >>>>>>>>>>>> result of
> > > > >>>>>>>>>>>>>>> this check could be cached locally (for double the
> > > > >>>>>>>>>>>>>>> configured transaction duration, for example; see the
> > > > >>>>>>>>>>>>>>> caching sketch after this list).
> > > > >>>>>>>>>>>>>>> 2. The same: the transaction coordinator is informed about
> > > > >>>>>>>>>>>>>>> all the partitions that aren't yet part of the transaction,
> > > > >>>>>>>>>>>>>>> so that it can include them.
> > > > >>>>>>>>>>>>>>> 3. No transactional checks are done on the broker
> > > > >>>> side.
> > > > >>>>>>>>>>>>>>> 4. The broker appends the transactional data to the
> > > > >>>>>> current
> > > > >>>>>>>>> shared
> > > > >>>>>>>>>>>> WAL
> > > > >>>>>>>>>>>>>>> segment. It doesn’t update any transaction-related
> > > > >>>> state
> > > > >>>>>>> for
> > > > >>>>>>>>> Diskless
> > > > >>>>>>>>>>>>>>> topics, because it doesn’t have any.
> > > > >>>>>>>>>>>>>>> 5. The WAL segment is committed to the batch
> > > > >>>> coordinator
> > > > >>>>>>> like
> > > > >>>>>>>>> in the
> > > > >>>>>>>>>>>>>>> normal produce flow.
> > > > >>>>>>>>>>>>>>> 6. The batch coordinator does the final transactional checks
> > > > >>>>>>>>>>>>>>> of the batches. This procedure would output the same errors
> > > > >>>>>>>>>>>>>>> as the partition leader in classic topics would, i.e. some
> > > > >>>>>>>>>>>>>>> batches could be rejected. This means there will potentially
> > > > >>>>>>>>>>>>>>> be garbage in the WAL segment file in case of transactional
> > > > >>>>>>>>>>>>>>> errors. This is preferable to doing more network round
> > > > >>>>>>>>>>>>>>> trips, especially considering the WAL segments will be
> > > > >>>>>>>>>>>>>>> relatively short-lived (see Greg's update above).
> > > > >>>>>>>>>>>>>>>
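> > > > >>>>>>>>>>>>>>> To illustrate the caching from step 1, a sketch only (the
> > > > >>>>>>>>>>>>>>> key, the TTL policy, and the config it is derived from are
> > > > >>>>>>>>>>>>>>> assumptions):
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>   import java.util.Map;
> > > > >>>>>>>>>>>>>>>   import java.util.concurrent.ConcurrentHashMap;
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>   // Caches positive "transaction is ongoing" answers from the
> > > > >>>>>>>>>>>>>>>   // batch coordinator on the broker side.
> > > > >>>>>>>>>>>>>>>   final class OngoingTxnCache {
> > > > >>>>>>>>>>>>>>>       private record Key(long producerId, short producerEpoch) { }
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>       private final Map<Key, Long> expiryMs = new ConcurrentHashMap<>();
> > > > >>>>>>>>>>>>>>>       private final long ttlMs; // e.g. 2 * transaction.max.timeout.ms
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>       OngoingTxnCache(long ttlMs) {
> > > > >>>>>>>>>>>>>>>           this.ttlMs = ttlMs;
> > > > >>>>>>>>>>>>>>>       }
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>       void recordPositiveCheck(long producerId, short producerEpoch) {
> > > > >>>>>>>>>>>>>>>           expiryMs.put(new Key(producerId, producerEpoch),
> > > > >>>>>>>>>>>>>>>                        System.currentTimeMillis() + ttlMs);
> > > > >>>>>>>>>>>>>>>       }
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>       boolean isKnownOngoing(long producerId, short producerEpoch) {
> > > > >>>>>>>>>>>>>>>           Long deadline = expiryMs.get(new Key(producerId, producerEpoch));
> > > > >>>>>>>>>>>>>>>           return deadline != null && deadline > System.currentTimeMillis();
> > > > >>>>>>>>>>>>>>>       }
> > > > >>>>>>>>>>>>>>>   }
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Only positive answers would be cached, so a producer that
> > > > >>>>>>>>>>>>>>> starts a new transaction is always re-checked against the
> > > > >>>>>>>>>>>>>>> batch coordinator.
> > > > >>>>>>>>>>>>>>>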
> > > > >>>>>>>>>>>>>>> When the transaction is committed or aborted:
> > > > >>>>>>>>>>>>>>> 1. The producer contacts the transaction coordinator
> > > > >>>>>>> directly
> > > > >>>>>>>>> with
> > > > >>>>>>>>>>>>>>> EndTxnRequest.
> > > > >>>>>>>>>>>>>>> 2. The transaction coordinator writes PREPARE_COMMIT
> > > > >>>> or
> > > > >>>>>>>>>>>> PREPARE_ABORT to
> > > > >>>>>>>>>>>>>>> its log and responds to the producer.
> > > > >>>>>>>>>>>>>>> 3. *[NEW]* The transaction coordinator informs the batch
> > > > >>>>>>>>>>>>>>> coordinator that the transaction is finished (a sketch of
> > > > >>>>>>>>>>>>>>> such an interface follows this list).
> > > > >>>>>>>>>>>>>>> 4. *[NEW]* The batch coordinator saves that the
> > > > >>>>>>> transaction is
> > > > >>>>>>>>>>>> finished
> > > > >>>>>>>>>>>>>>> and also inserts the control batches in the
> > > > >>>> corresponding
> > > > >>>>>>> logs
> > > > >>>>>>>>> of the
> > > > >>>>>>>>>>>>>>> involved Diskless topics. This happens only on the
> > > > >>>>>> metadata
> > > > >>>>>>>>> level, no
> > > > >>>>>>>>>>>>>>> actual control batches are written to any file. They
> > > > >>>> will
> > > > >>>>>>> be
> > > > >>>>>>>>>>>> dynamically
> > > > >>>>>>>>>>>>>>> created on Fetch and other read operations. We could
> > > > >>>>>>>>> technically
> > > > >>>>>>>>>>>> write
> > > > >>>>>>>>>>>>>>> these control batches for real, but this would mean
> > > > >>>> extra
> > > > >>>>>>>>> produce
> > > > >>>>>>>>>>>>>> latency,
> > > > >>>>>>>>>>>>>>> so it's better just to mark them in the batch
> > > > >>>> coordinator
> > > > >>>>>>> and
> > > > >>>>>>>>> save
> > > > >>>>>>>>>>>> these
> > > > >>>>>>>>>>>>>>> milliseconds.
> > > > >>>>>>>>>>>>>>> 5. The transaction coordinator sends WriteTxnMarkersRequest
> > > > >>>>>>>>>>>>>>> to the leaders of the involved partitions – now only to
> > > > >>>>>>>>>>>>>>> classic topics.
> > > > >>>>>>>>>>>>>>> 6. The partition leaders of classic topics write the
> > > > >>>>>>>>> transaction
> > > > >>>>>>>>>>>> markers
> > > > >>>>>>>>>>>>>>> to their logs and respond to the coordinator.
> > > > >>>>>>>>>>>>>>> 7. The coordinator writes the final transaction state
> > > > >>>>>>>>>>>> COMPLETE_COMMIT or
> > > > >>>>>>>>>>>>>>> COMPLETE_ABORT.
> > > > >>>>>>>>>>>>>>>
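> > > > >>>>>>>>>>>>>>> A sketch of the [NEW] hand-off in steps 3 and 4 could be a
> > > > >>>>>>>>>>>>>>> small interface like the following (hypothetical, not an
> > > > >>>>>>>>>>>>>>> existing Kafka API):
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>   import java.util.Collection;
> > > > >>>>>>>>>>>>>>>   import org.apache.kafka.common.TopicPartition;
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>   // Hypothetical: called by the transaction coordinator after it
> > > > >>>>>>>>>>>>>>>   // has written PREPARE_COMMIT or PREPARE_ABORT. The batch
> > > > >>>>>>>>>>>>>>>   // coordinator records the outcome and later synthesizes the
> > > > >>>>>>>>>>>>>>>   // control batches when serving reads.
> > > > >>>>>>>>>>>>>>>   public interface DisklessTxnCompletionListener {
> > > > >>>>>>>>>>>>>>>       void onTransactionFinished(long producerId,
> > > > >>>>>>>>>>>>>>>                                  short producerEpoch,
> > > > >>>>>>>>>>>>>>>                                  boolean committed,
> > > > >>>>>>>>>>>>>>>                                  Collection<TopicPartition> disklessPartitions);
> > > > >>>>>>>>>>>>>>>   }
> > > > >>>>>>>>>>>>>>>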
> > > > >>>>>>>>>>>>>>> Compared to the non-transactional produce flow, we
> > > > >>>> get:
> > > > >>>>>>>>>>>>>>> 1. An extra network round trip between brokers and the batch
> > > > >>>>>>>>>>>>>>> coordinator when a new partition appears in the transaction.
> > > > >>>>>>>>>>>>>>> To mitigate their impact:
> > > > >>>>>>>>>>>>>>> - The results will be cached.
> > > > >>>>>>>>>>>>>>> - The calls for multiple partitions in one Produce
> > > > >>>>>>> request
> > > > >>>>>>>>> will be
> > > > >>>>>>>>>>>>>>> grouped.
> > > > >>>>>>>>>>>>>>> - The batch coordinator should be optimized for
> > > > >>>> fast
> > > > >>>>>>>>> response to
> > > > >>>>>>>>>>>> these
> > > > >>>>>>>>>>>>>>> RPCs.
> > > > >>>>>>>>>>>>>>> - The fact that a single producer normally will
> > > > >>>>>>> communicate
> > > > >>>>>>>>> with a
> > > > >>>>>>>>>>>>>>> single broker for the duration of the transaction
> > > > >>>> further
> > > > >>>>>>>>> reduces the
> > > > >>>>>>>>>>>>>>> expected number of round trips.
> > > > >>>>>>>>>>>>>>> 2. An extra round trip between the transaction
> > > > >>>>>> coordinator
> > > > >>>>>>> and
> > > > >>>>>>>>> batch
> > > > >>>>>>>>>>>>>>> coordinator when a transaction is finished.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> With this proposal, transactions will also be able to
> > > > >>>>>> span
> > > > >>>>>>> both
> > > > >>>>>>>>>>>> classic
> > > > >>>>>>>>>>>>>>> and Diskless topics.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> *Queues*
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> The share group coordination and management is a side job
> > > > >>>>>>>>>>>>>>> that doesn't interfere with the topic itself (leadership,
> > > > >>>>>>>>>>>>>>> replicas, physical storage of records, etc.) or with
> > > > >>>>>>>>>>>>>>> non-queue producers and consumers (Fetch and Produce RPCs
> > > > >>>>>>>>>>>>>>> and consumer group-related RPCs are not affected). We don't
> > > > >>>>>>>>>>>>>>> see any reason why we can't make Diskless topics compatible
> > > > >>>>>>>>>>>>>>> with share groups the same way as classic topics are. Even
> > > > >>>>>>>>>>>>>>> on the code level, we don't expect any serious refactoring:
> > > > >>>>>>>>>>>>>>> the same reading routines that are used for fetching (e.g.
> > > > >>>>>>>>>>>>>>> ReplicaManager.readFromLog) would be used.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Should the KIPs be modified to include this, or is it too
> > > > >>>>>>>>>>>>>>> implementation-focused?
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Best regards,
> > > > >>>>>>>>>>>>>>> Ivan
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On Wed, Sep 3, 2025, at 21:59, Greg Harris wrote:
> > > > >>>>>>>>>>>>>>>> Hi all,
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Thank you all for your questions and design input
> > > > >>>> on
> > > > >>>>>>>>> KIP-1150.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> We have just updated KIP-1150 and KIP-1163 with a
> > > > >>>> new
> > > > >>>>>>>>> design. To
> > > > >>>>>>>>>>>>>>> summarize
> > > > >>>>>>>>>>>>>>>> the changes:
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> 1. The design prioritizes integrating with the
> > > > >>>> existing
> > > > >>>>>>>>> KIP-405
> > > > >>>>>>>>>>>> Tiered
> > > > >>>>>>>>>>>>>>>> Storage interfaces, permitting data produced to a
> > > > >>>>>>> Diskless
> > > > >>>>>>>>> topic
> > > > >>>>>>>>>>>> to be
> > > > >>>>>>>>>>>>>>>> moved to tiered storage.
> > > > >>>>>>>>>>>>>>>> This lowers the scalability requirements for the
> > > > >>>> Batch
> > > > >>>>>>>>> Coordinator
> > > > >>>>>>>>>>>>>>>> component, and allows Diskless to compose with
> > > > >>>> Tiered
> > > > >>>>>>> Storage
> > > > >>>>>>>>>>>> plugin
> > > > >>>>>>>>>>>>>>>> features such as encryption and alternative data
> > > > >>>>>> formats.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> 2. Consumer fetches are now served from local
> > > > >>>> segments,
> > > > >>>>>>>>> making use
> > > > >>>>>>>>>>>> of
> > > > >>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>> indexes, page cache, request purgatory, and
> > > > >>>> zero-copy
> > > > >>>>>>>>> functionality
> > > > >>>>>>>>>>>>>>> already
> > > > >>>>>>>>>>>>>>>> built into classic topics.
> > > > >>>>>>>>>>>>>>>> However, local segments are now considered cache
> > > > >>>>>>> elements,
> > > > >>>>>>>>> do not
> > > > >>>>>>>>>>>> need
> > > > >>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>> be durably stored, and can be built without
> > > > >>>> contacting
> > > > >>>>>>> any
> > > > >>>>>>>>> other
> > > > >>>>>>>>>>>>>>> replicas.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> 3. The design has been simplified substantially, by
> > > > >>>>>>> removing
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>>>> previous
> > > > >>>>>>>>>>>>>>>> Diskless consume flow, distributed cache
> > > > >>>> component, and
> > > > >>>>>>>>> "object
> > > > >>>>>>>>>>>>>>>> compaction/merging" step.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> The design maintains leaderless produces as
> > > > >>>> enabled by
> > > > >>>>>>> the
> > > > >>>>>>>>> Batch
> > > > >>>>>>>>>>>>>>>> Coordinator, and the same latency profiles as the
> > > > >>>>>> earlier
> > > > >>>>>>>>> design,
> > > > >>>>>>>>>>>> while
> > > > >>>>>>>>>>>>>>>> being simpler and integrating better into the
> > > > >>>> existing
> > > > >>>>>>>>> ecosystem.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Thanks, and we are eager to hear your feedback on
> > > > >>>> the
> > > > >>>>>> new
> > > > >>>>>>>>> design.
> > > > >>>>>>>>>>>>>>>> Greg Harris
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> On Mon, Jul 21, 2025 at 3:30 PM Jun Rao
> > > > >>>>>>>>> <[email protected]>
> > > > >>>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Hi, Jan,
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> For me, the main gap of KIP-1150 is the support
> > > > >>>> of
> > > > >>>>>> all
> > > > >>>>>>>>> existing
> > > > >>>>>>>>>>>>>> client
> > > > >>>>>>>>>>>>>>>>> APIs. Currently, there is no design for
> > > > >>>> supporting
> > > > >>>>>> APIs
> > > > >>>>>>>>> like
> > > > >>>>>>>>>>>>>>> transactions
> > > > >>>>>>>>>>>>>>>>> and queues.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Jun
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski
> > > > >>>>>>>>>>>>>>>>> <[email protected]> wrote:
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Would it be a good time to ask for the current status of
> > > > >>>>>>>>>>>>>>>>>> this KIP? I haven't seen much activity here for the past
> > > > >>>>>>>>>>>>>>>>>> 2 months; the vote got vetoed, but I think the pending
> > > > >>>>>>>>>>>>>>>>>> questions have been answered since then. KIP-1183
> > > > >>>>>>>>>>>>>>>>>> (AutoMQ's proposal) also hasn't had any activity since
> > > > >>>>>>>>>>>>>>>>>> May.
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> In my eyes KIP-1150 and KIP-1183 are two real
> > > > >>>>>> choices
> > > > >>>>>>>>> that can
> > > > >>>>>>>>>>>> be
> > > > >>>>>>>>>>>>>>>>>> made, with a coordinator-based approach being
> > > > >>>> by
> > > > >>>>>> far
> > > > >>>>>>> the
> > > > >>>>>>>>>>>> dominant
> > > > >>>>>>>>>>>>>> one
> > > > >>>>>>>>>>>>>>>>> when
> > > > >>>>>>>>>>>>>>>>>> it comes to market adoption - but all these are
> > > > >>>>>>>>> standalone
> > > > >>>>>>>>>>>>>> products.
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> I'm a big fan of both approaches, but would
> > > > >>>> hate to
> > > > >>>>>>> see a
> > > > >>>>>>>>>>>> stall. So
> > > > >>>>>>>>>>>>>>> the
> > > > >>>>>>>>>>>>>>>>>> question is: can we get an update?
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Maybe it's time to start another vote? Colin
> > > > >>>>>> McCabe -
> > > > >>>>>>>>> have your
> > > > >>>>>>>>>>>>>>> questions
> > > > >>>>>>>>>>>>>>>>>> been answered? If not, is there anything I can do to
> > > > >>>>>>>>>>>>>>>>>> help? I'm deeply familiar with both architectures and
> > > > >>>>>>>>>>>>>>>>>> have written about both.
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> Kind regards,
> > > > >>>>>>>>>>>>>>>>>> Jan
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>> On Tue, Jun 24, 2025 at 10:42 AM Stanislav
> > > > >>>>>> Kozlovski
> > > > >>>>>>> <
> > > > >>>>>>>>>>>>>>>>>> [email protected]> wrote:
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> I have some nits - it may be useful to
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> a) group all the KIP email threads in the
> > > > >>>> main
> > > > >>>>>> one
> > > > >>>>>>>>> (just a
> > > > >>>>>>>>>>>> bunch
> > > > >>>>>>>>>>>>>> of
> > > > >>>>>>>>>>>>>>>>> links
> > > > >>>>>>>>>>>>>>>>>>> to everything)
> > > > >>>>>>>>>>>>>>>>>>> b) create the email threads
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> It's a bit hard to track it all - for
> > > > >>>> example, I
> > > > >>>>>>> was
> > > > >>>>>>>>>>>> searching
> > > > >>>>>>>>>>>>>> for
> > > > >>>>>>>>>>>>>>> a
> > > > >>>>>>>>>>>>>>>>>>> discuss thread for KIP-1165 for a while; As
> > > > >>>> far
> > > > >>>>>> as
> > > > >>>>>>> I
> > > > >>>>>>>>> can
> > > > >>>>>>>>>>>> tell, it
> > > > >>>>>>>>>>>>>>>>> doesn't
> > > > >>>>>>>>>>>>>>>>>>> exist yet.
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> Since the KIPs are published (by virtue of having the
> > > > >>>>>>>>>>>>>>>>>>> root KIP be published, having a DISCUSS thread, and
> > > > >>>>>>>>>>>>>>>>>>> having links to the sub-KIPs that the discussion was
> > > > >>>>>>>>>>>>>>>>>>> aimed to move towards), I think it would be good to
> > > > >>>>>>>>>>>>>>>>>>> create DISCUSS threads for them all.
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> Best,
> > > > >>>>>>>>>>>>>>>>>>> Stan
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>> On 2025/04/16 11:58:22 Josep Prat wrote:
> > > > >>>>>>>>>>>>>>>>>>>> Hi Kafka Devs!
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> We want to start a new KIP discussion about
> > > > >>>>>>>>> introducing a
> > > > >>>>>>>>>>>> new
> > > > >>>>>>>>>>>>>>> type of
> > > > >>>>>>>>>>>>>>>>>>>> topics that would make use of Object
> > > > >>>> Storage as
> > > > >>>>>>> the
> > > > >>>>>>>>> primary
> > > > >>>>>>>>>>>>>>> source of
> > > > >>>>>>>>>>>>>>>>>>>> storage. However, as this KIP is big we
> > > > >>>> decided
> > > > >>>>>>> to
> > > > >>>>>>>>> split it
> > > > >>>>>>>>>>>>>> into
> > > > >>>>>>>>>>>>>>>>>> multiple
> > > > >>>>>>>>>>>>>>>>>>>> related KIPs.
> > > > >>>>>>>>>>>>>>>>>>>> We have the motivational KIP-1150 (
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > > > >>>>>>>>>>>>>>>>>>> )
> > > > >>>>>>>>>>>>>>>>>>>> that aims to discuss whether Apache Kafka should aim to
> > > > >>>>>>>>>>>>>>>>>>>> have this type of feature at all. This KIP doesn't go
> > > > >>>>>>>>>>>>>>>>>>>> into details on how to implement it.
> > > > >>>>>>>>>>>>>>>>>>>> This follows the same approach used when we discussed
> > > > >>>>>>>>>>>>>>>>>>>> KRaft.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> But as we know that it is sometimes really
> > > > >>>> hard
> > > > >>>>>>> to
> > > > >>>>>>>>> discuss
> > > > >>>>>>>>>>>> on
> > > > >>>>>>>>>>>>>>> that
> > > > >>>>>>>>>>>>>>>>> meta
> > > > >>>>>>>>>>>>>>>>>>>> level, we also created several sub-kips
> > > > >>>> (linked
> > > > >>>>>>> in
> > > > >>>>>>>>>>>> KIP-1150)
> > > > >>>>>>>>>>>>>> that
> > > > >>>>>>>>>>>>>>>>> offer
> > > > >>>>>>>>>>>>>>>>>>> an
> > > > >>>>>>>>>>>>>>>>>>>> implementation of this feature.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> We kindly ask you to use the proper DISCUSS
> > > > >>>>>>> threads
> > > > >>>>>>>>> for
> > > > >>>>>>>>>>>> each
> > > > >>>>>>>>>>>>>>> type of
> > > > >>>>>>>>>>>>>>>>>>>> concern and keep this one to discuss
> > > > >>>> whether
> > > > >>>>>>> Apache
> > > > >>>>>>>>> Kafka
> > > > >>>>>>>>>>>> wants
> > > > >>>>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>>>>> have
> > > > >>>>>>>>>>>>>>>>>>>> this feature or not.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> Thanks in advance on behalf of all the
> > > > >>>> authors
> > > > >>>>>> of
> > > > >>>>>>>>> this KIP.
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>> ------------------
> > > > >>>>>>>>>>>>>>>>>>>> Josep Prat
> > > > >>>>>>>>>>>>>>>>>>>> Open Source Engineering Director, Aiven
> > > > >>>>>>>>>>>>>>>>>>>> [email protected]   |   +491715557497 |
> > > > >>>>>>> aiven.io
> > > > >>>>>>>>>>>>>>>>>>>> Aiven Deutschland GmbH
> > > > >>>>>>>>>>>>>>>>>>>> Alexanderufer 3-7, 10117 Berlin
> > > > >>>>>>>>>>>>>>>>>>>> Geschäftsführer: Oskari Saarenmaa, Hannu
> > > > >>>>>>> Valtonen,
> > > > >>>>>>>>>>>>>>>>>>>> Anna Richardson, Kenneth Chen
> > > > >>>>>>>>>>>>>>>>>>>> Amtsgericht Charlottenburg, HRB 209739 B
> > > > >>>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>


-- 
Anatolii Popov
Senior Software Developer, *Aiven OY*
m: +358505126242
w: aiven.io  e: [email protected]