Re: [DISCUSS] KIP-1150 Diskless Topics

Justine Olshan Wed, 17 Sep 2025 10:42:06 -0700

Hey Ivan,

Thanks for the response. I think most of what you said made sense, but I
did have some questions about this part:


> As we understand this, the partition leader in classic topics forgets
> about a transaction once it’s replicated (HWM overpasses it). The
> transaction coordinator acts like the main guardian, allowing partition
> leaders to do this safely. Please correct me if this is wrong. We think
> about relying on this with the batch coordinator and delete the information
> about a transaction once it’s finished (as there’s no replication and HWM
> advances immediately).

I didn't quite understand this. In classic topics, we have maps for ongoing
transactions which remove state when the transaction is completed and an
aborted transactions index which is retained for much longer. Once the
transaction is completed, the coordinator is no longer involved in
maintaining this partition side state, and it is subject to compaction etc.
Looking back at the outline provided above, I didn't see much about the
fetch path, so maybe that could be expanded a bit further. I saw the
following in a response:
> When the broker constructs a fully valid local segment, all the necessary
> control batches will be inserted and indices, including the transaction
> index will be built to serve FetchRequests exactly as they are today.

Based on this, it seems like we need to retain the information about
aborted txns for longer.
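
To make the concern concrete, here is a rough sketch in Python of the two
pieces of partition-side state described above. It is purely illustrative:
the class, method, and field names are hypothetical and do not mirror
ProducerStateManager or the real transaction index. It only shows that the
ongoing-transaction entry can be dropped on completion while the aborted
range still has to be available to later fetches.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AbortedTxn:
    """One aborted-transaction entry (hypothetical shape)."""
    producer_id: int
    first_offset: int
    last_offset: int  # offset of the abort marker


@dataclass
class PartitionTxnState:
    # producer_id -> first offset of the open transaction; short-lived state.
    ongoing: dict = field(default_factory=dict)
    # Kept after completion, for as long as the covered data can be fetched.
    aborted_index: list = field(default_factory=list)

    def begin(self, producer_id, first_offset):
        # Only the first offset of an open transaction is remembered.
        self.ongoing.setdefault(producer_id, first_offset)

    def end(self, producer_id, marker_offset, committed):
        # Completion removes the ongoing entry ...
        first_offset = self.ongoing.pop(producer_id)
        if not committed:
            # ... but an abort leaves a long-lived entry behind.
            self.aborted_index.append(
                AbortedTxn(producer_id, first_offset, marker_offset))

    def last_stable_offset(self, high_watermark):
        # LSO is the first offset of the earliest open transaction, if any.
        return min(self.ongoing.values(), default=high_watermark)

    def aborted_in_range(self, fetch_start, fetch_end):
        # Aborted ranges overlapping [fetch_start, fetch_end) that a
        # read_committed consumer needs in order to skip aborted records.
        return [t for t in self.aborted_index
                if t.last_offset >= fetch_start and t.first_offset < fetch_end]


state = PartitionTxnState()
state.begin(producer_id=7, first_offset=100)
state.end(producer_id=7, marker_offset=105, committed=False)
# The ongoing entry is gone, yet a much later fetch still needs the range.
later_fetch = state.aborted_in_range(fetch_start=90, fetch_end=200)
```

If the batch coordinator deleted everything about the transaction in end(),
the last call would have nothing to return when the local segment is built.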

Thanks,
Justine

On Mon, Sep 15, 2025 at 9:43 AM Ivan Yurchenko <[email protected]> wrote:

> Hi Justine and all,
>
> Thank you for your questions!
>
> > JO 1. >Since a transaction could be uniquely identified with producer ID
> > and epoch, the positive result of this check could be cached locally
> > Are we saying that only new transaction version 2 transactions can be used
> > here? If not, we can't uniquely identify transactions with producer id +
> > epoch
>
> You’re right that we (probably unintentionally) focused only on version 2.
> We can either limit the support to version 2 or consider using some
> surrogates to support version 1.
>
> > JO 2. >The batch coordinator does the final transactional checks of the
> > batches. This procedure would output the same errors as the partition
> > leader in classic topics would.
> > Can you expand on what these checks are? Would you be checking if the
> > transaction was still ongoing for example?
>
> Yes: the producer epoch, that the transaction is ongoing, and of course
> the normal idempotence checks. These are the checks the partition leader in
> classic topics does before appending a batch to the local log (e.g. in
> UnifiedLog.maybeStartTransactionVerification and
> UnifiedLog.analyzeAndValidateProducerState). In Diskless, we unfortunately
> cannot do these checks before appending the data to the WAL segment and
> uploading it, but we can “tombstone” these batches in the batch coordinator
> during the final commit.
>
> > Is there state about ongoing
> > transactions in the batch coordinator? I see some other state mentioned in
> > the End transaction section, but it's not super clear what state is stored
> > and when it is stored.
>
> Right, this should have been more explicit. As the partition leader tracks
> ongoing transactions for classic topics, the batch coordinator has to as
> well. So when a transaction starts and ends, the transaction coordinator
> must inform the batch coordinator about this.
>
> > JO 3. I didn't see anything about maintaining LSO -- perhaps that would be
> > stored in the batch coordinator?
>
> Yes. This could be deduced from the committed batches and other
> information, but for the sake of performance we’d better store it
> explicitly.
>
> > JO 4. Are there any thoughts about how long transactional state is
> > maintained in the batch coordinator and how it will be cleaned up?
>
> As we understand this, the partition leader in classic topics forgets
> about a transaction once it’s replicated (HWM overpasses it). The
> transaction coordinator acts like the main guardian, allowing partition
> leaders to do this safely. Please correct me if this is wrong. We think
> about relying on this with the batch coordinator and delete the information
> about a transaction once it’s finished (as there’s no replication and HWM
> advances immediately).
>
> Best,
> Ivan
>
> On Tue, Sep 9, 2025, at 00:38, Justine Olshan wrote:
> > Hey folks,
> >
> > Excited to see some updates related to transactions!
> >
> > I had a few questions.
> >
> > JO 1. >Since a transaction could be uniquely identified with producer ID
> > and epoch, the positive result of this check could be cached locally
> > Are we saying that only new transaction version 2 transactions can be
> used
> > here? If not, we can't uniquely identify transactions with producer id +
> > epoch
> >
> > JO 2. >The batch coordinator does the final transactional checks of the
> > batches. This procedure would output the same errors as the partition
> > leader in classic topics would.
> > Can you expand on what these checks are? Would you be checking if the
> > transaction was still ongoing for example? Is there state about ongoing
> > transactions in the batch coordinator? I see some other state mentioned
> in
> > the End transaction section, but it's not super clear what state is
> stored
> > and when it is stored.
> >
> > JO 3. I didn't see anything about maintaining LSO -- perhaps that would
> be
> > stored in the batch coordinator?
> >
> > JO 4. Are there any thoughts about how long transactional state is
> > maintained in the batch coordinator and how it will be cleaned up?
> >
> > On Mon, Sep 8, 2025 at 10:38 AM Jun Rao <[email protected]>
> wrote:
> >
> > > Hi, Greg and Ivan,
> > >
> > > Thanks for the update. A few comments.
> > >
> > > JR 10. "Consumer fetches are now served from local segments, making
> use of
> > > the
> > > indexes, page cache, request purgatory, and zero-copy functionality
> already
> > > built into classic topics."
> > > JR 10.1 Does the broker build the producer state for each partition in
> > > diskless topics?
> > > JR 10.2 For transactional data, the consumer fetches need to know
> aborted
> > > records. How is that achieved?
> > >
> > > JR 11. "The batch coordinator saves that the transaction is finished
> and
> > > also inserts the control batches in the corresponding logs of the
> involved
> > > Diskless topics. This happens only on the metadata level, no actual
> control
> > > batches are written to any file. "
> > > A fetch response could include multiple transactional batches. How
> does the
> > > broker obtain the information about the ending control batch for each
> > > batch? Does that mean that a fetch response needs to be built by
> > > stitching record batches and generated control batches together?
> > >
> > > JR 12. Queues: Is there still a share partition leader that all
> consumers
> > > are routed to?
> > >
> > > JR 13. "Should the KIPs be modified to include this or it's too
> > > implementation-focused?" It would be useful to include enough details
> to
> > > understand correctness and performance impact.
> > >
> > > HC5. Henry has a valid point. Requests from a given producer contain a
> > > sequence number, which is ordered. If a producer sends every Produce
> > > request to an arbitrary broker, those requests could reach the batch
> > > coordinator in different order and lead to rejection of the produce
> > > requests.
> > >
> > > Jun
> > >
> > > On Thu, Sep 4, 2025 at 12:00 AM Ivan Yurchenko <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We have also thought in a bit more details about transactions and
> queues,
> > > > here's the plan.
> > > >
> > > > *Transactions*
> > > >
> > > > The support for transactions in *classic topics* is based on precise
> > > > interactions between three actors: clients (mostly producers, but
> also
> > > > consumers), brokers (ReplicaManager and other classes), and
> transaction
> > > > coordinators. Brokers also run partition leaders with their local
> state
> > > > (ProducerStateManager and others).
> > > >
> > > > The high level (some details skipped) workflow is the following.
> When a
> > > > transactional Produce request is received by the broker:
> > > > 1. For each partition, the partition leader checks if a non-empty
> > > > transaction is running for this partition. This is done using its
> local
> > > > state derived from the log metadata (ProducerStateManager,
> > > > VerificationStateEntry, VerificationGuard).
> > > > 2. The transaction coordinator is informed about all the partitions
> that
> > > > aren’t part of the transaction to include them.
> > > > 3. The partition leaders do additional transactional checks.
> > > > 4. The partition leaders append the transactional data to their logs
> and
> > > > update some of their state (for example, log the fact that the
> > > transaction
> > > > is running for the partition and its first offset).
> > > >
> > > > When the transaction is committed or aborted:
> > > > 1. The producer contacts the transaction coordinator directly with
> > > > EndTxnRequest.
> > > > 2. The transaction coordinator writes PREPARE_COMMIT or
> PREPARE_ABORT to
> > > > its log and responds to the producer.
> > > > 3. The transaction coordinator sends WriteTxnMarkersRequest to the
> > > leaders
> > > > of the involved partitions.
> > > > 4. The partition leaders write the transaction markers to their logs
> and
> > > > respond to the coordinator.
> > > > 5. The coordinator writes the final transaction state
> COMPLETE_COMMIT or
> > > > COMPLETE_ABORT.
> > > >
> > > > In classic topics, partitions have leaders and lots of important
> state
> > > > necessary for supporting this workflow is local. The main challenge
> in
> > > > mapping this to Diskless comes from the fact there are no partition
> > > > leaders, so the corresponding pieces of state need to be globalized
> in
> > > the
> > > > batch coordinator. We are already doing this to support idempotent
> > > produce.
> > > >
> > > > The high level workflow for *diskless topics* would look very
> similar:
> > > > 1. For each partition, the broker checks if a non-empty transaction
> is
> > > > running for this partition. In contrast to classic topics, this is
> > > checked
> > > > against the batch coordinator with a single RPC. Since a transaction
> > > could
> > > > be uniquely identified with producer ID and epoch, the positive
> result of
> > > > this check could be cached locally (for double the configured duration of a
> > > > transaction, for example).
> > > > 2. The same: The transaction coordinator is informed about all the
> > > > partitions that aren’t part of the transaction to include them.
> > > > 3. No transactional checks are done on the broker side.
> > > > 4. The broker appends the transactional data to the current shared
> WAL
> > > > segment. It doesn’t update any transaction-related state for Diskless
> > > > topics, because it doesn’t have any.
> > > > 5. The WAL segment is committed to the batch coordinator like in the
> > > > normal produce flow.
> > > > 6. The batch coordinator does the final transactional checks of the
> > > > batches. This procedure would output the same errors as the partition
> > > > leader in classic topics would, i.e. some batches could be rejected.
> > > > This means, there will potentially be garbage in the WAL segment
> file in
> > > > case of transactional errors. This is preferable to doing more
> network
> > > > round trips, especially considering the WAL segments will be
> relatively
> > > > short-lived (see Greg's update above).
> > > >
> > > > When the transaction is committed or aborted:
> > > > 1. The producer contacts the transaction coordinator directly with
> > > > EndTxnRequest.
> > > > 2. The transaction coordinator writes PREPARE_COMMIT or
> PREPARE_ABORT to
> > > > its log and responds to the producer.
> > > > 3. *[NEW]* The transaction coordinator informs the batch coordinator
> that
> > > > the transaction is finished.
> > > > 4. *[NEW]* The batch coordinator saves that the transaction is
> finished
> > > > and also inserts the control batches in the corresponding logs of the
> > > > involved Diskless topics. This happens only on the metadata level, no
> > > > actual control batches are written to any file. They will be
> dynamically
> > > > created on Fetch and other read operations. We could technically
> write
> > > > these control batches for real, but this would mean extra produce
> > > latency,
> > > > so it's better just to mark them in the batch coordinator and save
> these
> > > > milliseconds.
> > > > 5. The transaction coordinator sends WriteTxnMarkersRequest to the
> > > leaders
> > > > of the involved partitions. – Only to classic topics now.
> > > > 6. The partition leaders of classic topics write the transaction
> markers
> > > > to their logs and respond to the coordinator.
> > > > 7. The coordinator writes the final transaction state
> COMPLETE_COMMIT or
> > > > COMPLETE_ABORT.
> > > >
> > > > Compared to the non-transactional produce flow, we get:
> > > > 1. An extra network round trip between brokers and the batch
> coordinator
> > > > when a new partition appears in the transaction. To mitigate the impact of
> > > > these round trips:
> > > >   - The results will be cached.
> > > >   - The calls for multiple partitions in one Produce request will be
> > > > grouped.
> > > >   - The batch coordinator should be optimized for fast response to
> these
> > > > RPCs.
> > > >   - The fact that a single producer normally will communicate with a
> > > > single broker for the duration of the transaction further reduces the
> > > > expected number of round trips.
> > > > 2. An extra round trip between the transaction coordinator and batch
> > > > coordinator when a transaction is finished.
> > > >
> > > > With this proposal, transactions will also be able to span both
> classic
> > > > and Diskless topics.
> > > >
> > > > *Queues*
> > > >
> > > > The share group coordination and management is a side job that
> doesn't
> > > > interfere with the topic itself (leadership, replicas, physical
> storage
> > > of
> > > > records, etc.) and non-queue producers and consumers (Fetch and
> Produce
> > > > RPCs, consumer group-related RPCs are not affected.) We don't see any
> > > > reason why we can't make Diskless topics compatible with share
> groups the
> > > > same way as classic topics are. Even on the code level, we don't
> expect
> > > any
> > > > serious refactoring: the same reading routines are used that are
> used for
> > > > fetching (e.g. ReplicaManager.readFromLog).
> > > >
> > > >
> > > > Should the KIPs be modified to include this or it's too
> > > > implementation-focused?
> > > >
> > > > Best regards,
> > > > Ivan
> > > >
> > > > On Wed, Sep 3, 2025, at 21:59, Greg Harris wrote:
> > > > > Hi all,
> > > > >
> > > > > Thank you all for your questions and design input on KIP-1150.
> > > > >
> > > > > We have just updated KIP-1150 and KIP-1163 with a new design. To
> > > > summarize
> > > > > the changes:
> > > > >
> > > > > 1. The design prioritizes integrating with the existing KIP-405
> Tiered
> > > > > Storage interfaces, permitting data produced to a Diskless topic
> to be
> > > > > moved to tiered storage.
> > > > > This lowers the scalability requirements for the Batch Coordinator
> > > > > component, and allows Diskless to compose with Tiered Storage
> plugin
> > > > > features such as encryption and alternative data formats.
> > > > >
> > > > > 2. Consumer fetches are now served from local segments, making use
> of
> > > the
> > > > > indexes, page cache, request purgatory, and zero-copy functionality
> > > > already
> > > > > built into classic topics.
> > > > > However, local segments are now considered cache elements, do not
> need
> > > to
> > > > > be durably stored, and can be built without contacting any other
> > > > replicas.
> > > > >
> > > > > 3. The design has been simplified substantially, by removing the
> > > previous
> > > > > Diskless consume flow, distributed cache component, and "object
> > > > > compaction/merging" step.
> > > > >
> > > > > The design maintains leaderless produces as enabled by the Batch
> > > > > Coordinator, and the same latency profiles as the earlier design,
> while
> > > > > being simpler and integrating better into the existing ecosystem.
> > > > >
> > > > > Thanks, and we are eager to hear your feedback on the new design.
> > > > > Greg Harris
> > > > >
> > > > > On Mon, Jul 21, 2025 at 3:30 PM Jun Rao <[email protected]>
> > > > wrote:
> > > > >
> > > > > > Hi, Jan,
> > > > > >
> > > > > > For me, the main gap of KIP-1150 is the support of all existing
> > > client
> > > > > > APIs. Currently, there is no design for supporting APIs like
> > > > transactions
> > > > > > and queues.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Would it be a good time to ask for the current status of this
> KIP?
> > > I
> > > > > > > haven't seen much activity here for the past 2 months, the
> vote got
> > > > > > vetoed
> > > > > > > but I think the pending questions have been answered since
> then.
> > > > KIP-1183
> > > > > > > (AutoMQ's proposal) also didn't have any activity since May.
> > > > > > >
> > > > > > > In my eyes KIP-1150 and KIP-1183 are two real choices that can
> be
> > > > > > > made, with a coordinator-based approach being by far the
> dominant
> > > one
> > > > > > when
> > > > > > > it comes to market adoption - but all these are standalone
> > > products.
> > > > > > >
> > > > > > > I'm a big fan of both approaches, but would hate to see a
> stall. So
> > > > the
> > > > > > > question is: can we get an update?
> > > > > > >
> > > > > > > Maybe it's time to start another vote? Colin McCabe - have your
> > > > questions
> > > > > > > been answered? If not, is there anything I can do to help? I'm
> > > deeply
> > > > > > > > familiar with both architectures and have written about both.
> > > > > > >
> > > > > > > Kind regards,
> > > > > > > Jan
> > > > > > >
> > > > > > > On Tue, Jun 24, 2025 at 10:42 AM Stanislav Kozlovski <
> > > > > > > [email protected]> wrote:
> > > > > > >
> > > > > > > > I have some nits - it may be useful to
> > > > > > > >
> > > > > > > > a) group all the KIP email threads in the main one (just a
> bunch
> > > of
> > > > > > links
> > > > > > > > to everything)
> > > > > > > > b) create the email threads
> > > > > > > >
> > > > > > > > It's a bit hard to track it all - for example, I was
> searching
> > > for
> > > > a
> > > > > > > > discuss thread for KIP-1165 for a while; As far as I can
> tell, it
> > > > > > doesn't
> > > > > > > > exist yet.
> > > > > > > >
> > > > > > > > Since the KIPs are published (by virtue of having the root
> KIP be
> > > > > > > > published, having a DISCUSS thread and links to sub-KIPs
> where
> > > were
> > > > > > aimed
> > > > > > > > to move the discussion towards), I think it would be good to
> > > create
> > > > > > > DISCUSS
> > > > > > > > threads for them all.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Stan
> > > > > > > >
> > > > > > > > On 2025/04/16 11:58:22 Josep Prat wrote:
> > > > > > > > > Hi Kafka Devs!
> > > > > > > > >
> > > > > > > > > We want to start a new KIP discussion about introducing a
> new
> > > > type of
> > > > > > > > > topics that would make use of Object Storage as the primary
> > > > source of
> > > > > > > > > storage. However, as this KIP is big we decided to split it
> > > into
> > > > > > > multiple
> > > > > > > > > related KIPs.
> > > > > > > > > > We have the motivational KIP-1150 (
> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > > > > > > > > > )
> > > > > > > > > that aims to discuss if Apache Kafka should aim to have
> this
> > > > type of
> > > > > > > > > > feature at all. This KIP doesn't go into details on how to
> > > > implement
> > > > > > > it.
> > > > > > > > > This follows the same approach used when we discussed
> KRaft.
> > > > > > > > >
> > > > > > > > > But as we know that it is sometimes really hard to discuss
> on
> > > > that
> > > > > > meta
> > > > > > > > > level, we also created several sub-kips (linked in
> KIP-1150)
> > > that
> > > > > > offer
> > > > > > > > an
> > > > > > > > > implementation of this feature.
> > > > > > > > >
> > > > > > > > > We kindly ask you to use the proper DISCUSS threads for
> each
> > > > type of
> > > > > > > > > concern and keep this one to discuss whether Apache Kafka
> wants
> > > > to
> > > > > > have
> > > > > > > > > this feature or not.
> > > > > > > > >
> > > > > > > > > Thanks in advance on behalf of all the authors of this KIP.
> > > > > > > > >
> > > > > > > > > ------------------
> > > > > > > > > Josep Prat
> > > > > > > > > Open Source Engineering Director, Aiven
> > > > > > > > > [email protected]   |   +491715557497 | aiven.io
> > > > > > > > > Aiven Deutschland GmbH
> > > > > > > > > Alexanderufer 3-7, 10117 Berlin
> > > > > > > > > Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
> > > > > > > > > Anna Richardson, Kenneth Chen
> > > > > > > > > Amtsgericht Charlottenburg, HRB 209739 B
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
