Re: [DISCUSS] KIP-1150 Diskless Topics

Ivan Yurchenko Sat, 18 Oct 2025 12:00:11 -0700

Hi Jun,

Thank you for your message. I’m sorry that I failed to clearly explain the 
idea. Let me try to fix this.


> Does each partition now have a metadata partition and a separate data
> partition? If so, I am concerned that it essentially doubles the number of
> partitions, which impacts the number of open file descriptors and the
> required IOPS, and so on. It also seems wasteful to have a separate
> partition just to store the metadata. It's as if we are creating an
> internal topic with an unbounded number of partitions.

No. There will be only one physical partition per diskless partition. Let me 
explain this with an example. Let’s say we have a diskless partition topic-0. 
It has three replicas 0, 1, 2; 0 is the leader. We produce some batches to this 
partition. The content of the segment file will be something like this (for 
each batch):

BaseOffset: 00000000000000000000 (like in classic)
Length: 123456 (like in classic)
PartitionLeaderEpoch: like in classic
Magic: like in classic
CRC: like in classic
Attributes: like in classic
LastOffsetDelta: like in classic
BaseTimestamp: like in classic
MaxTimestamp: like in classic
ProducerId: like in classic
ProducerEpoch: like in classic
BaseSequence: like in classic
RecordsCount: like in classic
Records:
  path/to/wal/files/5b55c4bb-f52a-4204-aea6-81226895158a; byte offset 123456

It looks very much like classic log entries. The only difference is that 
instead of writing real Records, we write the reference to the WAL file with 
the batch data (I guess we need only the name and the byte offset, because the 
byte length is the standard field above). Otherwise, it’s a normal Kafka log 
with the leader and replicas. 
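
To make the layout above concrete, here is a minimal sketch of such a log entry as a data structure. The class and field names are hypothetical (nothing like this exists in Kafka today), only a subset of the header fields is shown, and it assumes that Length is interpreted as the byte length of the batch data inside the WAL file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalReference:
    """Hypothetical pointer to batch data inside a shared WAL file."""
    wal_file: str      # object storage key of the WAL file
    byte_offset: int   # where the batch data starts within that file


@dataclass(frozen=True)
class DisklessBatchEntry:
    """One log entry: classic batch header fields plus a WAL reference."""
    base_offset: int
    length: int              # assumed: byte length of the batch data
    last_offset_delta: int
    producer_id: int
    producer_epoch: int
    base_sequence: int
    records: WalReference    # replaces the inline Records payload

    def byte_range(self) -> tuple[int, int]:
        """Half-open [start, end) range to read from the WAL file."""
        start = self.records.byte_offset
        return start, start + self.length


entry = DisklessBatchEntry(
    base_offset=0, length=123456, last_offset_delta=9,
    producer_id=1, producer_epoch=0, base_sequence=0,
    records=WalReference(
        "path/to/wal/files/5b55c4bb-f52a-4204-aea6-81226895158a", 123456),
)
print(entry.byte_range())  # (123456, 246912)
```

The file name plus the byte range is all a replica would need to fetch the batch from object storage, without knowing anything about the other partitions in the same WAL file.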

So we have as many partitions for diskless as for classic. As for open file 
descriptors, let’s move on to the next question:

> Are the metadata and
> the data for the same partition always collocated on the same broker? If
> so, how do we enforce that when replicas are reassigned?

The source of truth for the data is still in WAL files on object storage. The 
source of truth for the metadata is in segment files on the brokers in the 
replica set. Two new mechanisms are planned, both independent of this new 
proposal, but I want to present them to show that only a limited number of 
data files will be handled locally:
- We want to assemble batches into segment files and offload them to tiered 
storage in order to prevent the unbounded growth of batch metadata. For this, 
we need to open only a few file descriptors (for the segment file itself + the 
necessary indexes) before the segment is fully written and handed over to 
RemoteLogManager.
- We want to assemble local segment files for caching purposes as well, i.e. to 
speed up fetching. This will not materialize the full content of the log, but 
only the hot set according to some policy (or configurable policies), i.e. the 
number of segments and file descriptors will also be limited.

> The number of RPCs in the produce path is significantly higher. For
> example, if a produce request has 100 partitions, in a cluster with 100
> brokers, each produce request could generate 100 more RPC requests. This
> will significantly increase the request rate.

This is a valid concern that we considered, but this issue can be mitigated. 
I’ll try to explain the approach.
The situation with a single broker is trivial: all the commit requests go from 
the broker to itself. 
Let’s scale this to a multi-broker cluster located in a single rack 
(AZ). Any broker can accept Produce requests for diskless partitions, but we 
can tell producers (through metadata) to always send Produce requests to 
leaders. For example, broker 0 hosts the leader replicas for diskless 
partitions t1-0, t2-1, t3-0. It will receive diskless Produce requests for 
these partitions in various combinations, but only for them.

                    Broker 0
              +-----------------+
              |    t1-0         |
              |    t2-1 <--------------------+
              |    t3-0         |            |
 produce      | +-------------+ |            |
 requests     | |  diskless   | |            |
--------------->|   produce   +--------------+
 for these    | | WAL buffer  | |    commit requests
 partitions   | +-------------+ |    for these partitions
              |                 |
              +-----------------+

The same applies to the other brokers in this cluster. Each broker will commit 
only to itself, which means 1 commit request per WAL buffer (this may be 0 
physical network calls, if we wish, just a local function call).

Now let’s scale this to multiple racks (AZs). Obviously, we cannot always send 
Produce requests to the designated leaders of diskless partitions: this would 
mean inter-AZ network traffic, which we would like to avoid. To avoid it, we 
say that every broker has a “diskless produce representative” in every AZ. If 
we continue our example: when a Produce request for t1-0, t2-1, or t3-0 comes 
from a producer in AZ 0, it lands on broker 0 (in the broker’s AZ the 
representative is the broker itself). However, if it comes from AZ 1, it lands 
on broker 1; in AZ 2, it’s broker 2. 

  |produce requests         |produce requests        |produce requests
  |for t1-0, t2-1, t3-0     |for t1-0, t2-1, t3-0    |for t1-0, t2-1, t3-0
  |from AZ 0                |from AZ 1               |from AZ 2
  v                         v                        v
 Broker 0 (AZ 0)        Broker 1 (AZ 1)        Broker 2 (AZ 2)
+---------------+      +---------------+      +---------------+
|     t1-0      |      |               |      |               |
|     t2-1      |      |               |      |               |
|     t3-0      |      |               |      |               |
+---------------+      +--------+------+      +--------+------+
     ^     ^                    |                      |
     |     +--------------------+                      |
     |     commit requests for these partitions        |
     |                                                 |
     +-------------------------------------------------+
           commit requests for these partitions

All the partitions that broker 0 is the leader of will be “represented” by 
brokers 1 and 2 in their AZs.

Of course, this relationship goes both ways between AZs (not necessarily 
between the same brokers). Provided the cluster is balanced by the number of 
brokers per AZ, each broker will represent (number_of_azs - 1) other brokers. 
As a result, for the majority of commits, each broker will make up to 
(number_of_azs - 1) network commit requests (plus one local). Cloud regions 
tend to have 3 AZs, very rarely more. That means brokers will make up to 2 
network commit requests per WAL file.
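
The counting argument can be checked with a small sketch. The assignment policy here is only an assumption for illustration (a broker is represented in another AZ by the broker at the same index, modulo the AZ size); the proposal does not prescribe one.

```python
from collections import Counter


def representatives(brokers_by_az: dict[str, list[int]]) -> dict[tuple[int, str], int]:
    """Map (leader broker, client AZ) -> broker accepting Produce requests there.

    Hypothetical policy: within its own AZ a broker represents itself; in any
    other AZ it is represented by the broker at the same index (modulo AZ size).
    """
    mapping = {}
    for home_az, leaders in brokers_by_az.items():
        for index, leader in enumerate(leaders):
            for az, local in brokers_by_az.items():
                mapping[(leader, az)] = (
                    leader if az == home_az else local[index % len(local)])
    return mapping


def represented_counts(brokers_by_az: dict[str, list[int]]) -> dict[int, int]:
    """How many *other* brokers each broker represents."""
    counts = Counter({b: 0 for bs in brokers_by_az.values() for b in bs})
    for (leader, _az), rep in representatives(brokers_by_az).items():
        if rep != leader:
            counts[rep] += 1
    return dict(counts)


balanced = {"az0": [0, 3], "az1": [1, 4], "az2": [2, 5]}
print(represented_counts(balanced))
# {0: 2, 3: 2, 1: 2, 4: 2, 2: 2, 5: 2}
```

With 3 AZs and 2 brokers in each, every broker represents exactly 2 others, so a WAL buffer on any broker needs at most 2 network commit requests plus the local one.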

There are the following exceptions:
1. Broker count imbalance between AZs. For example, we have 2 AZs, where one 
has three brokers and the other has one. This one broker will make between 1 
and 3 commit requests per WAL file. This is not an extreme amplification. Such 
an imbalance is not healthy in most practical setups and should be avoided 
anyway.
2. Leadership changes and the metadata propagation period. When the partition 
t3-0 is relocated from broker 0 to some broker 3, the producers will not know 
this immediately (unless we want to be strict and respond with 
NOT_LEADER_OR_FOLLOWER). So if t1-0, t2-1, and t3-0 come together in a WAL 
buffer on broker 2, it will have to send two commit requests: to broker 0 to 
commit t1-0 and t2-1, and to broker 3 to commit t3-0. This situation is 
temporary: it resolves as producers refresh the cluster metadata.
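
The leadership-change case can be sketched as grouping the partitions of one WAL buffer by their current leader. The function name and data shapes are illustrative only, not part of the proposal.

```python
from collections import defaultdict


def plan_commit_requests(receiving_broker, wal_partitions, current_leaders):
    """Group the partitions of one WAL buffer by their current leader broker.

    Returns (local_partitions, {remote_broker: partitions}); each remote entry
    corresponds to one physical commit request.
    """
    by_leader = defaultdict(list)
    for partition in wal_partitions:
        by_leader[current_leaders[partition]].append(partition)
    local = by_leader.pop(receiving_broker, [])
    return local, dict(by_leader)


wal_buffer = ["t1-0", "t2-1", "t3-0"]

# Steady state: broker 0 leads all three partitions.
_, remote_steady = plan_commit_requests(
    2, wal_buffer, {"t1-0": 0, "t2-1": 0, "t3-0": 0})
print(remote_steady)  # {0: ['t1-0', 't2-1', 't3-0']}

# t3-0 has moved to broker 3, but producers still send it to broker 2.
_, remote_moved = plan_commit_requests(
    2, wal_buffer, {"t1-0": 0, "t2-1": 0, "t3-0": 3})
print(remote_moved)  # {0: ['t1-0', 't2-1'], 3: ['t3-0']}
```

Once producers refresh their metadata, t3-0 stops arriving at broker 2 and the plan collapses back to a single remote request.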

All of this could be built with the metadata crafting mechanism alone (which 
Diskless needs anyway, in one way or another, to direct producers and 
consumers so as to avoid inter-AZ traffic), given the right policy for it (for 
example, some deterministic hash-based formula). That is, no explicit support 
for “produce representative” or anything like it is needed at the cluster 
level, in KRaft, etc.
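
As an illustration of what a deterministic hash-based policy could look like, here is a sketch in which every broker computes the same answer from the cluster metadata alone. The function names, the choice of SHA-256, and the fallback to the real leader are all assumptions. Note that a plain hash only approximates an even spread of representatives.

```python
import hashlib


def pick_representative(leader_id: int, az_brokers: list[int]) -> int:
    """Deterministically pick the broker in an AZ that represents a leader."""
    ordered = sorted(az_brokers)
    digest = hashlib.sha256(str(leader_id).encode()).digest()
    return ordered[int.from_bytes(digest[:8], "big") % len(ordered)]


def craft_metadata(leaders: dict[str, int], broker_az: dict[int, str],
                   client_az: str) -> dict[str, int]:
    """Return partition -> broker to advertise as the leader to this client."""
    local = [b for b, az in broker_az.items() if az == client_az]
    crafted = {}
    for partition, leader in leaders.items():
        if broker_az[leader] == client_az or not local:
            # Same AZ, or no broker in the client's AZ: use the real leader.
            crafted[partition] = leader
        else:
            crafted[partition] = pick_representative(leader, local)
    return crafted


leaders = {"t1-0": 0, "t2-1": 0, "t3-0": 0}
broker_az = {0: "az0", 1: "az1", 2: "az2"}
print(craft_metadata(leaders, broker_az, "az1"))
# {'t1-0': 1, 't2-1': 1, 't3-0': 1}
```

Because the choice depends only on the leader ID and the broker list of the AZ, no extra state has to be stored or replicated.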

> The same WAL file metadata is now duplicated into two places, partition
> leader and WAL File Manager. Which one is the source of truth, and how do
> we maintain consistency between the two places?

We do only two operations on WAL files that span multiple diskless partitions: 
committing and deleting. Commits can be done independently as described above. 
But deletes are different, because when a file is deleted, this affects all the 
partitions that still have live batches in this file (if any).

The WAL file manager is a necessary point of coordination to delete WAL files 
safely. We can say it is the source of truth about files themselves, while the 
partition leaders and their logs hold the truth about whether a particular file 
contains live batches of this particular partition.

The file manager has one important task: to be able to say for sure that a 
file does not contain any live batch of any existing partition. For this, it 
will have to periodically check with the partition leaders. Since batch 
deletion is irreversible, once we declare a file “empty”, it is guaranteed to 
stay so.
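
A minimal sketch of this check, with the RPC to partition leaders and the object storage delete replaced by plain callbacks. The class and method names are hypothetical, and details such as a grace period before the first check are left out.

```python
class WalFileManager:
    """Tracks WAL files and deletes them once no partition has live batches."""

    def __init__(self, has_live_batches, delete_file):
        # has_live_batches(partition, wal_file) -> bool stands in for an RPC
        # to the partition leader; delete_file(wal_file) for object storage.
        self._has_live_batches = has_live_batches
        self._delete_file = delete_file
        self._pending = {}  # wal_file -> partitions not yet confirmed empty

    def register(self, wal_file, partitions):
        self._pending.setdefault(wal_file, set()).update(partitions)

    def check_once(self):
        """One periodic pass; returns the files deleted in this pass."""
        deleted = []
        for wal_file, partitions in list(self._pending.items()):
            # Batch deletion is irreversible, so a partition confirmed empty
            # for this file never needs to be asked again.
            partitions -= {p for p in partitions
                           if not self._has_live_batches(p, wal_file)}
            if not partitions:
                self._delete_file(wal_file)
                del self._pending[wal_file]
                deleted.append(wal_file)
        return deleted


live = {("t1-0", "wal-a"), ("t2-1", "wal-a")}
removed = []
manager = WalFileManager(lambda p, f: (p, f) in live, removed.append)
manager.register("wal-a", ["t1-0", "t2-1"])
first = manager.check_once()    # both partitions still have live batches
live.clear()                    # e.g. batches offloaded to tiered storage
second = manager.check_once()   # now the file can go
print(first, second)  # [] ['wal-a']
```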

The file manager has to know about files being committed in order to start 
tracking them and periodically check whether they are empty. We can consider 
various ways to achieve this:
1. As was proposed in my previous message: best-effort commit by brokers + 
periodic prefix scans of object storage to detect files that slipped under the 
radar due to network issues or temporary unavailability of the file manager. 
We’re speaking about listing the file names only and opening only previously 
unknown files in order to find the partitions involved with them.
2. Only do scans without explicit commit, i.e. fill the list of files fully 
asynchronously and in the background. This may not be ideal due to the cost 
and performance of scanning tons of files. However, the number of live WAL 
files should be limited due to tiered storage offloading + we can optimize 
this if we give files some global soft order in their names.
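
The reconciliation step common to both options can be sketched as a set difference between the listing and the files the manager already knows. The function name, the file naming scheme, and the `start_after` checkpoint (which only makes sense if names carry some soft global order) are assumptions for illustration.

```python
from typing import Optional


def find_unknown_files(known_files: set[str], listed_files: list[str],
                       start_after: Optional[str] = None) -> list[str]:
    """Return listed WAL files the manager has not seen yet, in name order.

    If names are softly ordered, start_after lets the scan skip everything
    at or before an already-processed checkpoint name.
    """
    candidates = set(listed_files) - known_files
    if start_after is not None:
        candidates = {name for name in candidates if name > start_after}
    return sorted(candidates)


known = {"wal/0001", "wal/0002"}
listing = ["wal/0000", "wal/0001", "wal/0002", "wal/0003"]  # prefix scan result
print(find_unknown_files(known, listing))
# ['wal/0000', 'wal/0003']
print(find_unknown_files(known, listing, start_after="wal/0002"))
# ['wal/0003']
```

Only the files returned here would need to be opened to find the partitions involved in them.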

> I am not sure how this design simplifies the implementation. The existing
> producer/replication code can't be simply reused. Adjusting both the write
> path in the leader and the replication path in the follower to understand
> batch-header only data is quite intrusive to the existing logic.

It is true that we’ll have to change LocalLog and UnifiedLog in order to 
support these changes. However, it seems that idempotence, transactions, 
queues, tiered storage will have to be changed less than with the original 
design. This is because the partition leader state would remain in the same 
place (on brokers) and existing workflows that involve it would have to be 
changed less compared to the situation where we globalize the partition leader 
state in the batch coordinator. I admit this is hard to argue convincingly 
without having both real implementations at hand :) 

> I am also
> not sure how this enables seamless switching the topic modes between
> diskless and classic. Could you provide more details on those?

Let’s consider the scenario of turning a classic topic into diskless. The user 
sets diskless.enabled=true, the leader receives this metadata update and does 
the following:
1. Stop accepting normal append writes.
2. Close the current active segment.
3. Start a new segment that will be written in the diskless format (i.e. 
without data).
4. Start accepting diskless commits.

Since it’s the same log, the followers will know about that switch 
consistently. They will finish replicating the classic segments and start 
replicating the diskless ones. They will always know where each batch is 
located (either inside a classic segment or referenced by a diskless one). 
Switching back should be similar.
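
A toy model of such a mixed log, assuming that the format is a per-segment property and that switching modes simply rolls a new segment. The class names are hypothetical and unrelated to LocalLog or UnifiedLog.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Segment:
    base_offset: int
    diskless: bool  # True: entries hold WAL references instead of records


@dataclass
class PartitionLog:
    segments: list = field(default_factory=lambda: [Segment(0, False)])
    next_offset: int = 0

    def append(self, record_count: int) -> None:
        self.next_offset += record_count

    def switch_mode(self, diskless: bool) -> None:
        """Close the active segment and roll a new one in the other format."""
        if self.segments[-1].diskless != diskless:
            self.segments.append(Segment(self.next_offset, diskless))

    def locate(self, offset: int) -> Segment:
        """Find the segment (and hence the format) that holds an offset."""
        if not 0 <= offset < self.next_offset:
            raise KeyError(offset)
        bases = [s.base_offset for s in self.segments]
        return self.segments[bisect.bisect_right(bases, offset) - 1]


log = PartitionLog()
log.append(100)          # offsets 0..99 written in the classic format
log.switch_mode(True)    # diskless.enabled=true arrives
log.append(50)           # offsets 100..149 written as diskless entries
print(log.locate(99).diskless, log.locate(100).diskless)  # False True
```

Followers replicating this log see the same segment boundary at the same offset, which is why they learn about the switch consistently.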

Doing this with the coordinator is possible, but has some caveats. The leader 
must do the following:
1. Stop accepting normal append writes.
2. Close the current active segment.
3. Write a special control segment to persist and replicate the fact that from 
offset N the partition is now in the diskless mode.
4. Inform the coordinator about the first offset N of the “diskless era”.
5. Inform the controller quorum that the transition has finished and that 
brokers now can process diskless writes for this partition.
Any of these steps could fail, so this will probably require an explicit 
state machine replicated either in the partition log or in KRaft.
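
To show the kind of state machine meant here, the five steps can be modelled as persisted states, so that a leader recovering from a failure knows which steps are still outstanding. The state names are invented for this sketch and do not come from any KIP.

```python
from enum import Enum, auto


class Transition(Enum):
    CLASSIC = auto()
    WRITES_STOPPED = auto()           # step 1 done
    SEGMENT_CLOSED = auto()           # step 2 done
    CONTROL_SEGMENT_WRITTEN = auto()  # step 3 done
    COORDINATOR_INFORMED = auto()     # step 4 done
    DISKLESS = auto()                 # step 5 done, transition finished


STEPS = list(Transition)  # Enum iteration follows definition order


def advance(state: Transition) -> Transition:
    """Move to the next state; DISKLESS is terminal."""
    index = STEPS.index(state)
    return STEPS[min(index + 1, len(STEPS) - 1)]


def remaining(persisted: Transition) -> list:
    """States still to be reached after a crash, given the last persisted one."""
    return STEPS[STEPS.index(persisted) + 1:]


print([s.name for s in remaining(Transition.CONTROL_SEGMENT_WRITTEN)])
# ['COORDINATOR_INFORMED', 'DISKLESS']
```

Every state change would itself have to be persisted and replicated, which is the extra machinery that the coordinator-less variant avoids.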

It seems that the coordinator-less approach makes this simpler because the 
“coordinator” for the partition and the partition leader are the same and they 
store the partition metadata in the same log, too. In the coordinator 
approach, by contrast, we have to perform some kind of distributed commit to 
hand over metadata management from the classic partition leader to the batch 
coordinator.

I hope these explanations help to clarify the idea. Please let me know if I 
should go deeper anywhere.

Best,
Ivan and the Diskless team

On Tue, Oct 7, 2025, at 01:44, Jun Rao wrote:
> Hi, Ivan,
> 
> Thanks for the update.
> 
> I am not sure that I fully understand the new design, but it seems less
> clean than before.
> 
> Does each partition now have a metadata partition and a separate data
> partition? If so, I am concerned that it essentially doubles the number of
> partitions, which impacts the number of open file descriptors and the
> required IOPS, and so on. It also seems wasteful to have a separate
> partition just to store the metadata. It's as if we are creating an
> internal topic with an unbounded number of partitions. Are the metadata and
> the data for the same partition always collocated on the same broker? If
> so, how do we enforce that when replicas are reassigned?
> 
> The number of RPCs in the produce path is significantly higher. For
> example, if a produce request has 100 partitions, in a cluster with 100
> brokers, each produce request could generate 100 more RPC requests. This
> will significantly increase the request rate.
> 
> The same WAL file metadata is now duplicated into two places, partition
> leader and WAL File Manager. Which one is the source of truth, and how do
> we maintain consistency between the two places?
> 
> I am not sure how this design simplifies the implementation. The existing
> producer/replication code can't be simply reused. Adjusting both the write
> path in the leader and the replication path in the follower to understand
> batch-header only data is quite intrusive to the existing logic. I am also
> not sure how this enables seamless switching the topic modes between
> diskless and classic. Could you provide more details on those?
> 
> Jun
> 
> On Thu, Oct 2, 2025 at 5:08 AM Ivan Yurchenko <[email protected]> wrote:
> 
> > Hi dear Kafka community,
> >
> > In the initial Diskless proposal, we proposed to have a separate
> > component, batch/diskless coordinator, whose role would be to centrally
> > manage the batch and WAL file metadata for diskless topics. This component
> > drew many reasonable comments from the community about how it would support
> > various Kafka features (transactions, queues) and its scalability. While we
> > believe we have good answers to all the expressed concerns, we took a step
> > back and looked at the problem from a different perspective.
> >
> > We would like to propose an alternative Diskless design *without a
> > centralized coordinator*. We believe this approach has potential and
> > propose to discuss it as it may be more appealing to the community.
> >
> > Let us explain the idea. Most of the complications with the original
> > Diskless approach come from one necessary architecture change: globalizing
> > the local state of partition leader in the batch coordinator. This causes
> > deviations to the established workflows in various features like produce
> > idempotence and transactions, queues, retention, etc. These deviations need
> > to be carefully considered, designed, and later implemented and tested. In
> > the new approach we want to avoid this by making partition leaders again
> > responsible for managing their partitions, even in diskless topics.
> >
> > In classic Kafka topics, batch data and metadata are blended together in
> > the one partition log. The crux of the Diskless idea is to decouple them
> > and move data to the remote storage, while keeping metadata somewhere else.
> > Using the central batch coordinator for managing batch metadata is one way,
> > but not the only.
> >
> > Let’s now think about managing metadata for each user partition
> > independently. Generally partitions are independent and don’t share
> > anything apart from that their data are mixed in WAL files. If we figure
> > out how to commit and later delete WAL files safely, we will achieve the
> > necessary autonomy that allows us to get rid of the central batch
> > coordinator. Instead, *each diskless user partition will be managed by its
> > leader*, as in classic Kafka topics. Also like in classic topics, the
> > leader uses the partition log as the way to persist batch metadata, i.e.
> > the regular batch header + the information about how to find this batch on
> > remote storage. In contrast to classic topics, batch data is in remote
> > storage.
> >
> > For clarity, let’s compare the three designs:
> >  • Classic topics:
> >    • Data and metadata are co-located in the partition log.
> >    • The partition log content: [Batch header (metadata)|Batch data].
> >    • The partition log is replicated to the followers.
> >    • The replicas and leader have local state built from metadata.
> >  • Original Diskless:
> >    • Metadata is in the batch coordinator, data is on remote storage.
> >    • The partition state is global in the batch coordinator.
> >  • New Diskless:
> >    • Metadata is in the partition log, data is on remote storage.
> >    • Partition log content: [Batch header (metadata)|Batch coordinates on
> > remote storage].
> >    • The partition log is replicated to the followers.
> >    • The replicas and leader have local state built from metadata.
> >
> > Let’s consider the produce path. Here’s the reminder of the original
> > Diskless design:
> >
> >
> > The new approach could be depicted as the following:
> >
> >
> > As you can see, the main difference is that now instead of a single commit
> > request to the batch coordinator, we send multiple parallel commit requests
> > to all the leaders of each partition involved in the WAL file. Each of them
> > will commit its batches independently, without coordinating with other
> > leaders and any other components. Batch data is addressed by the WAL file
> > name, the byte offset and size, which allows partitions to know nothing
> > about other partitions to access their data in shared WAL files.
> >
> > The number of partitions involved in a single WAL file may be quite large,
> > e.g. a hundred. A hundred network requests to commit one WAL file is very
> > impractical. However, there are ways to reduce this number:
> >  1. Partition leaders are located on brokers. Requests to leaders on one
> > broker could be grouped together into a single physical network request
> > (resembling the normal Produce request that may carry batches for many
> > partitions inside). This will cap the number of network requests to the
> > number of brokers in the cluster.
> >  2. If we craft the cluster metadata to make producers send their requests
> > to the right brokers (with respect to AZs), we may achieve the higher
> > concentration of logical commit requests in physical network requests
> > reducing the number of the latter ones even further, ideally to one.
> >
> > Obviously, out of multiple commit requests some may fail or time out for a
> > variety of reasons. This is fine. Some producers will receive totally or
> > partially failed responses to their Produce requests, similar to what they
> > would have received when appending to a classic topic fails or times out.
> > If a partition experiences problems, other partitions will not be affected
> > (again, like in classic topics). Of course, the uncommitted data will be
> > garbage in WAL files. But WAL files are short-lived (batches are constantly
> > assembled into segments and offloaded to tiered storage), so this garbage
> > will be eventually deleted.
> >
> > For safely deleting WAL files we now need to centrally manage them, as
> > this is the only state and logic that spans multiple partitions. On the
> > diagram, you can see another commit request called “Commit file (best
> > effort)” going to the WAL File Manager. This manager will be responsible
> > for the following:
> >  1. Collecting (by requests from brokers) and persisting information about
> > committed WAL files.
> >  2. To handle potential failures in file information delivery, it will be
> > doing prefix scan on the remote storage periodically to find and register
> > unknown files. The period of this scan will be configurable and ideally
> > should be quite long.
> >  3. Checking with the relevant partition leaders (after a grace period) if
> > they still have batches in a particular file.
> >  4. Physically deleting files when they aren’t anymore referred to by any
> > partition.
> >
> > This new design offers the following advantages:
> >  1. It simplifies the implementation of many Kafka features such as
> > idempotence, transactions, queues, tiered storage, retention. Now we don’t
> > need to abstract away and reuse the code from partition leaders in the
> > batch coordinator. Instead, we will literally use the same code paths in
> > leaders, with little adaptation. Workflows from classic topics mostly
> > remain unchanged.
> > For example, it seems that
> > ReplicaManager.maybeSendPartitionsToTransactionCoordinator  and
> > KafkaApis.handleWriteTxnMarkersRequest used for transaction support on the
> > partition leader side could be used for diskless topics with little
> > adaptation. ProducerStateManager, needed for both idempotent produce and
> > transactions, would be reused.
> > Another example is share groups support, where the share partition leader,
> > being co-located with the partition leader, would execute the same logic
> > for both diskless and classic topics.
> >  2. It returns to the familiar partition-based scaling model, where
> > partitions are independent.
> >  3. It makes the operation and failure patterns closer to the familiar
> > ones from classic topics.
> >  4. It opens a straightforward path to seamless switching the topics modes
> > between diskless and classic.
> >
> > The rest of the things remain unchanged compared to the previous Diskless
> > design (after all previous discussions). Such things as local segment
> > materialization by replicas, the consume path, tiered storage integration,
> > etc.
> >
> > If the community finds this design more suitable, we will update the
> > KIP(s) accordingly and continue working on it. Please let us know what you
> > think.
> >
> > Best regards,
> > Ivan and Diskless team
> >
> > On Mon, Sep 29, 2025, at 15:06, Ivan Yurchenko wrote:
> > > Hi Justine,
> > >
> > > Yes, you're right. We need to track the aborted transactions in the
> > diskless coordinator for as long as the corresponding offsets are there.
> > With the tiered storage unification Greg mentioned earlier, this will be
> > finite time even for infinite data retention.
> > >
> > > Best,
> > > Ivan
> > >
> > > On Wed, Sep 17, 2025, at 19:41, Justine Olshan wrote:
> > > > Hey Ivan,
> > > >
> > > > Thanks for the response. I think most of what you said made sense, but
> > I
> > > > did have some questions about this part:
> > > >
> > > > > As we understand this, the partition leader in classic topics forgets
> > > > about a transaction once it’s replicated (HWM overpasses it). The
> > > > transaction coordinator acts like the main guardian, allowing partition
> > > > leaders to do this safely. Please correct me if this is wrong. We think
> > > > about relying on this with the batch coordinator and delete the
> > information
> > > > about a transaction once it’s finished (as there’s no replication and
> > HWM
> > > > advances immediately).
> > > >
> > > > I didn't quite understand this. In classic topics, we have maps for
> > ongoing
> > > > transactions which remove state when the transaction is completed and
> > an
> > > > aborted transactions index which is retained for much longer. Once the
> > > > transaction is completed, the coordinator is no longer involved in
> > > > maintaining this partition side state, and it is subject to compaction
> > etc.
> > > > Looking back at the outline provided above, I didn't see much about the
> > > > fetch path, so maybe that could be expanded a bit further. I saw the
> > > > following in a response:
> > > > > When the broker constructs a fully valid local segment, all the
> > necessary
> > > > control batches will be inserted and indices, including the transaction
> > > > index will be built to serve FetchRequests exactly as they are today.
> > > >
> > > > Based on this, it seems like we need to retain the information about
> > > > aborted txns for longer.
> > > >
> > > > Thanks,
> > > > Justine
> > > >
> > > > On Mon, Sep 15, 2025 at 9:43 AM Ivan Yurchenko <[email protected]> wrote:
> > > >
> > > > > Hi Justine and all,
> > > > >
> > > > > Thank you for your questions!
> > > > >
> > > > > > JO 1. >Since a transaction could be uniquely identified with
> > producer ID
> > > > > > and epoch, the positive result of this check could be cached
> > locally
> > > > > > Are we saying that only new transaction version 2 transactions can
> > be
> > > > > used
> > > > > > here? If not, we can't uniquely identify transactions with
> > producer id +
> > > > > > epoch
> > > > >
> > > > > You’re right that we (probably unintentionally) focused only on
> > version 2.
> > > > > We can either limit the support to version 2 or consider using some
> > > > > surrogates to support version 1.
> > > > >
> > > > > > JO 2. >The batch coordinator does the final transactional checks
> > of the
> > > > > > batches. This procedure would output the same errors like the
> > partition
> > > > > > leader in classic topics would do.
> > > > > > Can you expand on what these checks are? Would you be checking if
> > the
> > > > > > transaction was still ongoing for example?
> > > > >
> > > > > Yes, the producer epoch, that the transaction is ongoing, and of
> > course
> > > > > the normal idempotence checks. What the partition leader in the
> > classic
> > > > > topics does before appending a batch to the local log (e.g. in
> > > > > UnifiedLog.maybeStartTransactionVerification and
> > > > > UnifiedLog.analyzeAndValidateProducerState). In Diskless, we
> > unfortunately
> > > > > cannot do these checks before appending the data to the WAL segment
> > and
> > > > > uploading it, but we can “tombstone” these batches in the batch
> > coordinator
> > > > > during the final commit.
> > > > >
> > > > > > Is there state about ongoing
> > > > > > transactions in the batch coordinator? I see some other state
> > mentioned
> > > > > in
> > > > > > the End transaction section, but it's not super clear what state is
> > > > > stored
> > > > > > and when it is stored.
> > > > >
> > > > > Right, this should have been more explicit. As the partition leader
> > tracks
> > > > > ongoing transactions for classic topics, the batch coordinator has
> > to as
> > > > > well. So when a transaction starts and ends, the transaction
> > coordinator
> > > > > must inform the batch coordinator about this.
> > > > >
> > > > > > JO 3. I didn't see anything about maintaining LSO -- perhaps that
> > would
> > > > > be
> > > > > > stored in the batch coordinator?
> > > > >
> > > > > Yes. This could be deduced from the committed batches and other
> > > > > information, but for the sake of performance we’d better store it
> > > > > explicitly.
> > > > >
> > > > > > JO 4. Are there any thoughts about how long transactional state is
> > > > > > maintained in the batch coordinator and how it will be cleaned up?
> > > > >
> > > > > As we understand this, the partition leader in classic topics forgets
> > > > > about a transaction once it’s replicated (HWM overpasses it). The
> > > > > transaction coordinator acts like the main guardian, allowing
> > partition
> > > > > leaders to do this safely. Please correct me if this is wrong. We
> > think
> > > > > about relying on this with the batch coordinator and delete the
> > information
> > > > > about a transaction once it’s finished (as there’s no replication
> > and HWM
> > > > > advances immediately).
> > > > >
> > > > > Best,
> > > > > Ivan
> > > > >
> > > > > On Tue, Sep 9, 2025, at 00:38, Justine Olshan wrote:
> > > > > > Hey folks,
> > > > > >
> > > > > > Excited to see some updates related to transactions!
> > > > > >
> > > > > > I had a few questions.
> > > > > >
> > > > > > JO 1. >Since a transaction could be uniquely identified with
> > producer ID
> > > > > > and epoch, the positive result of this check could be cached
> > locally
> > > > > > Are we saying that only new transaction version 2 transactions can
> > be
> > > > > used
> > > > > > here? If not, we can't uniquely identify transactions with
> > producer id +
> > > > > > epoch
> > > > > >
> > > > > > JO 2. >The batch coordinator does the final transactional checks
> > of the
> > > > > > batches. This procedure would output the same errors like the
> > partition
> > > > > > leader in classic topics would do.
> > > > > > Can you expand on what these checks are? Would you be checking if
> > the
> > > > > > transaction was still ongoing for example? Is there state about
> > ongoing
> > > > > > transactions in the batch coordinator? I see some other state
> > mentioned
> > > > > in
> > > > > > the End transaction section, but it's not super clear what state is
> > > > > stored
> > > > > > and when it is stored.
> > > > > >
> > > > > > JO 3. I didn't see anything about maintaining LSO -- perhaps that
> > would
> > > > > be
> > > > > > stored in the batch coordinator?
> > > > > >
> > > > > > JO 4. Are there any thoughts about how long transactional state is
> > > > > > maintained in the batch coordinator and how it will be cleaned up?
> > > > > >
> > > > > > On Mon, Sep 8, 2025 at 10:38 AM Jun Rao <[email protected]>
> > > > > wrote:
> > > > > >
> > > > > > > Hi, Greg and Ivan,
> > > > > > >
> > > > > > > Thanks for the update. A few comments.
> > > > > > >
> > > > > > > JR 10. "Consumer fetches are now served from local segments,
> > making
> > > > > use of
> > > > > > > the
> > > > > > > indexes, page cache, request purgatory, and zero-copy
> > functionality
> > > > > already
> > > > > > > built into classic topics."
> > > > > > > JR 10.1 Does the broker build the producer state for each
> > partition in
> > > > > > > diskless topics?
> > > > > > > JR 10.2 For transactional data, the consumer fetches need to know
> > > > > aborted
> > > > > > > records. How is that achieved?
> > > > > > >
> > > > > > > JR 11. "The batch coordinator saves that the transaction is
> > finished
> > > > > and
> > > > > > > also inserts the control batches in the corresponding logs of the
> > > > > involved
> > > > > > > Diskless topics. This happens only on the metadata level, no
> > actual
> > > > > control
> > > > > > > batches are written to any file. "
> > > > > > > A fetch response could include multiple transactional batches.
> > How
> > > > > does the
> > > > > > > broker obtain the information about the ending control batch for
> > each
> > > > > > > batch? Does that mean that a fetch response needs to be built by
> > > > > > > stitching record batches and generated control batches together?
> > > > > > >
> > > > > > > JR 12. Queues: Is there still a share partition leader that all
> > > > > consumers
> > > > > > > are routed to?
> > > > > > >
> > > > > > > JR 13. "Should the KIPs be modified to include this or it's too
> > > > > > > implementation-focused?" It would be useful to include enough
> > details
> > > > > to
> > > > > > > understand correctness and performance impact.
> > > > > > >
> > > > > > > HC5. Henry has a valid point. Requests from a given producer
> > contain a
> > > > > > > sequence number, which is ordered. If a producer sends every
> > Produce
> > > > > > > request to an arbitrary broker, those requests could reach the
> > batch
> > > > > > > coordinator in different order and lead to rejection of the
> > produce
> > > > > > > requests.
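
To make the ordering concern in HC5 concrete, here is a minimal sketch. It assumes a simplified rule (a batch is accepted only if its base sequence equals the next expected one for that producer ID and epoch). `SequenceChecker` and `commit_in_order` are hypothetical names for illustration, not Kafka classes.

```python
# Hypothetical, simplified model of per-producer sequence validation.
class SequenceChecker:
    """Tracks the next expected sequence number per (producer_id, epoch)."""

    def __init__(self):
        self._next_seq = {}

    def try_append(self, producer_id, epoch, base_seq, record_count):
        key = (producer_id, epoch)
        expected = self._next_seq.get(key, 0)
        if base_seq != expected:
            # The batch arrived ahead of (or behind) the expected sequence.
            return "OUT_OF_ORDER_SEQUENCE"
        self._next_seq[key] = expected + record_count
        return "OK"


def commit_in_order(checker, arrivals):
    """Apply batches in the order they reach the coordinator."""
    return [checker.try_append(*batch) for batch in arrivals]


# Batches are (producer_id, epoch, base_seq, record_count).
# The producer sent base_seq 0 first, then base_seq 5.
in_order = commit_in_order(SequenceChecker(), [(1, 0, 0, 5), (1, 0, 5, 5)])
# Same batches, but routed through different brokers and reordered on the way.
reordered = commit_in_order(SequenceChecker(), [(1, 0, 5, 5), (1, 0, 0, 5)])
```

In the reordered case the batch with base sequence 5 reaches the checker first and is rejected, even though the producer sent both batches in the right order.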
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Thu, Sep 4, 2025 at 12:00 AM Ivan Yurchenko <[email protected]>
> > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > > We have also thought in a bit more detail about transactions and
> > > > > > > > > queues; here's the plan.
> > > > > > > >
> > > > > > > > *Transactions*
> > > > > > > >
> > > > > > > > The support for transactions in *classic topics* is based on
> > precise
> > > > > > > > interactions between three actors: clients (mostly producers,
> > but
> > > > > also
> > > > > > > > consumers), brokers (ReplicaManager and other classes), and
> > > > > transaction
> > > > > > > > coordinators. Brokers also run partition leaders with their
> > local
> > > > > state
> > > > > > > > (ProducerStateManager and others).
> > > > > > > >
> > > > > > > > The high level (some details skipped) workflow is the
> > following.
> > > > > When a
> > > > > > > > transactional Produce request is received by the broker:
> > > > > > > > 1. For each partition, the partition leader checks if a
> > non-empty
> > > > > > > > transaction is running for this partition. This is done using
> > its
> > > > > local
> > > > > > > > state derived from the log metadata (ProducerStateManager,
> > > > > > > > VerificationStateEntry, VerificationGuard).
> > > > > > > > 2. The transaction coordinator is informed about all the
> > partitions
> > > > > that
> > > > > > > > aren’t part of the transaction to include them.
> > > > > > > > 3. The partition leaders do additional transactional checks.
> > > > > > > > 4. The partition leaders append the transactional data to
> > their logs
> > > > > and
> > > > > > > > update some of their state (for example, log the fact that the
> > > > > > > transaction
> > > > > > > > is running for the partition and its first offset).
> > > > > > > >
> > > > > > > > When the transaction is committed or aborted:
> > > > > > > > 1. The producer contacts the transaction coordinator directly
> > with
> > > > > > > > EndTxnRequest.
> > > > > > > > 2. The transaction coordinator writes PREPARE_COMMIT or
> > > > > PREPARE_ABORT to
> > > > > > > > its log and responds to the producer.
> > > > > > > > 3. The transaction coordinator sends WriteTxnMarkersRequest to
> > the
> > > > > > > leaders
> > > > > > > > of the involved partitions.
> > > > > > > > 4. The partition leaders write the transaction markers to
> > their logs
> > > > > and
> > > > > > > > respond to the coordinator.
> > > > > > > > 5. The coordinator writes the final transaction state
> > > > > COMPLETE_COMMIT or
> > > > > > > > COMPLETE_ABORT.
> > > > > > > >
> > > > > > > > In classic topics, partitions have leaders and lots of
> > important
> > > > > state
> > > > > > > > necessary for supporting this workflow is local. The main
> > challenge
> > > > > in
> > > > > > > > mapping this to Diskless comes from the fact there are no
> > partition
> > > > > > > > leaders, so the corresponding pieces of state need to be
> > globalized
> > > > > in
> > > > > > > the
> > > > > > > > batch coordinator. We are already doing this to support
> > idempotent
> > > > > > > produce.
> > > > > > > >
> > > > > > > > The high level workflow for *diskless topics* would look very
> > > > > similar:
> > > > > > > > > 1. For each partition, the broker checks if a non-empty transaction
> > > > > > > > > is running for this partition. In contrast to classic topics, this is
> > > > > > > > > checked against the batch coordinator with a single RPC. Since a
> > > > > > > > > transaction can be uniquely identified by producer ID and epoch, a
> > > > > > > > > positive result of this check could be cached locally (for example,
> > > > > > > > > for twice the configured duration of a transaction).
> > > > > > > > 2. The same: The transaction coordinator is informed about all
> > the
> > > > > > > > partitions that aren’t part of the transaction to include them.
> > > > > > > > 3. No transactional checks are done on the broker side.
> > > > > > > > 4. The broker appends the transactional data to the current
> > shared
> > > > > WAL
> > > > > > > > segment. It doesn’t update any transaction-related state for
> > Diskless
> > > > > > > > topics, because it doesn’t have any.
> > > > > > > > 5. The WAL segment is committed to the batch coordinator like
> > in the
> > > > > > > > normal produce flow.
> > > > > > > > > 6. The batch coordinator does the final transactional checks of the
> > > > > > > > > batches. This procedure would return the same errors as the partition
> > > > > > > > > leader in classic topics would, i.e. some batches could be rejected.
> > > > > > > > > This means there may be garbage in the WAL segment file in case of
> > > > > > > > > transactional errors. This is preferable to doing more network round
> > > > > > > > > trips, especially considering that the WAL segments will be relatively
> > > > > > > > > short-lived (see Greg's update above).
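
The local caching mentioned in step 1 of the diskless produce flow could look roughly like the sketch below. It is only an illustration under stated assumptions: `TxnCheckCache`, the `rpc` callable, and the `max_txn_duration_s` parameter are hypothetical names, and the expiry of twice the transaction duration is just the example value mentioned in that step.

```python
import time


class TxnCheckCache:
    """Caches positive 'transaction is running' answers from the batch coordinator.

    Hypothetical sketch: `rpc` stands in for the single RPC to the batch
    coordinator and returns True if the transaction is running for the partition.
    """

    def __init__(self, rpc, max_txn_duration_s, clock=time.monotonic):
        self._rpc = rpc
        self._ttl = 2 * max_txn_duration_s  # example expiry from step 1
        self._clock = clock
        self._expires_at = {}  # (producer_id, epoch, partition) -> expiry time

    def is_txn_running(self, producer_id, epoch, partition):
        key = (producer_id, epoch, partition)
        now = self._clock()
        expires = self._expires_at.get(key)
        if expires is not None and now < expires:
            return True  # cached positive result, no round trip needed
        running = self._rpc(producer_id, epoch, partition)
        if running:
            # Only positive results are cached; negative ones are re-checked.
            self._expires_at[key] = now + self._ttl
        return running
```

Only positive answers are cached, so a partition that has not yet joined the transaction is always re-checked against the coordinator.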
> > > > > > > >
> > > > > > > > When the transaction is committed or aborted:
> > > > > > > > 1. The producer contacts the transaction coordinator directly
> > with
> > > > > > > > EndTxnRequest.
> > > > > > > > 2. The transaction coordinator writes PREPARE_COMMIT or
> > > > > PREPARE_ABORT to
> > > > > > > > its log and responds to the producer.
> > > > > > > > 3. *[NEW]* The transaction coordinator informs the batch
> > coordinator
> > > > > that
> > > > > > > > the transaction is finished.
> > > > > > > > 4. *[NEW]* The batch coordinator saves that the transaction is
> > > > > finished
> > > > > > > > and also inserts the control batches in the corresponding logs
> > of the
> > > > > > > > involved Diskless topics. This happens only on the metadata
> > level, no
> > > > > > > > actual control batches are written to any file. They will be
> > > > > dynamically
> > > > > > > > created on Fetch and other read operations. We could
> > technically
> > > > > write
> > > > > > > > these control batches for real, but this would mean extra
> > produce
> > > > > > > latency,
> > > > > > > > so it's better just to mark them in the batch coordinator and
> > save
> > > > > these
> > > > > > > > milliseconds.
> > > > > > > > > 5. The transaction coordinator sends WriteTxnMarkersRequest to the
> > > > > > > > > leaders of the involved partitions, now only for classic topics.
> > > > > > > > 6. The partition leaders of classic topics write the
> > transaction
> > > > > markers
> > > > > > > > to their logs and respond to the coordinator.
> > > > > > > > 7. The coordinator writes the final transaction state
> > > > > COMPLETE_COMMIT or
> > > > > > > > COMPLETE_ABORT.
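
Steps 3 and 4 of the end-of-transaction flow above (markers kept as metadata only and generated on read) can be sketched as a toy in-memory model. Everything here is an assumption for illustration: `BatchCoordinator`, `commit_batch`, `end_txn`, and `read_log` are hypothetical names, and one offset per log entry is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class BatchCoordinator:
    """Toy in-memory model; every name here is hypothetical."""
    logs: dict = field(default_factory=dict)       # partition -> list of metadata entries
    open_txns: dict = field(default_factory=dict)  # (producer_id, epoch) -> set of partitions

    def commit_batch(self, partition, producer_id, epoch, wal_file, byte_offset):
        # Data batches are recorded as references into a shared WAL file.
        self.logs.setdefault(partition, []).append(
            {"type": "data", "producer_id": producer_id,
             "wal_file": wal_file, "byte_offset": byte_offset})
        self.open_txns.setdefault((producer_id, epoch), set()).add(partition)

    def end_txn(self, producer_id, epoch, committed):
        # Mark the transaction finished and insert a marker entry into each
        # involved partition log. Nothing is written to any WAL file.
        marker = "COMMIT" if committed else "ABORT"
        for partition in sorted(self.open_txns.pop((producer_id, epoch), set())):
            self.logs[partition].append(
                {"type": "control", "producer_id": producer_id, "marker": marker})


def read_log(coordinator, partition):
    """Render a partition log; control batches are generated on the fly."""
    out = []
    for offset, entry in enumerate(coordinator.logs.get(partition, [])):
        if entry["type"] == "data":
            out.append((offset, f"data@{entry['wal_file']}:{entry['byte_offset']}"))
        else:
            out.append((offset, f"control:{entry['marker']}"))
    return out
```

For example, committing one batch to `topic-0` and then calling `end_txn(..., committed=True)` makes `read_log` return the data reference followed by a generated `control:COMMIT` entry, without any control batch having been stored in a WAL file.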
> > > > > > > >
> > > > > > > > Compared to the non-transactional produce flow, we get:
> > > > > > > > > 1. An extra network round trip between brokers and the batch
> > > > > > > > > coordinator when a new partition appears in the transaction. To
> > > > > > > > > mitigate the impact of these round trips:
> > > > > > > >   - The results will be cached.
> > > > > > > >   - The calls for multiple partitions in one Produce request
> > will be
> > > > > > > > grouped.
> > > > > > > >   - The batch coordinator should be optimized for fast
> > response to
> > > > > these
> > > > > > > > RPCs.
> > > > > > > >   - The fact that a single producer normally will communicate
> > with a
> > > > > > > > single broker for the duration of the transaction further
> > reduces the
> > > > > > > > expected number of round trips.
> > > > > > > > 2. An extra round trip between the transaction coordinator and
> > batch
> > > > > > > > coordinator when a transaction is finished.
> > > > > > > >
> > > > > > > > With this proposal, transactions will also be able to span both
> > > > > classic
> > > > > > > > and Diskless topics.
> > > > > > > >
> > > > > > > > *Queues*
> > > > > > > >
> > > > > > > > > The share group coordination and management is a side job that doesn't
> > > > > > > > > interfere with the topic itself (leadership, replicas, physical storage of
> > > > > > > > > records, etc.) or with non-queue producers and consumers (Fetch and Produce
> > > > > > > > > RPCs and consumer group-related RPCs are not affected). We don't see any
> > > > > > > > > reason why we can't make Diskless topics compatible with share groups the
> > > > > > > > > same way as classic topics are. Even on the code level, we don't expect any
> > > > > > > > > serious refactoring: the same reading routines are used that are used for
> > > > > > > > > fetching (e.g. ReplicaManager.readFromLog).
> > > > > > > >
> > > > > > > >
> > > > > > > > Should the KIPs be modified to include this or it's too
> > > > > > > > implementation-focused?
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Ivan
> > > > > > > >
> > > > > > > > On Wed, Sep 3, 2025, at 21:59, Greg Harris wrote:
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Thank you all for your questions and design input on
> > KIP-1150.
> > > > > > > > >
> > > > > > > > > We have just updated KIP-1150 and KIP-1163 with a new
> > design. To
> > > > > > > > summarize
> > > > > > > > > the changes:
> > > > > > > > >
> > > > > > > > > 1. The design prioritizes integrating with the existing
> > KIP-405
> > > > > Tiered
> > > > > > > > > Storage interfaces, permitting data produced to a Diskless
> > topic
> > > > > to be
> > > > > > > > > moved to tiered storage.
> > > > > > > > > This lowers the scalability requirements for the Batch
> > Coordinator
> > > > > > > > > component, and allows Diskless to compose with Tiered Storage
> > > > > plugin
> > > > > > > > > features such as encryption and alternative data formats.
> > > > > > > > >
> > > > > > > > > 2. Consumer fetches are now served from local segments,
> > making use
> > > > > of
> > > > > > > the
> > > > > > > > > indexes, page cache, request purgatory, and zero-copy
> > functionality
> > > > > > > > already
> > > > > > > > > built into classic topics.
> > > > > > > > > However, local segments are now considered cache elements,
> > do not
> > > > > need
> > > > > > > to
> > > > > > > > > be durably stored, and can be built without contacting any
> > other
> > > > > > > > replicas.
> > > > > > > > >
> > > > > > > > > 3. The design has been simplified substantially, by removing
> > the
> > > > > > > previous
> > > > > > > > > Diskless consume flow, distributed cache component, and
> > "object
> > > > > > > > > compaction/merging" step.
> > > > > > > > >
> > > > > > > > > The design maintains leaderless produces as enabled by the
> > Batch
> > > > > > > > > Coordinator, and the same latency profiles as the earlier
> > design,
> > > > > while
> > > > > > > > > being simpler and integrating better into the existing
> > ecosystem.
> > > > > > > > >
> > > > > > > > > Thanks, and we are eager to hear your feedback on the new
> > design.
> > > > > > > > > Greg Harris
> > > > > > > > >
> > > > > > > > > On Mon, Jul 21, 2025 at 3:30 PM Jun Rao
> > <[email protected]>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Jan,
> > > > > > > > > >
> > > > > > > > > > For me, the main gap of KIP-1150 is the support of all
> > existing
> > > > > > > client
> > > > > > > > > > APIs. Currently, there is no design for supporting APIs
> > like
> > > > > > > > transactions
> > > > > > > > > > and queues.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski
> > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Would it be a good time to ask for the current status of
> > this
> > > > > KIP?
> > > > > > > I
> > > > > > > > > > > haven't seen much activity here for the past 2 months,
> > the
> > > > > vote got
> > > > > > > > > > vetoed
> > > > > > > > > > > but I think the pending questions have been answered
> > since
> > > > > then.
> > > > > > > > KIP-1183
> > > > > > > > > > > (AutoMQ's proposal) also didn't have any activity since
> > May.
> > > > > > > > > > >
> > > > > > > > > > > In my eyes KIP-1150 and KIP-1183 are two real choices
> > that can
> > > > > be
> > > > > > > > > > > made, with a coordinator-based approach being by far the
> > > > > dominant
> > > > > > > one
> > > > > > > > > > when
> > > > > > > > > > > it comes to market adoption - but all these are
> > standalone
> > > > > > > products.
> > > > > > > > > > >
> > > > > > > > > > > I'm a big fan of both approaches, but would hate to see a
> > > > > stall. So
> > > > > > > > the
> > > > > > > > > > > question is: can we get an update?
> > > > > > > > > > >
> > > > > > > > > > > > Maybe it's time to start another vote? Colin McCabe - have your
> > > > > > > > > > > > questions been answered? If not, is there anything I can do to help?
> > > > > > > > > > > > I'm deeply familiar with both architectures and have written about both.
> > > > > > > > > > >
> > > > > > > > > > > Kind regards,
> > > > > > > > > > > Jan
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jun 24, 2025 at 10:42 AM Stanislav Kozlovski <
> > > > > > > > > > > [email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I have some nits - it may be useful to
> > > > > > > > > > > >
> > > > > > > > > > > > a) group all the KIP email threads in the main one
> > (just a
> > > > > bunch
> > > > > > > of
> > > > > > > > > > links
> > > > > > > > > > > > to everything)
> > > > > > > > > > > > b) create the email threads
> > > > > > > > > > > >
> > > > > > > > > > > > > It's a bit hard to track it all - for example, I was searching for a
> > > > > > > > > > > > > discuss thread for KIP-1165 for a while; as far as I can tell, it
> > > > > > > > > > > > > doesn't exist yet.
> > > > > > > > > > > >
> > > > > > > > > > > > > Since the KIPs are published (by virtue of having the root KIP be
> > > > > > > > > > > > > published, having a DISCUSS thread and links to the sub-KIPs that the
> > > > > > > > > > > > > discussion was meant to move towards), I think it would be good to
> > > > > > > > > > > > > create DISCUSS threads for them all.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Stan
> > > > > > > > > > > >
> > > > > > > > > > > > On 2025/04/16 11:58:22 Josep Prat wrote:
> > > > > > > > > > > > > Hi Kafka Devs!
> > > > > > > > > > > > >
> > > > > > > > > > > > > We want to start a new KIP discussion about
> > introducing a
> > > > > new
> > > > > > > > type of
> > > > > > > > > > > > > topics that would make use of Object Storage as the
> > primary
> > > > > > > > source of
> > > > > > > > > > > > > storage. However, as this KIP is big we decided to
> > split it
> > > > > > > into
> > > > > > > > > > > multiple
> > > > > > > > > > > > > related KIPs.
> > > > > > > > > > > > > > We have the motivational KIP-1150 (
> > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > > > > > > > > > > > > > )
> > > > > > > > > > > > > > that aims to discuss if Apache Kafka should aim to have this type of
> > > > > > > > > > > > > > feature at all. This KIP doesn't go into details on how to implement it.
> > > > > > > > > > > > > > This follows the same approach used when we discussed KRaft.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > But as we know that it is sometimes really hard to discuss on that
> > > > > > > > > > > > > > meta level, we also created several sub-KIPs (linked in KIP-1150) that
> > > > > > > > > > > > > > offer an implementation of this feature.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We kindly ask you to use the proper DISCUSS threads
> > for
> > > > > each
> > > > > > > > type of
> > > > > > > > > > > > > concern and keep this one to discuss whether Apache
> > Kafka
> > > > > wants
> > > > > > > > to
> > > > > > > > > > have
> > > > > > > > > > > > > this feature or not.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks in advance on behalf of all the authors of
> > this KIP.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ------------------
> > > > > > > > > > > > > Josep Prat
> > > > > > > > > > > > > Open Source Engineering Director, Aiven
> > > > > > > > > > > > > [email protected]   |   +491715557497 | aiven.io
> > > > > > > > > > > > > Aiven Deutschland GmbH
> > > > > > > > > > > > > Alexanderufer 3-7, 10117 Berlin
> > > > > > > > > > > > > Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
> > > > > > > > > > > > > Anna Richardson, Kenneth Chen
> > > > > > > > > > > > > Amtsgericht Charlottenburg, HRB 209739 B
