Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Nick Telford Tue, 21 Apr 2026 07:52:34 -0700

BB3b: Thinking a bit more about the cache vs. transaction buffer - what if
we make a subtle change to the cache: when enable.transactional.statestores
= true, we no longer cache writes (put/delete), only reads. Since the
transaction buffer will already be storing recently written records (at
least until commit), we don't need to keep them in another cache. We would
still cache records read *from* the transaction buffer (because we would
take the read of a recently written record as a signal that it's worth
caching). The only caveat to this would be that the transaction buffer
flushes recently written records on-commit, so cache hit-rate of recently
written records will dip during commits.


What do you think?

On Tue, 21 Apr 2026 at 10:39, Nick Telford <[email protected]> wrote:

> Hi Bill,
>
> Thanks for reviewing the updated KIP!
>
> BB1: I'm not quite sure I understand what you're saying here. We need to
> take a copy/snapshot of the read buffer because the read buffer is emptied
> on-commit, and any Iterator created on a ConcurrentSkipListMap
> automatically reflects those changes, so would terminate iteration
> prematurely. Also, any new writes to the buffer (put/delete) would be
> immediately reflected in the iterator, making its results inconsistent.
> TL;DR: copying it gives us full snapshot isolation for IQ iterators.
> FWIW, the implementation already takes an Iterator over the underlying
> store, as well as the buffer, and merges the two iterators together. But
> the read buffer's Iterator must be over a copy of the read buffer, to
> provide a consistent snapshot of the store.
>
> BB2: Basically, the RocksDBTransactionBuffer will maintain an extra "write
> buffer", which uses a RocksDB WriteBatch to stage writes during
> put/delete. This yields better throughput than iterating the
> ConcurrentSkipListMap and writing to a WriteBatch during commit, and
> reduces the time that IQ threads (that need a readLock on the
> ConcurrentSkipListMap) are blocked from creating iterators during commit.
>
> InMemoryTransactionBuffers do not need this extra "write buffer", as there
> is no more efficient implementation for them to use. This means that, vs.
> RocksDB, InMemoryStores will actually use less memory buffering records.
>
> BB3: This is a good question, and one that I've not really dug too deeply
> into. I suspect that there's a level of duplication of concerns here, but
> there are subtle differences: the cache is an LRU cache, which can cache
> records across multiple commits (if they're frequently read, for example),
> whereas the transaction buffer always exclusively contains uncommitted
> writes.
>
> In terms of correctness/consistency, I don't think we have a problem. For
> READ_COMMITTED IQ, we skip both the cache and transaction buffers. For
> READ_UNCOMMITTED IQ (and StreamThread reads) we read in the order dictated
> by the StateStore layering, which is: cache first, then transaction buffer,
> then underlying store. Since writes always update the cache, this is
> correct. The one wrinkle is that cached writes are only written to
> downstream store layers (and therefore buffered by the transaction buffer)
> when the entry is evicted, or the cache is flushed or committed. This
> essentially double-buffers writes, which is a bit wasteful of memory.
>
> I don't really know much about the cache, but as far as I can tell it
> might eliminate some (de)serialization overhead? Regardless, given that it
> implements a different behaviour (LRU vs. buffering uncommitted writes),
> eliminating it might cause some subtle performance regressions for *some*
> users.
>
> Maybe instead we have the *default* cache size change to 0 when
> enable.transactional.statestores = true and
> statestore.uncommitted.max.bytes > 0? That would reduce memory usage for
> users that aren't explicitly relying on the cache.
>
> BB4: Ahh yes, good catch! This is showing how old this KIP is :-)
>
> Regards,
> Nick
>
> On Mon, 20 Apr 2026 at 21:11, Bill Bejeck <[email protected]> wrote:
>
>> Hi Nick,
>>
>> Thanks for the update to the KIP!  It's great to see this moving again.
>> Overall the KIP updates LGTM, but I do have a couple of comments
>>
>> BB1- The KIP specificies using a ConcurrentSkipListMap for the
>> transactional buffer and for IQ READ_UNCOMMITTED queries KS will take a
>> snapshot and during the copy lock the StreamThread with
>> RenetrantReadWriteLock.
>> Now if the IQ queries will only be served from the in-memory transactional
>> buffer, that's an acceptable trade-off since even though the
>> ConcurrentSkipListMap uses weakly consistent iterators, a commit
>> mid-iteration would mean those records not viewed as a result of clearing
>> the buffer.  Would it be worth exploring using IQ against the store and
>> the
>> buffer thus removing the copy+locking?  I think this could be
>> worthwhile (unless just technically unfeasible) as the way it stands users
>> will have to possibly make tradeoffs in performance by adjusting either
>> the
>> commit interval or the size of uncommitted bytes. Of course as specified
>> in
>> the KIP they have the option of going with READ_COMMITTED.
>>
>> BB2 - In the KIP section "Transaction Buffer Thread Safety" there's this
>> sentence which I find confusing:
>>
>> > Reads from the StreamThread and READ_UNCOMMITTED IQ threads are serviced
>> > by the read buffer. The write buffer is (optionally) used to buffer
>> > writes for commit to the database, if doing so would yield better
>> > performance than using the read buffer.
>>
>> Can you clarify what this means?
>>
>> BB3 - How does the `statestore.uncommitted.max.bytes` work in
>> conjuction with the `statestore.cache.max.bytes.buffering` config?  If
>> users elect to go with transactional stores is the statestore cache still
>> needed?  Since the statestore cache flushes on commit() as well as the
>> transactional cache what's the workflow like between the two?  The default
>> size of the store cache is 10M so if it becomes full and a commit is not
>> called  would it empty into the transactional cache?
>>
>> BB4- In discussing the `enable.transactional.statestores` config it
>> mentions "exactly-once, exactly-once-v2 or exactly-once-beta" but we can
>> go
>> with just `exactly-once-v2` , we removed the others since 4.0
>>
>> Thanks!
>> Bill
>>
>>
>> On Fri, Apr 17, 2026 at 1:13 PM Nick Telford <[email protected]>
>> wrote:
>>
>> > Hi everyone,
>> >
>> > We're circling back to KIP-892, with the goal of including it in 4.4. To
>> > that end, I've updated the KIP with substantial design improvements:
>> >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-892:+Transactional+Semantics+for+StateStores
>> >
>> > Highlights:
>> >
>> >    - TransactionBuffers that support both RocksDB and InMemory stores
>> >    - Support for custom IsolationLevel of Interactive Queries,
>> including a
>> >    default (IQv1 and IQv2) and a per-request isolation level (IQv2 only)
>> >    - New store-level metric for uncommitted-bytes
>> >
>> > Let me know what you think!
>> >
>> > Regards,
>> > Nick
>> >
>> > On Wed, 17 Apr 2024 at 11:50, Nick Telford <[email protected]>
>> wrote:
>> >
>> > > Hi Walker,
>> > >
>> > > Feel free to ask away, either on the mailing list of the Confluent
>> > > Community Slack, where I hang out :-)
>> > >
>> > > The implementation is *mostly* complete, although it needs some
>> > polishing.
>> > > It's worth noting that KIP-1035 is a hard prerequisite for this.
>> > >
>> > > Regards,
>> > > Nick
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Reply via email to