Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Colt McNealy Sun, 17 Sep 2023 21:57:23 -0700

> Making IsolationLevel a query-time constraint, rather than linking it to
the processing.guarantee.


As I understand it, would this allow even a user of EOS to control whether
reading committed or uncommitted records? If so, I am highly in favor of
this.

I know that I was one of the early people to point out the current
shortcoming that IQ reads uncommitted records, but just this morning I
realized a pattern we use which means that (for certain queries) our system
needs to be able to read uncommitted records, which is the current behavior
of Kafka Streams in EOS.***

If IsolationLevel being a query-time decision allows for this, then that
would be amazing. I would also vote that the default behavior should be for
reading uncommitted records, because it is totally possible for a valid
application to depend on that behavior, and breaking it in a minor release
might be a bit strong.

*** (Note, for the curious reader....) Our use-case/query pattern is a bit
complex, but reading "uncommitted" records is actually safe in our case
because processing is deterministic. Additionally, IQ being able to read
uncommitted records is crucial to enable "read your own writes" on our API:
Due to the deterministic processing, we send an "ack" to the client who
makes the request as soon as the processor processes the result. If they
can't read uncommitted records, they may receive a "201 - Created"
response, immediately followed by a "404 - Not Found" when doing a lookup
for the object they just created).

Thanks,
Colt McNealy

*Founder, LittleHorse.dev*


On Wed, Sep 13, 2023 at 9:19 AM Nick Telford <nick.telf...@gmail.com> wrote:

> Addendum:
>
> I think we would also face the same problem with the approach John outlined
> earlier (using the record cache as a transaction buffer and flushing it
> straight to SST files). This is because the record cache (the ThreadCache
> class) is not thread-safe, so every commit would invalidate open IQ
> Iterators in the same way that RocksDB WriteBatches do.
> --
> Nick
>
> On Wed, 13 Sept 2023 at 16:58, Nick Telford <nick.telf...@gmail.com>
> wrote:
>
> > Hi Bruno,
> >
> > I've updated the KIP based on our conversation. The only things I've not
> > yet done are:
> >
> > 1. Using transactions under ALOS and EOS.
> > 2. Making IsolationLevel a query-time constraint, rather than linking it
> > to the processing.guarantee.
> >
> > There's a wrinkle that makes this a challenge: Interactive Queries that
> > open an Iterator, when using transactions and READ_UNCOMMITTED.
> > The problem is that under READ_UNCOMMITTED, queries need to be able to
> > read records from the currently uncommitted transaction buffer
> > (WriteBatch). This includes for Iterators, which should iterate both the
> > transaction buffer and underlying database (using
> > WriteBatch#iteratorWithBase()).
> >
> > The issue is that when the StreamThread commits, it writes the current
> > WriteBatch to RocksDB *and then clears the WriteBatch*. Clearing the
> > WriteBatch while an Interactive Query holds an open Iterator on it will
> > invalidate the Iterator. Worse, it turns out that Iterators over a
> > WriteBatch become invalidated not just when the WriteBatch is cleared,
> but
> > also when the Iterators' current key receives a new write.
> >
> > Now that I'm writing this, I remember that this is the major reason that
> I
> > switched the original design from having a query-time IsolationLevel to
> > having the IsolationLevel linked to the transactionality of the stores
> > themselves.
> >
> > It *might* be possible to resolve this, by having a "chain" of
> > WriteBatches, with the StreamThread switching to a new WriteBatch
> whenever
> > a new Interactive Query attempts to read from the database, but that
> could
> > cause some performance problems/memory pressure when subjected to a high
> > Interactive Query load. It would also reduce the efficiency of
> WriteBatches
> > on-commit, as we'd have to write N WriteBatches, where N is the number of
> > Interactive Queries since the last commit.
> >
> > I realise this is getting into the weeds of the implementation, and you'd
> > rather we focus on the API for now, but I think it's important to
> consider
> > how to implement the desired API, in case we come up with an API that
> > cannot be implemented efficiently, or even at all!
> >
> > Thoughts?
> > --
> > Nick
> >
> > On Wed, 13 Sept 2023 at 13:03, Bruno Cadonna <cado...@apache.org> wrote:
> >
> >> Hi Nick,
> >>
> >> 6.
> >> Of course, you are right! My bad!
> >> Wiping out the state in the downgrading case is fine.
> >>
> >>
> >> 3a.
> >> Focus on the public facing changes for the KIP. We will manage to get
> >> the internals right. Regarding state stores that do not support
> >> READ_COMMITTED, they should throw an error stating that they do not
> >> support READ_COMMITTED. No need to adapt all state stores immediately.
> >>
> >> 3b.
> >> I am in favor of using transactions also for ALOS.
> >>
> >>
> >> Best,
> >> Bruno
> >>
> >> On 9/13/23 11:57 AM, Nick Telford wrote:
> >> > Hi Bruno,
> >> >
> >> > Thanks for getting back to me!
> >> >
> >> > 2.
> >> > The fact that implementations can always track estimated memory usage
> in
> >> > the wrapper is a good point. I can remove -1 as an option, and I'll
> >> clarify
> >> > the JavaDoc that 0 is not just for non-transactional stores, which is
> >> > currently misleading.
> >> >
> >> > 6.
> >> > The problem with catching the exception in the downgrade process is
> that
> >> > would require new code in the Kafka version being downgraded to. Since
> >> > users could conceivably downgrade to almost *any* older version of
> Kafka
> >> > Streams, I'm not sure how we could add that code?
> >> > The only way I can think of doing it would be to provide a dedicated
> >> > downgrade tool, that goes through every local store and removes the
> >> > offsets column families. But that seems like an unnecessary amount of
> >> extra
> >> > code to maintain just to handle a somewhat niche situation, when the
> >> > alternative (automatically wipe and restore stores) should be
> >> acceptable.
> >> >
> >> > 1, 4, 5: Agreed. I'll make the changes you've requested.
> >> >
> >> > 3a.
> >> > I agree that IsolationLevel makes more sense at query-time, and I
> >> actually
> >> > initially attempted to place the IsolationLevel at query-time, but I
> ran
> >> > into some problems:
> >> > - The key issue is that, under ALOS we're not staging writes in
> >> > transactions, so can't perform writes at the READ_COMMITTED isolation
> >> > level. However, this may be addressed if we decide to *always* use
> >> > transactions as discussed under 3b.
> >> > - IQv1 and IQv2 have quite different implementations. I remember
> having
> >> > some difficulty understanding the IQv1 internals, which made it
> >> difficult
> >> > to determine what needed to be changed. However, I *think* this can be
> >> > addressed for both implementations by wrapping the RocksDBStore in an
> >> > IsolationLevel-dependent wrapper, that overrides read methods (get,
> >> etc.)
> >> > to either read directly from the database or from the ongoing
> >> transaction.
> >> > But IQv1 might still be difficult.
> >> > - If IsolationLevel becomes a query constraint, then all other
> >> StateStores
> >> > will need to respect it, including the in-memory stores. This would
> >> require
> >> > us to adapt in-memory stores to stage their writes so they can be
> >> isolated
> >> > from READ_COMMITTTED queries. It would also become an important
> >> > consideration for third-party stores on upgrade, as without changes,
> >> they
> >> > would not support READ_COMMITTED queries correctly.
> >> >
> >> > Ultimately, I may need some help making the necessary change to IQv1
> to
> >> > support this, but I don't think it's fundamentally impossible, if we
> >> want
> >> > to pursue this route.
> >> >
> >> > 3b.
> >> > The main reason I chose to keep ALOS un-transactional was to minimize
> >> > behavioural change for most users (I believe most Streams users use
> the
> >> > default configuration, which is ALOS). That said, it's clear that if
> >> ALOS
> >> > also used transactional stores, the only change in behaviour would be
> >> that
> >> > it would become *more correct*, which could be considered a "bug fix"
> by
> >> > users, rather than a change they need to handle.
> >> >
> >> > I believe that performance using transactions (aka. RocksDB
> >> WriteBatches)
> >> > should actually be *better* than the un-batched write-path that is
> >> > currently used[1]. The only "performance" consideration will be the
> >> > increased memory usage that transactions require. Given the
> mitigations
> >> for
> >> > this memory that we have in place, I would expect that this is not a
> >> > problem for most users.
> >> >
> >> > If we're happy to do so, we can make ALOS also use transactions.
> >> >
> >> > Regards,
> >> > Nick
> >> >
> >> > Link 1:
> >> >
> https://github.com/adamretter/rocksjava-write-methods-benchmark#results
> >> >
> >> > On Wed, 13 Sept 2023 at 09:41, Bruno Cadonna <cado...@apache.org>
> >> wrote:
> >> >
> >> >> Hi Nick,
> >> >>
> >> >> Thanks for the updates and sorry for the delay on my side!
> >> >>
> >> >>
> >> >> 1.
> >> >> Making the default implementation for flush() a no-op sounds good to
> >> me.
> >> >>
> >> >>
> >> >> 2.
> >> >> I think what was bugging me here is that a third-party state store
> >> needs
> >> >> to implement the state store interface. That means they need to
> >> >> implement a wrapper around the actual state store as we do for
> RocksDB
> >> >> with RocksDBStore. So, a third-party state store can always estimate
> >> the
> >> >> uncommitted bytes, if it wants, because the wrapper can record the
> >> added
> >> >> bytes.
> >> >> One case I can think of where returning -1 makes sense is when
> Streams
> >> >> does not need to estimate the size of the write batch and trigger
> >> >> extraordinary commits, because the third-party state store takes care
> >> of
> >> >> memory. But in that case the method could also just return 0. Even
> that
> >> >> case would be better solved with a method that returns whether the
> >> state
> >> >> store manages itself the memory used for uncommitted bytes or not.
> >> >> Said that, I am fine with keeping the -1 return value, I was just
> >> >> wondering when and if it will be used.
> >> >>
> >> >> Regarding returning 0 for transactional state stores when the batch
> is
> >> >> empty, I was just wondering because you explicitly stated
> >> >>
> >> >> "or {@code 0} if this StateStore does not support transactions."
> >> >>
> >> >> So it seemed to me returning 0 could only happen for
> non-transactional
> >> >> state stores.
> >> >>
> >> >>
> >> >> 3.
> >> >>
> >> >> a) What do you think if we move the isolation level to IQ (v1 and
> v2)?
> >> >> In the end this is the only component that really needs to specify
> the
> >> >> isolation level. It is similar to the Kafka consumer that can choose
> >> >> with what isolation level to read the input topic.
> >> >> For IQv1 the isolation level should go into StoreQueryParameters. For
> >> >> IQv2, I would add it to the Query interface.
> >> >>
> >> >> b) Point a) raises the question what should happen during
> at-least-once
> >> >> processing when the state store does not use transactions? John in
> the
> >> >> past proposed to also use transactions on state stores for
> >> >> at-least-once. I like that idea, because it avoids aggregating the
> same
> >> >> records over and over again in the case of a failure. We had a case
> in
> >> >> the past where a Streams applications in at-least-once mode was
> failing
> >> >> continuously for some reasons I do not remember before committing the
> >> >> offsets. After each failover, the app aggregated again and again the
> >> >> same records. Of course the aggregate increased to very wrong values
> >> >> just because of the failover. With transactions on the state stores
> we
> >> >> could have avoided this. The app would have output the same aggregate
> >> >> multiple times (i.e., after each failover) but at least the value of
> >> the
> >> >> aggregate would not depend on the number of failovers. Outputting the
> >> >> same aggregate multiple times would be incorrect under exactly-once
> but
> >> >> it is OK for at-least-once.
> >> >> If it makes sense to add a config to turn on and off transactions on
> >> >> state stores under at-least-once or just use transactions in any case
> >> is
> >> >> a question we should also discuss in this KIP. It depends a bit on
> the
> >> >> performance trade-off. Maybe to be safe, I would add a config.
> >> >>
> >> >>
> >> >> 4.
> >> >> Your points are all valid. I tend to say to keep the metrics around
> >> >> flush() until we remove flush() completely from the interface. Calls
> to
> >> >> flush() might still exist since existing processors might still call
> >> >> flush() explicitly as you mentioned in 1). For sure, we need to
> >> document
> >> >> how the metrics change due to the transactions in the upgrade notes.
> >> >>
> >> >>
> >> >> 5.
> >> >> I see. Then you should describe how the .position files are handled
> in
> >> >> a dedicated section of the KIP or incorporate the description in the
> >> >> "Atomic Checkpointing" section instead of only mentioning it in the
> >> >> "Compatibility, Deprecation, and Migration Plan".
> >> >>
> >> >>
> >> >> 6.
> >> >> Describing upgrading and downgrading in the KIP is a good idea.
> >> >> Regarding downgrading, I think you could also catch the exception and
> >> do
> >> >> what is needed to downgrade, e.g., drop the column family. See here
> for
> >> >> an example:
> >> >>
> >> >>
> >> >>
> >>
> https://github.com/apache/kafka/blob/63fee01366e6ce98b9dfafd279a45d40b80e282d/streams/src/main/java/org/apache/kafka/streams/state/internals/RocksDBTimestampedStore.java#L75
> >> >>
> >> >> It is a bit brittle, but it works.
> >> >>
> >> >>
> >> >> Best,
> >> >> Bruno
> >> >>
> >> >>
> >> >> On 8/24/23 12:18 PM, Nick Telford wrote:
> >> >>> Hi Bruno,
> >> >>>
> >> >>> Thanks for taking the time to review the KIP. I'm back from leave
> now
> >> and
> >> >>> intend to move this forwards as quickly as I can.
> >> >>>
> >> >>> Addressing your points:
> >> >>>
> >> >>> 1.
> >> >>> Because flush() is part of the StateStore API, it's exposed to
> custom
> >> >>> Processors, which might be making calls to flush(). This was
> actually
> >> the
> >> >>> case in a few integration tests.
> >> >>> To maintain as much compatibility as possible, I'd prefer not to
> make
> >> >> this
> >> >>> an UnsupportedOperationException, as it will cause previously
> working
> >> >>> Processors to start throwing exceptions at runtime.
> >> >>> I agree that it doesn't make sense for it to proxy commit(), though,
> >> as
> >> >>> that would cause it to violate the "StateStores commit only when the
> >> Task
> >> >>> commits" rule.
> >> >>> Instead, I think we should make this a no-op. That way, existing
> user
> >> >>> Processors will continue to work as-before, without violation of
> store
> >> >>> consistency that would be caused by premature flush/commit of
> >> StateStore
> >> >>> data to disk.
> >> >>> What do you think?
> >> >>>
> >> >>> 2.
> >> >>> As stated in the JavaDoc, when a StateStore implementation is
> >> >>> transactional, but is unable to estimate the uncommitted memory
> usage,
> >> >> the
> >> >>> method will return -1.
> >> >>> The intention here is to permit third-party implementations that may
> >> not
> >> >> be
> >> >>> able to estimate memory usage.
> >> >>>
> >> >>> Yes, it will be 0 when nothing has been written to the store yet. I
> >> >> thought
> >> >>> that was implied by "This method will return an approximation of the
> >> >> memory
> >> >>> would be freed by the next call to {@link #commit(Map)}" and
> "@return
> >> The
> >> >>> approximate size of all records awaiting {@link #commit(Map)}",
> >> however,
> >> >> I
> >> >>> can add it explicitly to the JavaDoc if you think this is unclear?
> >> >>>
> >> >>> 3.
> >> >>> I realise this is probably the most contentious point in my design,
> >> and
> >> >> I'm
> >> >>> open to changing it if I'm unable to convince you of the benefits.
> >> >>> Nevertheless, here's my argument:
> >> >>> The Interactive Query (IQ) API(s) are directly provided StateStores
> to
> >> >>> query, and it may be important for users to programmatically know
> >> which
> >> >>> mode the StateStore is operating under. If we simply provide an
> >> >>> "eosEnabled" boolean (as used throughout the internal streams
> >> engine), or
> >> >>> similar, then users will need to understand the operation and
> >> >> consequences
> >> >>> of each available processing mode and how it pertains to their
> >> >> StateStore.
> >> >>>
> >> >>> Interactive Query users aren't the only people that care about the
> >> >>> processing.mode/IsolationLevel of a StateStore: implementers of
> custom
> >> >>> StateStores also need to understand the behaviour expected of their
> >> >>> implementation. KIP-892 introduces some assumptions into the Streams
> >> >> Engine
> >> >>> about how StateStores operate under each processing mode, and it's
> >> >>> important that custom implementations adhere to those assumptions in
> >> >> order
> >> >>> to maintain the consistency guarantees.
> >> >>>
> >> >>> IsolationLevels provide a high-level contract on the behaviour of
> the
> >> >>> StateStore: a user knows that under READ_COMMITTED, they will see
> >> writes
> >> >>> only after the Task has committed, and under READ_UNCOMMITTED they
> >> will
> >> >> see
> >> >>> writes immediately. No understanding of the details of each
> >> >> processing.mode
> >> >>> is required, either for IQ users or StateStore implementers.
> >> >>>
> >> >>> An argument can be made that these contractual guarantees can simply
> >> be
> >> >>> documented for the processing.mode (i.e. that exactly-once and
> >> >>> exactly-once-v2 behave like READ_COMMITTED and at-least-once behaves
> >> like
> >> >>> READ_UNCOMMITTED), but there are several small issues with this I'd
> >> >> prefer
> >> >>> to avoid:
> >> >>>
> >> >>>      - Where would we document these contracts, in a way that is
> >> difficult
> >> >>>      for users/implementers to miss/ignore?
> >> >>>      - It's not clear to users that the processing mode is
> >> communicating
> >> >>>      an expectation of read isolation, unless they read the
> >> >> documentation. Users
> >> >>>      rarely consult documentation unless they feel they need to, so
> >> it's
> >> >> likely
> >> >>>      this detail would get missed by many users.
> >> >>>      - It tightly couples processing modes to read isolation. Adding
> >> new
> >> >>>      processing modes, or changing the read isolation of existing
> >> >> processing
> >> >>>      modes would be difficult/impossible.
> >> >>>
> >> >>> Ultimately, the cost of introducing IsolationLevels is just a single
> >> >>> method, since we re-use the existing IsolationLevel enum from Kafka.
> >> This
> >> >>> gives us a clear place to document the contractual guarantees
> expected
> >> >>> of/provided by StateStores, that is accessible both by the
> StateStore
> >> >>> itself, and by IQ users.
> >> >>>
> >> >>> (Writing this I've just realised that the StateStore and IQ APIs
> >> actually
> >> >>> don't provide access to StateStoreContext that IQ users would have
> >> direct
> >> >>> access to... Perhaps StateStore should expose isolationLevel()
> itself
> >> >> too?)
> >> >>>
> >> >>> 4.
> >> >>> Yeah, I'm not comfortable renaming the metrics in-place either, as
> >> it's a
> >> >>> backwards incompatible change. My concern is that, if we leave the
> >> >> existing
> >> >>> "flush" metrics in place, they will be confusing to users. Right
> now,
> >> >>> "flush" metrics record explicit flushes to disk, but under KIP-892,
> >> even
> >> >> a
> >> >>> commit() will not explicitly flush data to disk - RocksDB will
> decide
> >> on
> >> >>> when to flush memtables to disk itself.
> >> >>>
> >> >>> If we keep the existing "flush" metrics, we'd have two options,
> which
> >> >> both
> >> >>> seem pretty bad to me:
> >> >>>
> >> >>>      1. Have them record calls to commit(), which would be
> >> misleading, as
> >> >>>      data is no longer explicitly "flushed" to disk by this call.
> >> >>>      2. Have them record nothing at all, which is equivalent to
> >> removing
> >> >> the
> >> >>>      metrics, except that users will see the metric still exists and
> >> so
> >> >> assume
> >> >>>      that the metric is correct, and that there's a problem with
> their
> >> >> system
> >> >>>      when there isn't.
> >> >>>
> >> >>> I agree that removing them is also a bad solution, and I'd like some
> >> >>> guidance on the best path forward here.
> >> >>>
> >> >>> 5.
> >> >>> Position files are updated on every write to a StateStore. Since our
> >> >> writes
> >> >>> are now buffered until commit(), we can't update the Position file
> >> until
> >> >>> commit() has been called, otherwise it would be inconsistent with
> the
> >> >> data
> >> >>> in the event of a rollback. Consequently, we need to manage these
> >> offsets
> >> >>> the same way we manage the checkpoint offsets, and ensure they're
> only
> >> >>> written on commit().
> >> >>>
> >> >>> 6.
> >> >>> Agreed, although I'm not exactly sure yet what tests to write. How
> >> >> explicit
> >> >>> do we need to be here in the KIP?
> >> >>>
> >> >>> As for upgrade/downgrade: upgrade is designed to be seamless, and we
> >> >> should
> >> >>> definitely add some tests around that. Downgrade, it transpires,
> isn't
> >> >>> currently possible, as the extra column family for offset storage is
> >> >>> incompatible with the pre-KIP-892 implementation: when you open a
> >> RocksDB
> >> >>> database, you must open all available column families or receive an
> >> >> error.
> >> >>> What currently happens on downgrade is that it attempts to open the
> >> >> store,
> >> >>> throws an error about the offsets column family not being opened,
> >> which
> >> >>> triggers a wipe and rebuild of the Task. Given that downgrades
> should
> >> be
> >> >>> uncommon, I think this is acceptable behaviour, as the end-state is
> >> >>> consistent, even if it results in an undesirable state restore.
> >> >>>
> >> >>> Should I document the upgrade/downgrade behaviour explicitly in the
> >> KIP?
> >> >>>
> >> >>> --
> >> >>>
> >> >>> Regards,
> >> >>> Nick
> >> >>>
> >> >>>
> >> >>> On Mon, 14 Aug 2023 at 22:31, Bruno Cadonna <cado...@apache.org>
> >> wrote:
> >> >>>
> >> >>>> Hi Nick!
> >> >>>>
> >> >>>> Thanks for the updates!
> >> >>>>
> >> >>>> 1.
> >> >>>> Why does StateStore#flush() default to
> >> >>>> StateStore#commit(Collections.emptyMap())?
> >> >>>> Since calls to flush() will not exist anymore after this KIP is
> >> >>>> released, I would rather throw an unsupported operation exception
> by
> >> >>>> default.
> >> >>>>
> >> >>>>
> >> >>>> 2.
> >> >>>> When would a state store return -1 from
> >> >>>> StateStore#approximateNumUncommittedBytes() while being
> >> transactional?
> >> >>>>
> >> >>>> Wouldn't StateStore#approximateNumUncommittedBytes() also return 0
> if
> >> >>>> the state store is transactional but nothing has been written to
> the
> >> >>>> state store yet?
> >> >>>>
> >> >>>>
> >> >>>> 3.
> >> >>>> Sorry for bringing this up again. Does this KIP really need to
> >> introduce
> >> >>>> StateStoreContext#isolationLevel()? StateStoreContext has already
> >> >>>> appConfigs() which basically exposes the same information, i.e., if
> >> EOS
> >> >>>> is enabled or not.
> >> >>>> In one of your previous e-mails you wrote:
> >> >>>>
> >> >>>> "My idea was to try to keep the StateStore interface as loosely
> >> coupled
> >> >>>> from the Streams engine as possible, to give implementers more
> >> freedom,
> >> >>>> and reduce the amount of internal knowledge required."
> >> >>>>
> >> >>>> While I understand the intent, I doubt that it decreases the
> >> coupling of
> >> >>>> a StateStore interface and the Streams engine. READ_COMMITTED only
> >> >>>> applies to IQ but not to reads by processors. Thus, implementers
> >> need to
> >> >>>> understand how Streams accesses the state stores.
> >> >>>>
> >> >>>> I would like to hear what others think about this.
> >> >>>>
> >> >>>>
> >> >>>> 4.
> >> >>>> Great exposing new metrics for transactional state stores!
> However, I
> >> >>>> would prefer to add new metrics and deprecate (in the docs) the old
> >> >>>> ones. You can find examples of deprecated metrics here:
> >> >>>> https://kafka.apache.org/documentation/#selector_monitoring
> >> >>>>
> >> >>>>
> >> >>>> 5.
> >> >>>> Why does the KIP mention position files? I do not think they are
> >> related
> >> >>>> to transactions or flushes.
> >> >>>>
> >> >>>>
> >> >>>> 6.
> >> >>>> I think we will also need to adapt/add integration tests besides
> unit
> >> >>>> tests. Additionally, we probably need integration or system tests
> to
> >> >>>> verify that upgrades and downgrades between transactional and
> >> >>>> non-transactional state stores work as expected.
> >> >>>>
> >> >>>>
> >> >>>> Best,
> >> >>>> Bruno
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> On 7/21/23 10:34 PM, Nick Telford wrote:
> >> >>>>> One more thing: I noted John's suggestion in the KIP, under
> >> "Rejected
> >> >>>>> Alternatives". I still think it's an idea worth pursuing, but I
> >> believe
> >> >>>>> that it's out of the scope of this KIP, because it solves a
> >> different
> >> >> set
> >> >>>>> of problems to this KIP, and the scope of this one has already
> grown
> >> >>>> quite
> >> >>>>> large!
> >> >>>>>
> >> >>>>> On Fri, 21 Jul 2023 at 21:33, Nick Telford <
> nick.telf...@gmail.com>
> >> >>>> wrote:
> >> >>>>>
> >> >>>>>> Hi everyone,
> >> >>>>>>
> >> >>>>>> I've updated the KIP (
> >> >>>>>>
> >> >>>>
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-892%3A+Transactional+Semantics+for+StateStores
> >> >>>> )
> >> >>>>>> with the latest changes; mostly bringing back "Atomic
> >> Checkpointing"
> >> >>>> (for
> >> >>>>>> what feels like the 10th time!). I think the one thing missing is
> >> some
> >> >>>>>> changes to metrics (notably the store "flush" metrics will need
> to
> >> be
> >> >>>>>> renamed to "commit").
> >> >>>>>>
> >> >>>>>> The reason I brought back Atomic Checkpointing was to decouple
> >> store
> >> >>>> flush
> >> >>>>>> from store commit. This is important, because with Transactional
> >> >>>>>> StateStores, we now need to call "flush" on *every* Task commit,
> >> and
> >> >> not
> >> >>>>>> just when the StateStore is closing, otherwise our transaction
> >> buffer
> >> >>>> will
> >> >>>>>> never be written and persisted, instead growing unbounded! I
> >> >>>> experimented
> >> >>>>>> with some simple solutions, like forcing a store flush whenever
> the
> >> >>>>>> transaction buffer was likely to exceed its configured size, but
> >> this
> >> >>>> was
> >> >>>>>> brittle: it prevented the transaction buffer from being
> configured
> >> to
> >> >> be
> >> >>>>>> unbounded, and it still would have required explicit flushes of
> >> >> RocksDB,
> >> >>>>>> yielding sub-optimal performance and memory utilization.
> >> >>>>>>
> >> >>>>>> I deemed Atomic Checkpointing to be the "right" way to resolve
> this
> >> >>>>>> problem. By ensuring that the changelog offsets that correspond
> to
> >> the
> >> >>>> most
> >> >>>>>> recently written records are always atomically written to the
> >> >> StateStore
> >> >>>>>> (by writing them to the same transaction buffer), we can avoid
> >> >> forcibly
> >> >>>>>> flushing the RocksDB memtables to disk, letting RocksDB flush
> them
> >> >> only
> >> >>>>>> when necessary, without losing any of our consistency guarantees.
> >> See
> >> >>>> the
> >> >>>>>> updated KIP for more info.
> >> >>>>>>
> >> >>>>>> I have fully implemented these changes, although I'm still not
> >> >> entirely
> >> >>>>>> happy with the implementation for segmented StateStores, so I
> plan
> >> to
> >> >>>>>> refactor that. Despite that, all tests pass. If you'd like to try
> >> out
> >> >> or
> >> >>>>>> review this highly experimental and incomplete branch, it's
> >> available
> >> >>>> here:
> >> >>>>>> https://github.com/nicktelford/kafka/tree/KIP-892-3.5.0. Note:
> >> it's
> >> >>>> built
> >> >>>>>> against Kafka 3.5.0 so that I had a stable base to build and test
> >> it
> >> >> on,
> >> >>>>>> and to enable easy apples-to-apples comparisons in a live
> >> >> environment. I
> >> >>>>>> plan to rebase it against trunk once it's nearer completion and
> has
> >> >> been
> >> >>>>>> proven on our main application.
> >> >>>>>>
> >> >>>>>> I would really appreciate help in reviewing and testing:
> >> >>>>>> - Segmented (Versioned, Session and Window) stores
> >> >>>>>> - Global stores
> >> >>>>>>
> >> >>>>>> As I do not currently use either of these, so my primary test
> >> >>>> environment
> >> >>>>>> doesn't test these areas.
> >> >>>>>>
> >> >>>>>> I'm going on Parental Leave starting next week for a few weeks,
> so
> >> >> will
> >> >>>>>> not have time to move this forward until late August. That said,
> >> your
> >> >>>>>> feedback is welcome and appreciated, I just won't be able to
> >> respond
> >> >> as
> >> >>>>>> quickly as usual.
> >> >>>>>>
> >> >>>>>> Regards,
> >> >>>>>> Nick
> >> >>>>>>
> >> >>>>>> On Mon, 3 Jul 2023 at 16:23, Nick Telford <
> nick.telf...@gmail.com>
> >> >>>> wrote:
> >> >>>>>>
> >> >>>>>>> Hi Bruno
> >> >>>>>>>
> >> >>>>>>> Yes, that's correct, although the impact on IQ is not something
> I
> >> had
> >> >>>>>>> considered.
> >> >>>>>>>
> >> >>>>>>> What about atomically updating the state store from the
> >> transaction
> >> >>>>>>>> buffer every commit interval and writing the checkpoint (thus,
> >> >>>> flushing
> >> >>>>>>>> the memtable) every configured amount of data and/or number of
> >> >> commit
> >> >>>>>>>> intervals?
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>> I'm not quite sure I follow. Are you suggesting that we add an
> >> >>>> additional
> >> >>>>>>> config for the max number of commit intervals between
> checkpoints?
> >> >> That
> >> >>>>>>> way, we would checkpoint *either* when the transaction buffers
> are
> >> >>>> nearly
> >> >>>>>>> full, *OR* whenever a certain number of commit intervals have
> >> >> elapsed,
> >> >>>>>>> whichever comes first?
> >> >>>>>>>
> >> >>>>>>> That certainly seems reasonable, although this re-ignites an
> >> earlier
> >> >>>>>>> debate about whether a config should be measured in "number of
> >> commit
> >> >>>>>>> intervals", instead of just an absolute time.
> >> >>>>>>>
> >> >>>>>>> FWIW, I realised that this issue is the reason I was pursuing
> the
> >> >>>> Atomic
> >> >>>>>>> Checkpoints, as it de-couples memtable flush from checkpointing,
> >> >> which
> >> >>>>>>> enables us to just checkpoint on every commit without any
> >> performance
> >> >>>>>>> impact. Atomic Checkpointing is definitely the "best" solution,
> >> but
> >> >>>> I'm not
> >> >>>>>>> sure if this is enough to bring it back into this KIP.
> >> >>>>>>>
> >> >>>>>>> I'm currently working on moving all the transactional logic
> >> directly
> >> >>>> into
> >> >>>>>>> RocksDBStore itself, which does away with the
> >> >> StateStore#newTransaction
> >> >>>>>>> method, and reduces the number of new classes introduced,
> >> >> significantly
> >> >>>>>>> reducing the complexity. If it works, and the complexity is
> >> >> drastically
> >> >>>>>>> reduced, I may try bringing back Atomic Checkpoints into this
> KIP.
> >> >>>>>>>
> >> >>>>>>> Regards,
> >> >>>>>>> Nick
> >> >>>>>>>
> >> >>>>>>> On Mon, 3 Jul 2023 at 15:27, Bruno Cadonna <cado...@apache.org>
> >> >> wrote:
> >> >>>>>>>
> >> >>>>>>>> Hi Nick,
> >> >>>>>>>>
> >> >>>>>>>> Thanks for the insights! Very interesting!
> >> >>>>>>>>
> >> >>>>>>>> As far as I understand, you want to atomically update the state
> >> >> store
> >> >>>>>>>> from the transaction buffer, flush the memtable of a state
> store
> >> and
> >> >>>>>>>> write the checkpoint not after the commit time elapsed but
> after
> >> the
> >> >>>>>>>> transaction buffer reached a size that would lead to exceeding
> >> >>>>>>>> statestore.transaction.buffer.max.bytes before the next commit
> >> >>>> interval
> >> >>>>>>>> ends.
> >> >>>>>>>> That means, the Kafka transaction would commit every commit
> >> interval
> >> >>>> but
> >> >>>>>>>> the state store will only be atomically updated roughly every
> >> >>>>>>>> statestore.transaction.buffer.max.bytes of data. Also IQ would
> >> then
> >> >>>> only
> >> >>>>>>>> see new data roughly every
> >> statestore.transaction.buffer.max.bytes.
> >> >>>>>>>> After a failure the state store needs to restore up to
> >> >>>>>>>> statestore.transaction.buffer.max.bytes.
> >> >>>>>>>>
> >> >>>>>>>> Is this correct?
> >> >>>>>>>>
> >> >>>>>>>> What about atomically updating the state store from the
> >> transaction
> >> >>>>>>>> buffer every commit interval and writing the checkpoint (thus,
> >> >>>> flushing
> >> >>>>>>>> the memtable) every configured amount of data and/or number of
> >> >> commit
> >> >>>>>>>> intervals? In such a way, we would have the same delay for
> >> records
> >> >>>>>>>> appearing in output topics and IQ because both would appear
> when
> >> the
> >> >>>>>>>> Kafka transaction is committed. However, after a failure the
> >> state
> >> >>>> store
> >> >>>>>>>> still needs to restore up to
> >> statestore.transaction.buffer.max.bytes
> >> >>>> and
> >> >>>>>>>> it might restore data that is already in the state store
> because
> >> the
> >> >>>>>>>> checkpoint lags behind the last stable offset (i.e. the last
> >> >> committed
> >> >>>>>>>> offset) of the changelog topics. Restoring data that is already
> >> in
> >> >> the
> >> >>>>>>>> state store is idempotent, so eos should not violated.
> >> >>>>>>>> This solution needs at least one new config to specify when a
> >> >>>> checkpoint
> >> >>>>>>>> should be written.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> A small correction to your previous e-mail that does not change
> >> >>>> anything
> >> >>>>>>>> you said: Under alos the default commit interval is 30 seconds,
> >> not
> >> >>>> five
> >> >>>>>>>> seconds.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Best,
> >> >>>>>>>> Bruno
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> On 01.07.23 12:37, Nick Telford wrote:
> >> >>>>>>>>> Hi everyone,
> >> >>>>>>>>>
> >> >>>>>>>>> I've begun performance testing my branch on our staging
> >> >> environment,
> >> >>>>>>>>> putting it through its paces in our non-trivial application.
> I'm
> >> >>>>>>>> already
> >> >>>>>>>>> observing the same increased flush rate that we saw the last
> >> time
> >> >> we
> >> >>>>>>>>> attempted to use a version of this KIP, but this time, I
> think I
> >> >> know
> >> >>>>>>>> why.
> >> >>>>>>>>>
> >> >>>>>>>>> Pre-KIP-892, StreamTask#postCommit, which is called at the end
> >> of
> >> >> the
> >> >>>>>>>> Task
> >> >>>>>>>>> commit process, has the following behaviour:
> >> >>>>>>>>>
> >> >>>>>>>>>        - Under ALOS: checkpoint the state stores. This
> includes
> >> >>>>>>>>>        flushing memtables in RocksDB. This is acceptable
> >> because the
> >> >>>>>>>> default
> >> >>>>>>>>>        commit.interval.ms is 5 seconds, so forcibly flushing
> >> >> memtables
> >> >>>>>>>> every 5
> >> >>>>>>>>>        seconds is acceptable for most applications.
> >> >>>>>>>>>        - Under EOS: checkpointing is not done, *unless* it's
> >> being
> >> >>>>>>>> forced, due
> >> >>>>>>>>>        to e.g. the Task closing or being revoked. This means
> >> that
> >> >> under
> >> >>>>>>>> normal
> >> >>>>>>>>>        processing conditions, the state stores will not be
> >> >>>> checkpointed,
> >> >>>>>>>> and will
> >> >>>>>>>>>        not have memtables flushed at all , unless RocksDB
> >> decides to
> >> >>>>>>>> flush them on
> >> >>>>>>>>>        its own. Checkpointing stores and force-flushing their
> >> >> memtables
> >> >>>>>>>> is only
> >> >>>>>>>>>        done when a Task is being closed.
> >> >>>>>>>>>
> >> >>>>>>>>> Under EOS, KIP-892 needs to checkpoint stores on at least
> *some*
> >> >>>> normal
> >> >>>>>>>>> Task commits, in order to write the RocksDB transaction
> buffers
> >> to
> >> >>>> the
> >> >>>>>>>>> state stores, and to ensure the offsets are synced to disk to
> >> >> prevent
> >> >>>>>>>>> restores from getting out of hand. Consequently, my current
> >> >>>>>>>> implementation
> >> >>>>>>>>> calls maybeCheckpoint on *every* Task commit, which is far too
> >> >>>>>>>> frequent.
> >> >>>>>>>>> This causes checkpoints every 10,000 records, which is a
> change
> >> in
> >> >>>>>>>> flush
> >> >>>>>>>>> behaviour, potentially causing performance problems for some
> >> >>>>>>>> applications.
> >> >>>>>>>>>
> >> >>>>>>>>> I'm looking into possible solutions, and I'm currently leaning
> >> >>>> towards
> >> >>>>>>>>> using the statestore.transaction.buffer.max.bytes
> configuration
> >> to
> >> >>>>>>>>> checkpoint Tasks once we are likely to exceed it. This would
> >> >>>>>>>> complement the
> >> >>>>>>>>> existing "early Task commit" functionality that this
> >> configuration
> >> >>>>>>>>> provides, in the following way:
> >> >>>>>>>>>
> >> >>>>>>>>>        - Currently, we use
> >> statestore.transaction.buffer.max.bytes
> >> >> to
> >> >>>>>>>> force an
> >> >>>>>>>>>        early Task commit if processing more records would
> cause
> >> our
> >> >>>> state
> >> >>>>>>>> store
> >> >>>>>>>>>        transactions to exceed the memory assigned to them.
> >> >>>>>>>>>        - New functionality: when a Task *does* commit, we will
> >> not
> >> >>>>>>>> checkpoint
> >> >>>>>>>>>        the stores (and hence flush the transaction buffers)
> >> unless
> >> >> we
> >> >>>>>>>> expect to
> >> >>>>>>>>>        cross the statestore.transaction.buffer.max.bytes
> >> threshold
> >> >>>> before
> >> >>>>>>>> the next
> >> >>>>>>>>>        commit
> >> >>>>>>>>>
> >> >>>>>>>>> I'm also open to suggestions.
> >> >>>>>>>>>
> >> >>>>>>>>> Regards,
> >> >>>>>>>>> Nick
> >> >>>>>>>>>
> >> >>>>>>>>> On Thu, 22 Jun 2023 at 14:06, Nick Telford <
> >> nick.telf...@gmail.com
> >> >>>
> >> >>>>>>>> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hi Bruno!
> >> >>>>>>>>>>
> >> >>>>>>>>>> 3.
> >> >>>>>>>>>> By "less predictable for users", I meant in terms of
> >> understanding
> >> >>>> the
> >> >>>>>>>>>> performance profile under various circumstances. The more
> >> complex
> >> >>>> the
> >> >>>>>>>>>> solution, the more difficult it would be for users to
> >> understand
> >> >> the
> >> >>>>>>>>>> performance they see. For example, spilling records to disk
> >> when
> >> >> the
> >> >>>>>>>>>> transaction buffer reaches a threshold would, I expect,
> reduce
> >> >> write
> >> >>>>>>>>>> throughput. This reduction in write throughput could be
> >> >> unexpected,
> >> >>>>>>>> and
> >> >>>>>>>>>> potentially difficult to diagnose/understand for users.
> >> >>>>>>>>>> At the moment, I think the "early commit" concept is
> relatively
> >> >>>>>>>>>> straightforward; it's easy to document, and conceptually
> fairly
> >> >>>>>>>> obvious to
> >> >>>>>>>>>> users. We could probably add a metric to make it easier to
> >> >>>> understand
> >> >>>>>>>> when
> >> >>>>>>>>>> it happens though.
> >> >>>>>>>>>>
> >> >>>>>>>>>> 3. (the second one)
> >> >>>>>>>>>> The IsolationLevel is *essentially* an indirect way of
> telling
> >> >>>>>>>> StateStores
> >> >>>>>>>>>> whether they should be transactional. READ_COMMITTED
> >> essentially
> >> >>>>>>>> requires
> >> >>>>>>>>>> transactions, because it dictates that two threads calling
> >> >>>>>>>>>> `newTransaction()` should not see writes from the other
> >> >> transaction
> >> >>>>>>>> until
> >> >>>>>>>>>> they have been committed. With READ_UNCOMMITTED, all bets are
> >> off,
> >> >>>> and
> >> >>>>>>>>>> stores can allow threads to observe written records at any
> >> time,
> >> >>>>>>>> which is
> >> >>>>>>>>>> essentially "no transactions". That said, StateStores are
> free
> >> to
> >> >>>>>>>> implement
> >> >>>>>>>>>> these guarantees however they can, which is a bit more
> relaxed
> >> >> than
> >> >>>>>>>>>> dictating "you must use transactions". For example, with
> >> RocksDB
> >> >> we
> >> >>>>>>>> would
> >> >>>>>>>>>> implement these as READ_COMMITTED == WBWI-based
> "transactions",
> >> >>>>>>>>>> READ_UNCOMMITTED == direct writes to the database. But with
> >> other
> >> >>>>>>>> storage
> >> >>>>>>>>>> engines, it might be preferable to *always* use transactions,
> >> even
> >> >>>>>>>> when
> >> >>>>>>>>>> unnecessary; or there may be storage engines that don't
> provide
> >> >>>>>>>>>> transactions, but the isolation guarantees can be met using a
> >> >>>>>>>> different
> >> >>>>>>>>>> technique.
> >> >>>>>>>>>> My idea was to try to keep the StateStore interface as
> loosely
> >> >>>> coupled
> >> >>>>>>>>>> from the Streams engine as possible, to give implementers
> more
> >> >>>>>>>> freedom, and
> >> >>>>>>>>>> reduce the amount of internal knowledge required.
> >> >>>>>>>>>> That said, I understand that "IsolationLevel" might not be
> the
> >> >> right
> >> >>>>>>>>>> abstraction, and we can always make it much more explicit if
> >> >>>>>>>> required, e.g.
> >> >>>>>>>>>> boolean transactional()
> >> >>>>>>>>>>
> >> >>>>>>>>>> 7-8.
> >> >>>>>>>>>> I can make these changes either later today or tomorrow.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Small update:
> >> >>>>>>>>>> I've rebased my branch on trunk and fixed a bunch of issues
> >> that
> >> >>>>>>>> needed
> >> >>>>>>>>>> addressing. Currently, all the tests pass, which is
> promising,
> >> but
> >> >>>> it
> >> >>>>>>>> will
> >> >>>>>>>>>> need to undergo some performance testing. I haven't (yet)
> >> worked
> >> >> on
> >> >>>>>>>>>> removing the `newTransaction()` stuff, but I would expect
> that,
> >> >>>>>>>>>> behaviourally, it should make no difference. The branch is
> >> >> available
> >> >>>>>>>> at
> >> >>>>>>>>>> https://github.com/nicktelford/kafka/tree/KIP-892-c if
> anyone
> >> is
> >> >>>>>>>>>> interested in taking an early look.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Regards,
> >> >>>>>>>>>> Nick
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Thu, 22 Jun 2023 at 11:59, Bruno Cadonna <
> >> cado...@apache.org>
> >> >>>>>>>> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Hi Nick,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 1.
> >> >>>>>>>>>>> Yeah, I agree with you. That was actually also my point. I
> >> >>>> understood
> >> >>>>>>>>>>> that John was proposing the ingestion path as a way to avoid
> >> the
> >> >>>>>>>> early
> >> >>>>>>>>>>> commits. Probably, I misinterpreted the intent.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 2.
> >> >>>>>>>>>>> I agree with John here, that actually it is public API. My
> >> >> question
> >> >>>>>>>> is
> >> >>>>>>>>>>> how this usage pattern affects normal processing.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 3.
> >> >>>>>>>>>>> My concern is that checking for the size of the transaction
> >> >> buffer
> >> >>>>>>>> and
> >> >>>>>>>>>>> maybe triggering an early commit affects the whole
> processing
> >> of
> >> >>>>>>>> Kafka
> >> >>>>>>>>>>> Streams. The transactionality of a state store is not
> >> confined to
> >> >>>> the
> >> >>>>>>>>>>> state store itself, but spills over and changes the behavior
> >> of
> >> >>>> other
> >> >>>>>>>>>>> parts of the system. I agree with you that it is a decent
> >> >>>>>>>> compromise. I
> >> >>>>>>>>>>> just wanted to analyse the downsides and list the options to
> >> >>>> overcome
> >> >>>>>>>>>>> them. I also agree with you that all options seem quite
> heavy
> >> >>>>>>>> compared
> >> >>>>>>>>>>> with your KIP. I do not understand what you mean with "less
> >> >>>>>>>> predictable
> >> >>>>>>>>>>> for users", though.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I found the discussions about the alternatives really
> >> >> interesting.
> >> >>>>>>>> But I
> >> >>>>>>>>>>> also think that your plan sounds good and we should continue
> >> with
> >> >>>> it!
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Some comments on your reply to my e-mail on June 20th:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 3.
> >> >>>>>>>>>>> Ah, now, I understand the reasoning behind putting isolation
> >> >> level
> >> >>>> in
> >> >>>>>>>>>>> the state store context. Thanks! Should that also be a way
> to
> >> >> give
> >> >>>>>>>> the
> >> >>>>>>>>>>> the state store the opportunity to decide whether to turn on
> >> >>>>>>>>>>> transactions or not?
> >> >>>>>>>>>>> With my comment, I was more concerned about how do you know
> >> if a
> >> >>>>>>>>>>> checkpoint file needs to be written under EOS, if you do not
> >> >> have a
> >> >>>>>>>> way
> >> >>>>>>>>>>> to know if the state store is transactional or not. If a
> state
> >> >>>> store
> >> >>>>>>>> is
> >> >>>>>>>>>>> transactional, the checkpoint file can be written during
> >> normal
> >> >>>>>>>>>>> processing under EOS. If the state store is not
> transactional,
> >> >> the
> >> >>>>>>>>>>> checkpoint file must not be written under EOS.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 7.
> >> >>>>>>>>>>> My point was about not only considering the bytes in memory
> in
> >> >>>> config
> >> >>>>>>>>>>> statestore.uncommitted.max.bytes, but also bytes that might
> be
> >> >>>>>>>> spilled
> >> >>>>>>>>>>> on disk. Basically, I was wondering whether you should
> remove
> >> the
> >> >>>>>>>>>>> "memory" in "Maximum number of memory bytes to be used to
> >> >>>>>>>>>>> buffer uncommitted state-store records." My thinking was
> that
> >> >> even
> >> >>>>>>>> if a
> >> >>>>>>>>>>> state store spills uncommitted bytes to disk, limiting the
> >> >> overall
> >> >>>>>>>> bytes
> >> >>>>>>>>>>> might make sense. Thinking about it again and considering
> the
> >> >>>> recent
> >> >>>>>>>>>>> discussions, it does not make too much sense anymore.
> >> >>>>>>>>>>> I like the name statestore.transaction.buffer.max.bytes that
> >> you
> >> >>>>>>>> proposed.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 8.
> >> >>>>>>>>>>> A high-level description (without implementation details) of
> >> how
> >> >>>>>>>> Kafka
> >> >>>>>>>>>>> Streams will manage the commit of changelog transactions,
> >> state
> >> >>>> store
> >> >>>>>>>>>>> transactions and checkpointing would be great. Would be
> great
> >> if
> >> >>>> you
> >> >>>>>>>>>>> could also add some sentences about the behavior in case of
> a
> >> >>>>>>>> failure.
> >> >>>>>>>>>>> For instance how does a transactional state store recover
> >> after a
> >> >>>>>>>>>>> failure or what happens with the transaction buffer, etc.
> >> (that
> >> >> is
> >> >>>>>>>> what
> >> >>>>>>>>>>> I meant by "fail-over" in point 9.)
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Best,
> >> >>>>>>>>>>> Bruno
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> On 21.06.23 18:50, Nick Telford wrote:
> >> >>>>>>>>>>>> Hi Bruno,
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> 1.
> >> >>>>>>>>>>>> Isn't this exactly the same issue that WriteBatchWithIndex
> >> >>>>>>>> transactions
> >> >>>>>>>>>>>> have, whereby exceeding (or likely to exceed) configured
> >> memory
> >> >>>>>>>> needs to
> >> >>>>>>>>>>>> trigger an early commit?
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> 2.
> >> >>>>>>>>>>>> This is one of my big concerns. Ultimately, any approach
> >> based
> >> >> on
> >> >>>>>>>>>>> cracking
> >> >>>>>>>>>>>> open RocksDB internals and using it in ways it's not really
> >> >>>> designed
> >> >>>>>>>>>>> for is
> >> >>>>>>>>>>>> likely to have some unforseen performance or consistency
> >> issues.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> 3.
> >> >>>>>>>>>>>> What's your motivation for removing these early commits?
> >> While
> >> >> not
> >> >>>>>>>>>>> ideal, I
> >> >>>>>>>>>>>> think they're a decent compromise to ensure consistency
> >> whilst
> >> >>>>>>>>>>> maintaining
> >> >>>>>>>>>>>> good and predictable performance.
> >> >>>>>>>>>>>> All 3 of your suggested ideas seem *very* complicated, and
> >> might
> >> >>>>>>>>>>> actually
> >> >>>>>>>>>>>> make behaviour less predictable for users as a consequence.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> I'm a bit concerned that the scope of this KIP is growing a
> >> bit
> >> >>>> out
> >> >>>>>>>> of
> >> >>>>>>>>>>>> control. While it's good to discuss ideas for future
> >> >>>> improvements, I
> >> >>>>>>>>>>> think
> >> >>>>>>>>>>>> it's important to narrow the scope down to a design that
> >> >> achieves
> >> >>>>>>>> the
> >> >>>>>>>>>>> most
> >> >>>>>>>>>>>> pressing objectives (constant sized restorations during
> dirty
> >> >>>>>>>>>>>> close/unexpected errors). Any design that this KIP produces
> >> can
> >> >>>>>>>>>>> ultimately
> >> >>>>>>>>>>>> be changed in the future, especially if the bulk of it is
> >> >> internal
> >> >>>>>>>>>>>> behaviour.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> I'm going to spend some time next week trying to re-work
> the
> >> >>>>>>>> original
> >> >>>>>>>>>>>> WriteBatchWithIndex design to remove the newTransaction()
> >> >> method,
> >> >>>>>>>> such
> >> >>>>>>>>>>> that
> >> >>>>>>>>>>>> it's just an implementation detail of RocksDBStore. That
> >> way, if
> >> >>>> we
> >> >>>>>>>>>>> want to
> >> >>>>>>>>>>>> replace WBWI with something in the future, like the SST
> file
> >> >>>>>>>> management
> >> >>>>>>>>>>>> outlined by John, then we can do so with little/no API
> >> changes.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Regards,
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>> Nick
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>
> >> >
> >>
> >
>

Re: [DISCUSS] KIP-892: Transactional Semantics for StateStores

Reply via email to