Re: [DISCUSS] KIP-844: Transactional State Stores

Matthias J. Sax Thu, 22 Feb 2024 19:08:26 -0800

To close the loop on this thread: KIP-892 was accepted and is currently implemented. Thus I'll go ahead and mark this KIP as discarded.

Thanks a lot Alex for spending so much time on this very important feature! Without your groundwork, we would not have KIP-892, and your contributions are noticed!


-Matthias


On 11/21/22 5:12 AM, Nick Telford wrote:

Hi Alex,

Thanks for getting back to me. I actually have most of a working
implementation already. I'm going to write it up as a new KIP, so that it
can be reviewed independently of KIP-844.

Hopefully, working together we can have it ready sooner.

I'll keep you posted on my progress.

Regards,
Nick

On Mon, 21 Nov 2022 at 11:25, Alexander Sorokoumov
<asorokou...@confluent.io.invalid> wrote:

Hey Nick,

Thank you for the prototype testing and benchmarking, and sorry for the
late reply!

I agree that it is worth revisiting the WriteBatchWithIndex approach. I
will implement a fork of the current prototype that uses that mechanism to
ensure transactionality and let you know when it is ready for
review/testing in this ML thread.

As for time estimates, I might not have enough time to finish the prototype
in December, so it will probably be ready for review in January.

Best,
Alex

On Fri, Nov 11, 2022 at 4:24 PM Nick Telford <nick.telf...@gmail.com>
wrote:

Hi everyone,

Sorry to dredge this up again. I've had a chance to start doing some
testing with the WIP Pull Request, and it appears as though the secondary
store solution performs rather poorly.

In our testing, we had a non-transactional state store that would restore
(from scratch), at a rate of nearly 1,000,000 records/second. When we
switched it to a transactional store, it restored at a rate of less than
40,000 records/second.

I suspect the key issues here are having to copy the data out of the
temporary store and into the main store on-commit, and to a lesser extent,
the extra memory copies during writes.

I think it's worth re-visiting the WriteBatchWithIndex solution, as it's
clear from the RocksDB post[1] on the subject that it's the recommended way
to achieve transactionality.
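
To make the indexed write batch idea concrete, here is a minimal in-memory
sketch of the pattern: writes are buffered in a pending batch, reads consult
that batch before the committed data, and a commit applies the batch as a
whole. It does not use the RocksDB API; BufferedStore and all of its methods
are hypothetical names chosen for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only: not the RocksDB API and not Kafka Streams code.
final class BufferedStore {
    private final Map<String, String> committed = new HashMap<>();
    // Pending writes; Optional.empty() marks a pending delete (tombstone).
    private final Map<String, Optional<String>> uncommitted = new HashMap<>();

    void put(String key, String value) {
        uncommitted.put(key, Optional.of(value));
    }

    void delete(String key) {
        uncommitted.put(key, Optional.empty());
    }

    // Reads check the pending batch first, so a writer sees its own uncommitted writes.
    String get(String key) {
        Optional<String> pending = uncommitted.get(key);
        if (pending != null) {
            return pending.orElse(null);
        }
        return committed.get(key);
    }

    // Applies the whole pending batch to the committed data, then clears it.
    void commit() {
        for (Map.Entry<String, Optional<String>> e : uncommitted.entrySet()) {
            if (e.getValue().isPresent()) {
                committed.put(e.getKey(), e.getValue().get());
            } else {
                committed.remove(e.getKey());
            }
        }
        uncommitted.clear();
    }

    // Drops all pending writes; committed data is untouched.
    void rollback() {
        uncommitted.clear();
    }

    public static void main(String[] args) {
        BufferedStore store = new BufferedStore();
        store.put("a", "1");
        store.commit();
        store.put("a", "2");
        System.out.println(store.get("a")); // pending value is visible
        store.rollback();
        System.out.println(store.get("a")); // back to the committed value
    }
}
```

In this sketch the batch lives entirely in memory, which is exactly the
constraint discussed next: uncommitted writes have to fit in the buffer.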

The only issue you identified with this solution was that uncommitted
writes are required to entirely fit in-memory, and RocksDB recommends they
don't exceed 3-4MiB. If we do some back-of-the-envelope calculations, I
think we'll find that this will be a non-issue for all but the most extreme
cases, and for those, I think I have a fairly simple solution.

Firstly, when EOS is enabled, the default commit.interval.ms is set to
100ms, which provides fairly short intervals that uncommitted writes need
to be buffered in-memory. If we assume a worst case of 1024 byte records
(and for most cases, they should be much smaller), then 4MiB would hold
~4096 records, which with 100ms commit intervals is a throughput of
approximately 40,960 records/second. This seems quite reasonable.
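
The back-of-the-envelope numbers above can be reproduced with a few lines of
arithmetic. This is only a sketch: WriteBatchSizing and its methods are
hypothetical helpers, and the inputs (4 MiB batch, 1024-byte records, 100ms
commit interval) are the assumptions stated in the paragraph above.

```java
// Hypothetical helper for the estimate above; not part of Kafka Streams or the KIP.
final class WriteBatchSizing {

    // Number of fixed-size records that fit in a batch of the given size.
    static long recordsPerBatch(long batchBytes, long recordBytes) {
        return batchBytes / recordBytes;
    }

    // Records per second if one full batch is committed every commit interval.
    static long recordsPerSecond(long batchBytes, long recordBytes, long commitIntervalMs) {
        return recordsPerBatch(batchBytes, recordBytes) * 1000L / commitIntervalMs;
    }

    public static void main(String[] args) {
        long batchBytes = 4L * 1024 * 1024; // 4 MiB of uncommitted writes
        long recordBytes = 1024;            // assumed worst-case record size
        long commitIntervalMs = 100;        // commit.interval.ms default under EOS
        System.out.println(recordsPerBatch(batchBytes, recordBytes));                    // 4096
        System.out.println(recordsPerSecond(batchBytes, recordBytes, commitIntervalMs)); // 40960
    }
}
```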

For use cases that wouldn't reasonably fit in-memory, my suggestion is that
we have a mechanism that tracks the number/size of uncommitted records in
stores, and prematurely commits the Task when this size exceeds a
configured threshold.
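
One possible shape for such a mechanism is sketched below. It is an
assumption about how the tracking could look, not an existing Kafka Streams
class: UncommittedBytesTracker, recordWrite and onCommit are hypothetical
names, and the caller is assumed to trigger the early Task commit when
recordWrite returns true.

```java
// Hypothetical sketch of the proposed tracking; no such class exists in Kafka Streams.
final class UncommittedBytesTracker {
    private final long maxUncommittedBytes;
    private long uncommittedBytes;

    UncommittedBytesTracker(long maxUncommittedBytes) {
        this.maxUncommittedBytes = maxUncommittedBytes;
    }

    // Accounts for one buffered write; returns true once the configured
    // threshold is exceeded and the Task should commit early.
    boolean recordWrite(byte[] key, byte[] value) {
        uncommittedBytes += key.length + (value == null ? 0 : value.length);
        return uncommittedBytes > maxUncommittedBytes;
    }

    // Resets the counter after the Task has committed.
    void onCommit() {
        uncommittedBytes = 0;
    }

    long uncommittedBytes() {
        return uncommittedBytes;
    }

    public static void main(String[] args) {
        UncommittedBytesTracker tracker = new UncommittedBytesTracker(10);
        System.out.println(tracker.recordWrite(new byte[4], new byte[4])); // false, 8 bytes buffered
        System.out.println(tracker.recordWrite(new byte[4], new byte[4])); // true, 16 bytes buffered
        tracker.onCommit();
        System.out.println(tracker.uncommittedBytes()); // 0
    }
}
```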

Thanks for your time, and let me know what you think!
--
Nick

1: https://rocksdb.org/blog/2015/02/27/write-batch-with-index.html

On Thu, 6 Oct 2022 at 19:31, Alexander Sorokoumov
<asorokou...@confluent.io.invalid> wrote:

Hey Nick,

It is going to be option c. Existing state is considered to be committed
and there will be an additional RocksDB for uncommitted writes.

I am out of office until October 24. I will update KIP and make sure that
we have an upgrade test for that after coming back from vacation.

Best,
Alex

On Thu, Oct 6, 2022 at 5:06 PM Nick Telford <nick.telf...@gmail.com>
wrote:

Hi everyone,

I realise this has already been voted on and accepted, but it occurred to
me today that the KIP doesn't define the migration/upgrade path for
existing non-transactional StateStores that *become* transactional, i.e. by
adding the transactional boolean to the StateStore constructor.

What would be the result, when such a change is made to a Topology, without
explicitly wiping the application state?
a) An error.
b) Local state is wiped.
c) Existing RocksDB database is used as committed writes and new RocksDB
database is created for uncommitted writes.
d) Something else?

Regards,
Nick

On Thu, 1 Sept 2022 at 12:16, Alexander Sorokoumov
<asorokou...@confluent.io.invalid> wrote:

Hey Guozhang,

Sounds good. I annotated all added StateStore methods (commit, recover,
transactional) with @Evolving.

Best,
Alex



On Wed, Aug 31, 2022 at 7:32 PM Guozhang Wang <wangg...@gmail.com>
wrote:

Hello Alex,

Thanks for the detailed replies, I think that makes sense, and in the long
run we would need some public indicators from StateStore to determine if
checkpoints can really be used to indicate clean snapshots.

As for the @Evolving label, I think we can still keep it but for a
different reason, since as we add more state management functionalities in
the near future we may need to revisit the public APIs again and hence
keeping it as @Evolving would allow us to modify if necessary, in an easier
path than deprecate -> delete after several minor releases.

Besides that, I have no further comments about the KIP.


Guozhang

On Fri, Aug 26, 2022 at 1:51 AM Alexander Sorokoumov
<asorokou...@confluent.io.invalid> wrote:

Hey Guozhang,


I think that we will have to keep StateStore#transactional() because
post-commit checkpointing of non-txn state stores will break the guarantees
we want in ProcessorStateManager#initializeStoreOffsetsFromCheckpoint for
correct recovery. Let's consider checkpoint-recovery behavior under EOS
that we want to support:

1. Non-txn state stores should checkpoint on graceful shutdown and restore
from that checkpoint.

2. Non-txn state stores should delete local data during recovery after
crash failure.

3. Txn state stores should checkpoint on commit and on graceful shutdown.
These stores should roll back uncommitted changes instead of deleting all
local data.


#1 and #2 are already supported; this proposal adds #3. Essentially, we
have two parties at play here - the post-commit checkpointing in
StreamTask#postCommit and recovery in
ProcessorStateManager#initializeStoreOffsetsFromCheckpoint. Together, these
methods must allow all three workflows and prevent invalid behavior, e.g.,
non-txn stores should not checkpoint post-commit to avoid keeping
uncommitted data on recovery.


In the current state of the prototype, we checkpoint only txn state stores
post-commit under EOS using StateStore#transactional(). If we remove
StateStore#transactional() and always checkpoint post-commit,
ProcessorStateManager#initializeStoreOffsetsFromCheckpoint will have to
determine whether to delete local data. A non-txn implementation of
StateStore#recover can't detect if it has uncommitted writes, so its
default implementation must always return the same value, either true or
false, signaling whether it is restored into a valid committed-only state.
If StateStore#recover always returns true, we preserve uncommitted writes
and violate correctness. Otherwise,
ProcessorStateManager#initializeStoreOffsetsFromCheckpoint would always
delete local data even after a graceful shutdown.


With StateStore#transactional we avoid checkpointing non-txn state stores
and prevent that problem during recovery.
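
The three workflows above can be summarized as a small decision table. The
sketch below is one reading of this message, not code from the prototype:
CheckpointPolicy, RecoveryAction and both methods are hypothetical names,
and the recovery method covers only the EOS case.

```java
// Hypothetical summary of the checkpoint/recovery rules discussed above.
final class CheckpointPolicy {

    enum RecoveryAction { RESTORE_FROM_CHECKPOINT, WIPE_LOCAL_STATE, ROLL_BACK_UNCOMMITTED }

    // Post-commit checkpointing: under EOS only transactional stores may be
    // checkpointed after a commit; without EOS every store is checkpointed.
    static boolean checkpointAfterCommit(boolean eosEnabled, boolean transactional) {
        return !eosEnabled || transactional;
    }

    // Recovery under EOS, following workflows #1-#3.
    static RecoveryAction recoverUnderEos(boolean transactional, boolean cleanShutdown) {
        if (transactional) {
            // #3: roll back uncommitted changes instead of deleting local data.
            return RecoveryAction.ROLL_BACK_UNCOMMITTED;
        }
        // #1: a graceful shutdown wrote a checkpoint; #2: a crash did not.
        return cleanShutdown
            ? RecoveryAction.RESTORE_FROM_CHECKPOINT
            : RecoveryAction.WIPE_LOCAL_STATE;
    }

    public static void main(String[] args) {
        System.out.println(checkpointAfterCommit(true, false)); // false
        System.out.println(recoverUnderEos(false, false));      // WIPE_LOCAL_STATE
        System.out.println(recoverUnderEos(true, false));       // ROLL_BACK_UNCOMMITTED
    }
}
```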


Best,

Alex

On Fri, Aug 19, 2022 at 1:05 AM Guozhang Wang <wangg...@gmail.com>
wrote:

Hello Alex,

Thanks for the replies!

As long as we allow custom user implementations of that interface, we
should probably either keep that flag to distinguish between transactional
and non-transactional implementations or change the contract behind the
interface. What do you think?

Regarding this question, I thought that in the long run, we may always
write checkpoints regardless of txn v.s. non-txn stores, in which case we
would not need that `StateStore#transactional()`. But for now, to handle
backward compatibility edge cases, we still need to distinguish whether or
not to write checkpoints. Maybe I was mis-reading its purposes? If yes,
please let me know.


On Mon, Aug 15, 2022 at 7:56 AM Alexander Sorokoumov
<asorokou...@confluent.io.invalid> wrote:

Hey Guozhang,

Thank you for elaborating! I like your idea to introduce a StreamsConfig
specifically for the default store APIs. You mentioned Materialized, but I
think changes in StreamJoined follow the same logic.

I updated the KIP and the prototype according to your suggestions:
* Add a new StoreType and a StreamsConfig for transactional RocksDB.
* Decide whether Materialized/StreamJoined are transactional based on the
configured StoreType.
* Move RocksDBTransactionalMechanism to
org.apache.kafka.streams.state.internals to remove it from the proposal
scope.
* Add a flag in new Stores methods to configure a state store as
transactional. Transactional state stores use the default transactional
mechanism.
* The changes above made it possible to remove all changes to the
StoreSupplier interface.

I am not sure about marking StateStore#transactional() as evolving. As long
as we allow custom user implementations of that interface, we should
probably either keep that flag to distinguish between transactional and
non-transactional implementations or change the contract behind the
interface. What do you think?

Best,
Alex

On Thu, Aug 11, 2022 at 1:00 AM Guozhang Wang <wangg...@gmail.com>
wrote:

Hello Alex,

Thanks for the replies. Regarding the global config v.s. per-store spec, I
agree with John's early comments to some degree, but I think we may well
distinguish a couple scenarios here. In sum we are discussing the following
levels of per-store spec:

* Materialized#transactional()
* StoreSupplier#transactional()
* StateStore#transactional()
* Stores.persistentTransactionalKeyValueStore()...

And my thoughts are the following:

* In the current proposal users could specify transactional as either
"Materialized.as("storeName").withTransactionsEnabled()" or
"Materialized.as(Stores.persistentTransactionalKeyValueStore(..))", which
seems not necessary to me. In general, the more options the library
provides, the messier for users to learn the new APIs.

* When using built-in stores, users would usually go with
Materialized.as("storeName"). In such cases I feel it's not very meaningful
to specify "some of the built-in stores to be transactional, while others
be non transactional": as long as one of your stores is non-transactional,
you'd still pay for large restoration cost upon unclean failure. People
may, indeed, want to specify different transactional mechanisms to be used
across stores; but for whether or not the stores should be transactional, I
feel it's really an "all or none" answer, and our built-in form (rocksDB)
should support transactionality for all store types.


* When using customized stores, users would usually go with
Materialized.as(StoreSupplier). And it's possible that users would choose
some to be transactional while others non-transactional (e.g. if their
customized store only supports transactional for some store types, but not
others).

* At a per-store level, the library does not really care, or need to know
whether that store is transactional or not at runtime, except that, for
compatibility reasons, today we want to make sure the written checkpoint
files do not include those non-transactional stores. But this check would
eventually go away as one day we would always checkpoint files.


---------------------------

With all of that in mind, my gut feeling is that:

* Materialized#transactional(): we would not need this knob, since for
built-in stores I think just a global config should be sufficient (see
below), while for customized store users would need to specify that via the
StoreSupplier anyways and not through this API. Hence I think for either
case we do not need to expose such a knob on the Materialized level.


* Stores.persistentTransactionalKeyValueStore(): I think we could refactor
that function without introducing new constructors in the Stores factory,
but just add new overloads to the existing func name e.g.

```
persistentKeyValueStore(final String name, final boolean transactional)
```

Plus we can augment the storeImplType as introduced in
https://cwiki.apache.org/confluence/display/KAFKA/KIP-591%3A+Add+Kafka+Streams+config+to+set+default+state+store
as a syntax sugar for users, e.g.

```
public enum StoreImplType {
    ROCKS_DB,
    TXN_ROCKS_DB,
    IN_MEMORY
}
```

```
stream.groupByKey().count(Materialized.withStoreType(StoreImplType.TXN_ROCKS_DB));
```

The above provides this global config at the store impl type level.


* RocksDBTransactionalMechanism: I agree with Bruno that we would better
not expose this knob to users, but rather keep it purely as an impl detail
abstracted from the "TXN_ROCKS_DB" type. Over time we may, e.g. use
in-memory stores as the secondary stores with optional spill-to-disks when
we hit the memory limit, but all of those optimizations in the future
should be kept away from the users.

* StoreSupplier#transactional() / StateStore#transactional(): the first
flag is only used to be passed into the StateStore layer, for indicating if
we should write checkpoints; we could mark it as @evolving so that we can
one day remove it without a long deprecation period.


Guozhang








On Wed, Aug 10, 2022 at 8:04 AM Alexander Sorokoumov
<asorokou...@confluent.io.invalid> wrote:

Hey Guozhang, Bruno,

Thank you for your feedback. I am going to respond to both of you in a
single email. I hope it is okay.

@Guozhang,

We could, instead, have a global config to specify if the built-in stores
should be transactional or not.

This was the original approach I took in this proposal. Earlier in this
thread John, Sagar, and Bruno listed a number of issues with it. I tend to
agree with them that it is probably better user experience to control
transactionality via Materialized objects.

We could simplify our implementation for `commit`

Agreed! I updated the prototype and removed references to the commit marker
and rolling forward from the proposal.


@Bruno,

So, I would remove the details about the 2-state-store implementation from
the KIP or provide it as an example of a possible implementation at the end
of the KIP.

I moved the section about the 2-state-store implementation to the bottom of
the proposal and always mention it as a reference implementation. Please
let me know if this is okay.

Could you please describe the usage of commit() and recover() in the commit
workflow in the KIP as we did in this thread but independently from the
state store implementation?

I described how commit/recover change the workflow in the Overview section.


Best,
Alex

On Wed, Aug 10, 2022 at 10:07 AM Bruno Cadonna <cado...@apache.org>
wrote:

Hi Alex,

Thanks a lot for explaining!

Now some aspects are clearer to me.

While I understand now, how the state store can roll forward, I have the
feeling that rolling forward is specific to the 2-state-store
implementation with RocksDB of your PoC. Other state store implementations
might use a different strategy to react to crashes. For example, they might
apply an atomic write and effectively rollback if they crash before
committing the state store transaction. I think the KIP should not contain
such implementation details but provide an interface to accommodate rolling
forward and rolling backward.

So, I would remove the details about the 2-state-store implementation from
the KIP or provide it as an example of a possible implementation at the end
of the KIP.

Since a state store implementation can roll forward or roll back, I think
it is fine to return the changelog offset from recover(). With the returned
changelog offset, Streams knows from where to start state store
restoration.

Could you please describe the usage of commit() and recover() in the commit
workflow in the KIP as we did in this thread but independently from the
state store implementation? That would make things clearer. Additionally,
descriptions of failure scenarios would also be helpful.


Best,
Bruno


On 04.08.22 16:39, Alexander Sorokoumov wrote:

Hey Bruno,

Thank you for the suggestions and the clarifying questions. I believe that
they cover the core of this proposal, so it is crucial for us to be on the
same page.

1. Don't you want to deprecate StateStore#flush()?

Good call! I updated both the proposal and the prototype.

2. I would shorten Materialized#withTransactionalityEnabled() to
Materialized#withTransactionsEnabled().

Turns out, these methods are no longer necessary. I removed them from the
proposal and the prototype.

3. Could you also describe a bit more in detail where the offsets passed
into commit() and recover() come from?



The offset passed into StateStore#commit is the last offset committed to
the changelog topic. The offset passed into StateStore#recover is the last
checkpointed offset for the given StateStore. Let's look at steps 3 and 4
in the commit workflow. After the TaskExecutor/TaskManager commits, it
calls StreamTask#postCommit[1] that in turn:
a. updates the changelog offsets via
ProcessorStateManager#updateChangelogOffsets[2]. The offsets here come from
the RecordCollector[3], which tracks the latest offsets the producer sent
without exception[4, 5].
b. flushes/commits the state store in AbstractTask#maybeCheckpoint[6]. This
method essentially calls ProcessorStateManager methods flush/commit[7] and
checkpoint[8]. ProcessorStateManager#commit goes over all state stores that
belong to that task and commits them with the offset obtained in step `a`.
ProcessorStateManager#checkpoint writes down those offsets for all state
stores, except for non-transactional ones in the case of EOS.


During initialization, StreamTask calls
StateManagerUtil#registerStateStores[8] that in turn calls
ProcessorStateManager#initializeStoreOffsetsFromCheckpoint[9]. At the
moment, this method assigns checkpointed offsets to the corresponding state
stores[10]. The prototype also calls StateStore#recover with the
checkpointed offset and assigns the offset returned by recover()[11].


4. I do not quite understand how a state store can roll forward. You
mention in the thread the following:



The 2-state-stores commit looks like this [12]:

     1. Flush the temporary state store.
     2. Create a commit marker with a changelog offset corresponding to the
     state we are committing.
     3. Go over all keys in the temporary store and write them down to the
     main one.
     4. Wipe the temporary store.
     5. Delete the commit marker.


Let's consider crash failure scenarios:

     - Crash failure happens between steps 1 and 2. The main state store is
     in a consistent state that corresponds to the previously checkpointed
     offset. StateStore#recover throws away the temporary store and proceeds
     from the last checkpointed offset.
     - Crash failure happens between steps 2 and 3. We do not know what keys
     from the temporary store were already written to the main store, so we
     can't roll back. There are two options - either wipe the main store or
     roll forward. Since the point of this proposal is to avoid situations
     where we throw away the state and we do not care to what consistent
     state the store rolls to, we roll forward by continuing from step 3.
     - Crash failure happens between steps 3 and 4. We can't distinguish
     between this and the previous scenario, so we write all the keys from
     the temporary store. This is okay because the operation is idempotent.
     - Crash failure happens between steps 4 and 5. Again, we can't
     distinguish between this and previous scenarios, but the temporary
     store is already empty. Even though we write all keys from the
     temporary store, this operation is, in fact, no-op.
     - Crash failure happens between step 5 and checkpoint. This is the case
     you referred to in question 5. The commit is finished, but it is not
     reflected at the checkpoint. recover() returns the offset of the
     previous commit here, which is incorrect, but it is okay because we
     will replay the changelog from the previously committed offset. As
     changelog replay is idempotent, the state store recovers into a
     consistent state.
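
The commit steps and crash scenarios above can be sketched with two
in-memory maps standing in for the two RocksDB instances. This is a
simplified illustration, not the prototype's code: TwoStoreSketch and its
members are hypothetical names, step 1 (the flush) has no in-memory
equivalent, and deletes are ignored.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the 2-state-store commit and recovery.
final class TwoStoreSketch {
    final Map<String, String> mainStore = new HashMap<>();
    final Map<String, String> tempStore = new HashMap<>();
    Long commitMarker; // changelog offset of the in-flight commit, or null if none

    // Uncommitted writes only ever go to the temporary store.
    void put(String key, String value) {
        tempStore.put(key, value);
    }

    void commit(long changelogOffset) {
        commitMarker = changelogOffset; // step 2: create the commit marker
        finishCommit();
    }

    private void finishCommit() {
        mainStore.putAll(tempStore);    // step 3: idempotent copy into the main store
        tempStore.clear();              // step 4: wipe the temporary store
        commitMarker = null;            // step 5: delete the commit marker
    }

    // Returns the changelog offset the main store is known to correspond to.
    long recover(long checkpointedOffset) {
        if (commitMarker == null) {
            // No commit was in flight: discard uncommitted writes.
            tempStore.clear();
            return checkpointedOffset;
        }
        // A commit was in flight: roll forward by redoing steps 3-5.
        long committedOffset = commitMarker;
        finishCommit();
        return committedOffset;
    }

    public static void main(String[] args) {
        TwoStoreSketch store = new TwoStoreSketch();
        store.put("a", "1");
        store.commit(10L);
        store.put("a", "2");
        store.commitMarker = 20L;               // simulate a crash right after step 2
        System.out.println(store.recover(10L)); // rolls forward and returns the marker offset
        System.out.println(store.mainStore);
    }
}
```

Because the copy in step 3 is idempotent, recover() does not need to know
how far the interrupted commit got before the crash.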


The last crash failure scenario is a natural transition to

how should Streams know what to write into the checkpoint file after the
crash?

As mentioned above, the Streams app writes the checkpoint file after the
Kafka transaction and then the StateStore commit. Same as without the
proposal, it should write the committed offset, as it is the same for both
the Kafka changelog and the state store.

This issue arises because we store the offset outside of the state store.
Maybe we need an additional method on the state store interface that
returns the offset at which the state store is.

In my opinion, we should include in the interface only the guarantees that
are necessary to preserve EOS without wiping the local state. This way, we
allow more room for possible implementations. Thanks to the idempotency of
the changelog replay, it is "good enough" if StateStore#recover returns an
offset that is less than what it actually is. The only limitation here is
that the state store should never commit writes that are not yet committed
in the Kafka changelog.

Please let me know what you think about this. First of all, I am relatively
new to the codebase, so I might be wrong in my understanding of how it
works. Second, while writing this, it occurred to me that the
StateStore#recover interface method is not as straightforward as it can be.
Maybe we can change it like that:

/**
 * Recover a transactional state store
 * <p>
 * If a transactional state store shut down with crash failure, this
 * method ensures that the state store is in a consistent state that
 * corresponds to {@code changelogOffset} or later.
 *
 * @param changelogOffset the checkpointed changelog offset.
 * @return {@code true} if recovery succeeded, {@code false} otherwise.
 */
boolean recover(final Long changelogOffset);

Note: all links below except for [10] lead to the prototype's code.

1. https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamTask.java#L468
2. https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamTask.java#L580
3. https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamTask.java#L868
4. https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/ProcessorStateManager.java#L94-L96
5. https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/RecordCollectorImpl.java#L213-L216
6. https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/AbstractTask.java#L94-L97
7. https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/ProcessorStateManager.java#L469
8. https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamTask.java#L226
9. https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/StateManagerUtil.java#L103
10. https://github.com/apache/kafka/blob/0c4da23098f8b8ae9542acd7fbaa1e5c16384a39/streams/src/main/java/org/apache/kafka/streams/processor/internals/ProcessorStateManager.java#L251-L252
11. https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/processor/internals/ProcessorStateManager.java#L250-L265
12. https://github.com/apache/kafka/blob/549e54be95a8e1bae1e97df2c21d48c042ff356e/streams/src/main/java/org/apache/kafka/streams/state/internals/AbstractTransactionalStore.java#L84-L88


Best,
Alex

On Fri, Jul 29, 2022 at 3:42 PM Bruno Cadonna <cado...@apache.org>
wrote:

Hi Alex,

Thanks for the updates!

1. Don't you want to deprecate StateStore#flush()? As far as I understand,
commit() is the new flush(), right? If you do not deprecate it, you don't
get rid of the error room you describe in your KIP by having a flush() and
a commit().


2. I would shorten Materialized#withTransactionalityEnabled() to
Materialized#withTransactionsEnabled().


3. Could you also describe a bit more in detail where the offsets passed
into commit() and recover() come from?


For my next two points, I need the commit workflow that you were so kind to
post into this thread:

1. write stuff to the state store
2. producer.sendOffsetsToTransaction(token); producer.commitTransaction();
3. flush (<- that would be call to commit(), right?)
4. checkpoint


4. I do not quite understand how a state store can roll forward. You
mention in the thread the following:

"If the crash failure happens during #3, the state store can roll forward
and finish the flush/commit."

How does the state store know where it stopped the flushing when it
crashed?

This seems an optimization to me. I think in general the state store should
rollback to the last successfully committed state and restore from there
until the end of the changelog topic partition. The last committed state is
the offsets in the checkpoint file.



5. In the same e-mail from point 4, you also state:

"If the crash failure happens between #3 and #4, the state store should do
nothing during recovery and just proceed with the checkpoint."

How should Streams know that the failure was between #3 and #4 during
recovery? It just sees a valid state store and a valid checkpoint file.
Streams does not know that the state of the checkpoint file does not match
with the committed state of the state store.
Also, how should Streams know what to write into the checkpoint file after
the crash?
This issue arises because we store the offset outside of the state store.
Maybe we need an additional method on the state store interface that
returns the offset at which the state store is.



Best,
Bruno




On 27.07.22 11:51, Alexander Sorokoumov wrote:

Hey Nick,

Thank you for the kind words and the feedback! I'll definitely add an
option to configure the transactional mechanism in Stores factory method
via an argument as John previously suggested and might add the in-memory
option via RocksDB Indexed Batches if I figure out why their creation via
rocksdb jni fails with `UnsatisfiedLinkError`.


Best,
Alex

On Wed, Jul 27, 2022 at 11:46 AM Alexander Sorokoumov
<asorokou...@confluent.io> wrote:

Hey Guozhang,

1) About the param passed into the `recover()` function: it seems to me
that the semantics of "recover(offset)" is: recover this state to a
transaction boundary which is at least the passed-in offset. And the only
possibility that the returned offset is different than the passed-in offset
is that if the previous failure happens after we've done all the commit
procedures except writing the new checkpoint, in which case the returned
offset would be larger than the passed-in offset. Otherwise it should
always be equal to the passed-in offset, is that right?

Right now, the only case when `recover` returns an offset different from
the passed one is when the failure happens *during* commit.

If the failure happens after commit but before the checkpoint, `recover`
might return either the passed or a newer committed offset, depending on
the implementation. The `recover` implementation in the prototype returns
the passed offset because it deletes the commit marker that holds that
offset after the commit is done. In that case, the store will replay the
last commit from the changelog. I think it is fine as the changelog replay
is idempotent.

2) It seems the only use for the "transactional()" function is to determine
if we can update the checkpoint file while in EOS.

Right now, there are 2 other uses for `transactional()`:

1. To determine what to do during initialization if the checkpoint is gone
(see [1]). If the state store is transactional, we don't have to wipe the
existing data. Thinking about it now, we do not really need this check
whether the store is `transactional` because if it is not, we'd not have
written the checkpoint in the first place. I am going to remove that check.

2. To determine if the persistent kv store in KStreamImplJoin should be
transactional (see [2], [3]).

I am not sure if we can get rid of the checks in point 2. If so, I'd be
happy to encapsulate `transactional()` logic in `commit/recover`.


Best,
Alex

1. https://github.com/apache/kafka/pull/12393/files#diff-971d9ef7ea8aefffff687fc7ee131bd166ced94445f4ab55aa83007541dccfdaL256-R281
2. https://github.com/apache/kafka/pull/12393/files#diff-9ce43046fdef1233ab762e728abd1d3d44d7c270b28dcf6b63aa31a93a30af07R266-R278
3. https://github.com/apache/kafka/pull/12393/files#diff-9ce43046fdef1233ab762e728abd1d3d44d7c270b28dcf6b63aa31a93a30af07R348-R354


On Tue, Jul 26, 2022 at 6:39 PM Nick Telford <nick.telf...@gmail.com>
wrote:

Hi Alex,

Excellent proposal, I'm very keen to see this land!

Would it be useful to permit configuring the type of store used for
uncommitted offsets on a store-by-store basis? This way, users could choose
whether to use, e.g. an in-memory store or RocksDB, potentially reducing
the overheads associated with RocksDB for smaller stores, but without the
memory pressure issues?

I suspect that in most cases, the number of uncommitted records will be
very small, because the default commit interval is 100ms.


Regards,
Nick

On Tue, 26 Jul 2022 at 01:36, Guozhang Wang <wangg...@gmail.com>
wrote:

Hello Alex,

Thanks for the updated KIP, I looked over it and browsed the WIP and just
have a couple meta thoughts:

1) About the param passed into the `recover()` function: it seems to me
that the semantics of "recover(offset)" is: recover this state to a
transaction boundary which is at least the passed-in offset. And the only
possibility that the returned offset is different than the passed-in offset
is that if the previous failure happens after we've done all the commit
procedures except writing the new checkpoint, in which case the returned
offset would be larger than the passed-in offset. Otherwise it should
always be equal to the passed-in offset, is that right?

2) It seems the only use for the "transactional()" function is to determine
if we can update the checkpoint file while in EOS. But the purpose of the
checkpoint file's offsets is just to tell "the local state's current
snapshot's progress is at least the indicated offsets" anyways, and with
this KIP maybe we would just do:

a) when in ALOS, upon failover: we set the starting offset as
checkpointed-offset, then restore() from changelog till the end-offset.
This way we may restore some records twice.
b) when in EOS, upon failover: we first call recover(checkpointed-offset),
then set the starting offset as the returned offset (which may be larger
than checkpointed-offset), then restore until the end-offset.


So why not also:
c) we let the `commit()` function also return an offset, which indicates
"checkpointable offsets".
d) for existing non-transactional stores, we just have a default
implementation of "commit()" which is simply a flush, and returns a
sentinel value like -1. Then later if we get checkpointable offsets -1, we
do not write the checkpoint. Upon clean shutting down we can just
checkpoint regardless of the returned value from "commit".
e) for existing non-transactional stores, we just have a default
implementation of "recover()" which is to wipe out the local store and
return offset 0 if the passed in offset is -1, otherwise if not -1 then it
indicates a clean shutdown in the last run, and this function can just be a
no-op.

In that case, we would not need the "transactional()" function anymore,
since for non-transactional stores their behaviors are still wrapped in the
`commit / recover` function pairs.
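
A minimal Java sketch of points (c) to (e), to make the sentinel idea
concrete. `SketchStore`, `InMemorySketchStore`, `NO_CHECKPOINT`, `wipe()` and
`shouldWriteCheckpoint()` are hypothetical names used only for illustration;
this is not the Kafka Streams `StateStore` interface and not the code in the
WIP PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface for illustration only; not the real StateStore API.
interface SketchStore {
    long NO_CHECKPOINT = -1L; // sentinel: "this offset must not be checkpointed"

    void flush();
    void wipe();

    // (d) Default commit for a non-transactional store: just flush, and
    // report that there is no checkpointable offset.
    default long commit(long changelogOffset) {
        flush();
        return NO_CHECKPOINT;
    }

    // (e) Default recover: without a checkpoint, wipe local state and restore
    // from offset 0; with one, the last shutdown was clean, so do nothing.
    default long recover(long checkpointedOffset) {
        if (checkpointedOffset == NO_CHECKPOINT) {
            wipe();
            return 0L;
        }
        return checkpointedOffset;
    }
}

class InMemorySketchStore implements SketchStore {
    final Map<String, String> data = new HashMap<>();
    int flushes = 0;

    @Override public void flush() { flushes++; }
    @Override public void wipe() { data.clear(); }
}

class CommitRecoverSketch {
    // Caller-side rule from (d): skip the checkpoint on the sentinel value.
    static boolean shouldWriteCheckpoint(long committedOffset) {
        return committedOffset != SketchStore.NO_CHECKPOINT;
    }

    public static void main(String[] args) {
        InMemorySketchStore store = new InMemorySketchStore();
        store.data.put("k", "v");
        long committed = store.commit(200L);
        System.out.println(shouldWriteCheckpoint(committed));          // false
        System.out.println(store.recover(SketchStore.NO_CHECKPOINT));  // 0
        System.out.println(store.data.isEmpty());                      // true
    }
}
```

A transactional store would override both defaults and return a real
changelog offset from `commit()`, so the caller needs no separate
`transactional()` check.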

I have not completed the thorough pass on your WIP PR, so maybe I could
come up with some more feedback later, but just let me know if my
understanding above is correct or not?


Guozhang




On Thu, Jul 14, 2022 at 7:01 AM Alexander Sorokoumov
<asorokou...@confluent.io.invalid> wrote:

Hi,

I updated the KIP with the following changes:
* Replaced in-memory batches with the secondary-store approach as the
default implementation to address the feedback about memory pressure as
suggested by Sagar and Bruno.
* Introduced StateStore#commit and StateStore#recover methods as an
extension of the rollback idea. @Guozhang, please see the comment below on
why I took a slightly different approach than you suggested.
* Removed mentions of changes to IQv1 and IQv2. Transactional state stores
enable reading committed in IQ, but it is really an independent feature
that deserves its own KIP. Conflating them unnecessarily increases the
scope for discussion, implementation, and testing in a single unit of work.

I also published a prototype - https://github.com/apache/kafka/pull/12393
that implements changes described in the proposal.


Regarding explicit rollback, I think it is a powerful idea that allows
other StateStore implementations to take a different path to the
transactional behavior rather than keep 2 state stores. Instead of
introducing a new commit token, I suggest using a changelog offset that
already 1:1 corresponds to the materialized state. This works nicely
because Kafka Streams first commits an AK transaction and only then
checkpoints the state store, so we can use the changelog offset to commit
the state store transaction.

I called the method StateStore#recover rather than StateStore#rollback
because a state store might either roll back or forward depending on the
specific point of the crash failure. Consider the write algorithm in Kafka
Streams:
1. write stuff to the state store
2. producer.sendOffsetsToTransaction(token); producer.commitTransaction();
3. flush
4. checkpoint

Let's consider 3 cases:
1. If the crash failure happens between #2 and #3, the state store rolls
back and replays the uncommitted transaction from the changelog.
2. If the crash failure happens during #3, the state store can roll forward
and finish the flush/commit.
3. If the crash failure happens between #3 and #4, the state store should
do nothing during recovery and just proceed with the checkpoint.
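
The three cases can be sketched in Java as a function from the crash point
to the offset that restoration continues from. `CrashPoint`, `RecoverSketch`
and the two offset parameters are hypothetical names for illustration, and
how a real store detects which case it is in is left out; this is an
assumption-laden sketch, not the prototype's `StateStore#recover`.

```java
// Hypothetical crash points, named after the numbered write steps above.
enum CrashPoint {
    AFTER_TXN_COMMIT_BEFORE_FLUSH,  // between #2 and #3
    DURING_FLUSH,                   // during #3
    AFTER_FLUSH_BEFORE_CHECKPOINT   // between #3 and #4
}

class RecoverSketch {
    // Returns the changelog offset the store is consistent with after
    // recovery, i.e. where changelog restoration should resume.
    static long recover(CrashPoint crashPoint, long checkpointedOffset, long committedOffset) {
        switch (crashPoint) {
            case AFTER_TXN_COMMIT_BEFORE_FLUSH:
                // Case 1: roll back; the writes are replayed from the changelog.
                return checkpointedOffset;
            case DURING_FLUSH:
                // Case 2: roll forward by finishing the interrupted flush.
                return committedOffset;
            case AFTER_FLUSH_BEFORE_CHECKPOINT:
                // Case 3: nothing to do; only the checkpoint was missing.
                return committedOffset;
            default:
                throw new IllegalArgumentException("unknown crash point: " + crashPoint);
        }
    }

    public static void main(String[] args) {
        for (CrashPoint p : CrashPoint.values()) {
            System.out.println(p + " -> resume from " + recover(p, 100L, 200L));
        }
    }
}
```

In cases 2 and 3 the returned offset is larger than the checkpointed one,
which is why the method is named for recovery rather than rollback.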


Looking forward to your feedback,
Alexander

On Wed, Jun 8, 2022 at 12:16 AM Alexander Sorokoumov
<asorokou...@confluent.io> wrote:

Hi,

As a status update, I did the following changes to the KIP:
* replaced configuration via the top-level config with configuration via
Stores factory and StoreSuppliers,
* added IQv2 and elaborated how readCommitted will work when the store is
not transactional,
* removed claims about ALOS.

I am going to be OOO in the next couple of weeks and will resume working on
the proposal and responding to the discussion in this thread starting
June 27. My next top priorities are:
1. Prototype the rollback approach as suggested by Guozhang.
2. Replace in-memory batches with the secondary-store approach as the
default implementation to address the feedback about memory pressure as
suggested by Sagar and Bruno.
3. Adjust Stores methods to make transactional implementations pluggable.
4. Publish the POC for the first review.

Best regards,
Alex

On Wed, Jun 1, 2022 at 2:52 PM Guozhang Wang <wangg...@gmail.com> wrote:

Alex,

Thanks for your replies! That is very helpful.

Just to broaden our discussions a bit here, I think there are some other
approaches in parallel to the idea of "enforce to only persist upon
explicit flush" and I'd like to throw one here -- not really advocating it,
but just for us to compare the pros and cons:

1) We let the StateStore's `flush` function return a token instead of
returning `void`.
2) We add another `rollback(token)` interface of StateStore which would
effectively rollback the state as indicated by the token to the snapshot
when the corresponding `flush` is called.
3) We encode the token and commit as part of
`producer#sendOffsetsToTransaction`.

Users could optionally implement the new functions, or they can just not
return the token at all and not implement the second function. Again, the
APIs are just for the sake of illustration, not feeling they are the most
natural :)

Then the procedure would be:

1. the previous checkpointed offset is 100
...
3. flush store, make sure all writes are persisted; get the returned token
that indicates the snapshot of 200.
4. producer.sendOffsetsToTransaction(token); producer.commitTransaction();
5. Update the checkpoint file (say, the new value is 200).

Then if there's a failure, say between 3/4, we would get the token from the
last committed txn, and first we would do the restoration (which may get
the state to somewhere between 100 and 200), then call
`store.rollback(token)` to rollback to the snapshot of offset 100.
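
A minimal Java sketch of the `flush` returns a token / `rollback(token)`
idea. `SnapshotStoreSketch` is a hypothetical class, the token is assumed to
be the changelog offset, and copying the whole map on every flush is only
for illustration; a real store would need a far cheaper snapshot mechanism.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical store for illustration; not a Kafka Streams class.
class SnapshotStoreSketch {
    private Map<String, String> live = new HashMap<>();
    private final Map<Long, Map<String, String>> snapshots = new HashMap<>();

    void put(String key, String value) { live.put(key, value); }
    String get(String key) { return live.get(key); }

    // flush() returns a token identifying the persisted snapshot.
    // Assumption: the token is simply the changelog offset at flush time.
    long flush(long changelogOffset) {
        snapshots.put(changelogOffset, new HashMap<>(live));
        return changelogOffset;
    }

    // rollback(token) restores the state captured by the matching flush().
    void rollback(long token) {
        Map<String, String> snapshot = snapshots.get(token);
        if (snapshot == null) {
            throw new IllegalArgumentException("unknown token: " + token);
        }
        live = new HashMap<>(snapshot);
    }

    public static void main(String[] args) {
        SnapshotStoreSketch store = new SnapshotStoreSketch();
        store.put("a", "1");
        long committedToken = store.flush(100L); // committed with the Kafka txn
        store.put("a", "2");
        store.flush(200L);                       // crash before the txn commits
        store.rollback(committedToken);          // token read from last committed txn
        System.out.println(store.get("a"));      // 1
    }
}
```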


The pros is that we would then not need to enforce the state stores to not
persist any data during the txn: for stores that may not be able to
implement the `rollback` function, they can still reduce its impl to "not
persisting any data" via this API, but for stores that can indeed support
the rollback, their implementation may be more efficient. The cons though,
on top of my head are 1) more complicated logic differentiating between EOS
with and without store rollback support, and ALOS, 2) encoding the token as
part of the commit offset is not ideal if it is big, 3) the recovery logic
including the state store is also a bit more complicated.



Guozhang





On Wed, Jun 1, 2022 at 1:29 PM Alexander Sorokoumov
<asorokou...@confluent.io.invalid> wrote:

Hi Guozhang,

But I'm still trying to clarify how it guarantees EOS, and it seems that we
would achieve it by enforcing to not persist any data written within this
transaction until step 4. Is that correct?

This is correct. Both alternatives - in-memory WriteBatchWithIndex and
transactionality via the secondary store guarantee EOS by not persisting
data in the "main" state store until it is committed in the changelog
topic.

Oh what I meant is not what KStream code does, but that StateStore impl
classes themselves could potentially flush data to become persisted
asynchronously

Thank you for elaborating! You are correct, the underlying state store
should not persist data until the streams app calls StateStore#flush. There
are 2 options how a State Store implementation can guarantee that - either
keep uncommitted writes in memory or be able to roll back the changes that
were not committed during recovery. RocksDB's WriteBatchWithIndex is an
implementation of the first option. A considered alternative, Transactions
via Secondary State Store for Uncommitted Changes, is the way to implement
the second option.

As everyone correctly pointed out, keeping uncommitted data in memory
introduces a very real risk of OOM that we will need to handle. The more I
think about it, the more I lean towards going with the Transactions via
Secondary Store as the way to implement transactionality as it does not
have that issue.

Best,
Alex


On Wed, Jun 1, 2022 at 12:59 PM Guozhang Wang <wangg...@gmail.com> wrote:

Hello Alex,

we flush the cache, but not the underlying state store.

You're right. The ordering I mentioned above is actually:

...
3. producer.sendOffsetsToTransaction(); producer.commitTransaction();
4. flush store, make sure all writes are persisted.
5. Update the checkpoint file to 200.

But I'm still trying to clarify how it guarantees EOS, and it seems that we
would achieve it by enforcing to not persist any data written within this
transaction until step 4. Is that correct?

Can you please point me to the place in the codebase where we trigger async
flush before the commit?

Oh what I meant is not what KStream code does, but that StateStore impl
classes themselves could potentially flush data to become persisted
asynchronously, e.g. RocksDB does that naturally out of the control of
KStream code. I think it is related to my previous question: if we think by
guaranteeing EOS at the state store level, we would effectively ask the
impl classes that "you should not persist any data until `flush` is called
explicitly", is the StateStore interface the right level to enforce such
mechanisms, or should we just do that on top of the StateStores, e.g.
during the transaction we just keep all the writes in the cache (of course
we need to consider how to work around memory pressure as previously
mentioned), and then upon committing, we just write the cached records as a
whole into the store and then call flush.
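
The "keep all the writes in the cache, then write them as a whole upon
committing" alternative can be sketched in Java as below.
`BufferedStoreSketch` is a hypothetical class, a plain `HashMap` stands in
for the persistent store, and deletes and memory limits are deliberately
ignored; it only illustrates the layering, not how Kafka Streams caches
records.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical buffering layer on top of a store; for illustration only.
class BufferedStoreSketch {
    private final Map<String, String> store = new HashMap<>();             // stand-in for the persistent store
    private final Map<String, String> uncommitted = new LinkedHashMap<>(); // writes of the open transaction

    // Writes only touch the buffer while the transaction is open.
    void put(String key, String value) { uncommitted.put(key, value); }

    // Reads see the task's own uncommitted writes first.
    String get(String key) {
        return uncommitted.containsKey(key) ? uncommitted.get(key) : store.get(key);
    }

    // Reads that must not observe uncommitted data go straight to the store.
    String getCommitted(String key) { return store.get(key); }

    // On commit, apply the buffered records as a whole; a real
    // implementation would then flush the underlying store.
    void commit() {
        store.putAll(uncommitted);
        uncommitted.clear();
    }

    // On abort or crash recovery the buffer is simply dropped.
    void abort() { uncommitted.clear(); }

    public static void main(String[] args) {
        BufferedStoreSketch s = new BufferedStoreSketch();
        s.put("k", "v1");
        System.out.println(s.get("k"));           // v1
        System.out.println(s.getCommitted("k"));  // null
        s.commit();
        System.out.println(s.getCommitted("k"));  // v1
    }
}
```

The buffer grows with the size of the open transaction, which is the memory
pressure concern raised earlier in the thread.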


Guozhang







On Tue, May 31, 2022 at 4:08 PM Alexander Sorokoumov
<asorokou...@confluent.io.invalid> wrote:

Hey,

Thank you for the wealth of great suggestions and questions! I am going to
address the feedback in batches and update the proposal async, as it is
probably going to be easier for everyone. I will also write a separate
message after making updates to the KIP.

@John,

Did you consider instead just adding the option to the
RocksDB*StoreSupplier classes and the factories in Stores ?

Thank you for suggesting that. I think that this idea is better than what I
came up with and will update the KIP with configuring transactionality via
the suppliers and Stores.

what is the advantage over just doing the same thing with the RecordCache
and not introducing the WriteBatch at all?

Can you point me to RecordCache? I can't find it in the project. The
advantage would be that WriteBatch guarantees write atomicity. As far as I
understood the way RecordCache works, it might leave the system in an
inconsistent state during crash failure on write.

You mentioned that a transactional store can help reduce duplication in the
case of ALOS

I will remove claims about ALOS from the proposal. Thank you for
elaborating!

As a reminder, we have a new IQv2 mechanism now. Should we propose any
changes to IQv1 to support this transactional mechanism, versus just
proposing it for IQv2? Certainly, it seems strange only to propose a change
for IQv1 and not v2.

I will update the proposal with complementary API changes for IQv2

What should IQ do if I request to readCommitted on a non-transactional
store?

We can assume that non-transactional stores commit on write, so IQ works in
the same way with non-transactional stores regardless of the value of
readCommitted.


@Guozhang,

* If we crash between line 3 and 4, then at that time the local persistent
store image is representing as of offset 200, but upon recovery all
changelog records from 100 to log-end-offset would be considered as aborted
and not be replayed and we would restart processing from position 100.
Restart processing will violate EOS. I'm not sure how e.g. RocksDB's
WriteBatchWithIndex would make sure that the step 4 and step 5 could be
done atomically here.

Could you please point me to the place in the codebase where a task flushes
the store before committing the transaction? Looking at TaskExecutor
(https://github.com/apache/kafka/blob/4c9eeef5b2dff9a4f0977fbc5ac7eaaf930d0d0e/streams/src/main/java/org/apache/kafka/streams/processor/internals/TaskExecutor.java#L144-L167),
StreamTask#prepareCommit
(https://github.com/apache/kafka/blob/4c9eeef5b2dff9a4f0977fbc5ac7eaaf930d0d0e/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamTask.java#L398),
and CachedStateStore
(https://github.com/apache/kafka/blob/4c9eeef5b2dff9a4f0977fbc5ac7eaaf930d0d0e/streams/src/main/java/org/apache/kafka/streams/state/internals/CachedStateStore.java#L29-L34)
we flush the cache, but not the underlying state store. Explicit
StateStore#flush happens in AbstractTask#maybeWriteCheckpoint
(https://github.com/apache/kafka/blob/4c9eeef5b2dff9a4f0977fbc5ac7eaaf930d0d0e/streams/src/main/java/org/apache/kafka/streams/processor/internals/AbstractTask.java#L91-L99).
Is there something I am missing here?

Today all cached data that have not been flushed are not committed for
sure, but even flushed data to the persistent underlying store may also be
uncommitted since flushing can be triggered asynchronously before the
commit.

Can you please point me to the place in the codebase where we trigger async
flush before the commit? This would certainly be a reason to introduce a
dedicated StateStore#commit method.

Thanks again for the feedback. I am going to update the KIP and then
respond to the next batch of questions and suggestions.


Best,
Alex

On Mon, May 30, 2022 at 5:13 PM Suhas Satish <ssat...@confluent.io.invalid>
wrote:

Thanks for the KIP proposal Alex.
1. Configuration default

You mention applications using streams DSL with built-in rocksDB state
store will get transactional state stores by default when EOS is enabled,
but the default implementation for apps using PAPI will fallback to
non-transactional behavior.
Shouldn't we have the same default behavior for both types of apps - DSL
and PAPI?

On Mon, May 30, 2022 at 2:11 AM Bruno Cadonna <cado...@apache.org> wrote:

Thanks for the PR, Alex!

I am also glad to see this coming.


1. Configuration

I would also prefer to restrict the configuration of transactional on the
state store. Ideally, calling method transactional() on the state store
would be enough. An option on the store builder would make it possible to
turn transactionality on and off (as John proposed).


2. Memory usage in RocksDB

This seems to be a major issue. We do not have any guarantee that
uncommitted writes fit into memory and I guess we will never have. What
happens when the uncommitted writes do not fit into memory? Does RocksDB
throw an exception? Can we handle such an exception without crashing?

Does the RocksDB behavior even need to be included in this KIP? In the end
it is an implementation detail.

What we should consider - though - is a memory limit in some form. And what
we do when the memory limit is exceeded.


3. PoC

I agree with Guozhang that a PoC is a good idea to better understand the
devils in the details.


Best,
Bruno

On 25.05.22 01:52, Guozhang Wang wrote:

Hello Alex,

Thanks for writing the proposal! Glad to see it coming. I think this is the
kind of KIP where too many devils would be buried in the details, so it's
better to start working on a POC, either in parallel, or before we resume
our discussion, rather than blocking any implementation until we are
satisfied with the proposal.

Just as a concrete example, I personally am still not 100% clear how the
proposal would work to achieve EOS with the state stores. For example, the
commit procedure today looks like this:

0: there's an existing checkpoint file indicating the changelog offset of
the local state store image is 100. Now a commit is triggered:
1. flush cache (since it contains partially processed records), make sure
all records are written to the producer.
2. flush producer, making sure all changelog records have now been acked.
// here we would get the new changelog position, say 200
3. flush store, make sure all writes are persisted.
4. producer.sendOffsetsToTransaction(); producer.commitTransaction();
// we would make the writes in changelog up to offset 200 committed
5. Update the checkpoint file to 200.

The question about atomicity between those lines, for example:

* If we crash between line 4 and line 5, the local checkpoint file would
stay as 100, and upon recovery we would replay the changelog from 100 to
200. This is not ideal but does not violate EOS, since the changelogs are
all overwrites anyways.
* If we crash between line 3 and 4, then at that time the local persistent
store image is representing as of offset 200, but upon recovery all
changelog records from 100 to log-end-offset would be considered as aborted
and not be replayed and we would restart processing from position 100.
Restart processing will violate EOS. I'm not sure how e.g. RocksDB's
WriteBatchWithIndex would make sure that the step 4 and step 5 could be
done atomically here.
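
The two crash scenarios can be traced with a small Java model of steps 3 to
5. `CommitOrderSketch` and its methods are hypothetical names, and the
offsets 100 and 200 are the example values from the procedure above; this
models only the three offsets involved, not any real Kafka Streams state.

```java
// Hypothetical model of the commit procedure; for illustration only.
class CommitOrderSketch {
    // Returns {storeOffset, committedChangelogOffset, checkpointOffset}
    // as they would look if the task crashed right after the given step.
    static long[] stateAfterCrash(int lastCompletedStep) {
        long store = 100L, committed = 100L, checkpoint = 100L;
        if (lastCompletedStep >= 3) store = 200L;       // 3. flush store
        if (lastCompletedStep >= 4) committed = 200L;   // 4. commit the Kafka transaction
        if (lastCompletedStep >= 5) checkpoint = 200L;  // 5. update the checkpoint file
        return new long[] {store, committed, checkpoint};
    }

    // The local store contains writes that the changelog treats as aborted.
    static boolean storeAheadOfCommitted(long[] state) {
        return state[0] > state[1];
    }

    public static void main(String[] args) {
        // Crash between 3 and 4: the store is at 200 but only 100 is committed.
        System.out.println(storeAheadOfCommitted(stateAfterCrash(3))); // true
        // Crash between 4 and 5: only the checkpoint lags; replaying 100..200 is safe.
        System.out.println(storeAheadOfCommitted(stateAfterCrash(4))); // false
    }
}
```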

Originally what I was thinking when creating the JIRA ticket is that we
need to let the state store provide a transactional API like
"token commit()" used in step 4) above which returns a token, that e.g. in
our example above indicates offset 200, and that token would be written as
part of the records in Kafka transaction in step 5). And upon recovery the
state store would have another API like "rollback(token)" where the token
is read from the latest committed txn, and be used to rollback the store to
that committed image. I think your proposal is different, and it seems like
you're proposing we swap step 3) and 4) above, but the atomicity issue
still remains since now you may have the store image at 100 but the
changelog is committed at 200. I'd like to learn more about the details on
how it resolves such issues.

Anyways, that's just an example to make the point that there are lots of
implementational details which would drive the public API design, and we
should probably first do a POC, and come back to discuss the KIP. Let me
know what you think?


Guozhang









On Tue, May 24, 2022 at 10:35 AM Sagar <sagarmeansoc...@gmail.com> wrote:

Hi Alexander,

Thanks for the KIP! This seems like a great proposal. I have the same
opinion as John on the Configuration part though. I think the 2 level
config and its behaviour based on the setting/unsetting of the flag seems
confusing to me as well. Since the KIP seems specifically centred around
RocksDB it might be better to add it at the Supplier level as John
suggested.

On similar lines, this config name => *statestore.transactional.mechanism*
may also need rethinking as the value assigned to it (rocksdb_indexbatch)
implicitly seems to assume that rocksdb is the only statestore that Kafka
Streams supports while that's not the case.

Also, regarding the potential memory pressure that can be introduced by
WriteBatchWithIndex, do you think it might make more sense to include some
numbers/benchmarks on how much the memory consumption might increase?

Lastly, the read_uncommitted flag's behaviour on IQ may need more
elaboration.

These points aside, as I said, this is a great proposal!


Thanks!
Sagar.

On Tue, May 24, 2022 at 10:35 PM John Roesler <vvcep...@apache.org> wrote:

Thanks for the KIP, Alex!

I'm really happy to see your proposal. This improvement fills a
long-standing gap.

I have a few questions:

1. Configuration
The KIP only mentions RocksDB, but of course, Streams also ships with an
InMemory store, and users also plug in their own custom state stores. It is
also common to use multiple types of state stores in the same application
for different purposes.

Against this backdrop, the choice to configure transactionality as a
top-level config, as well as to configure the store transaction mechanism
as a top-level config, seems a bit off.

Did you consider instead just adding the option to the
RocksDB*StoreSupplier classes and the factories in Stores? It seems like
the desire to enable the feature by default, but with a feature-flag to
disable it was a factor here. However, as you pointed out, there are some
major considerations that users should be aware of, so opt-in doesn't seem
like a bad choice, either. You could add an Enum argument to those
factories like `RocksDBTransactionalMechanism.{NONE,

Some points in favor of this approach:
* Avoid "stores that don't support transactions ignore the config"
complexity
* Users can choose how to spend their memory budget, making some stores
transactional and others not
* When we add transactional support to in-memory stores, we don't have to
figure out what to do with the mechanism config (i.e., what do you set the
mechanism to when there are multiple kinds of transactional stores in the
topology?)

2. caching/flushing/transactions
The coupling between memory usage and flushing that you mentioned is a bit
troubling. It also occurs to me that there seems to be some relationship
with the existing record cache, which is also an in-memory holding area for
records that are not yet written to the cache and/or store (albeit with no
particular semantics). Have you considered how all these components should
relate? For example, should a "full" WriteBatch actually trigger a flush so
that we don't get OOMEs? If the proposed transactional mechanism forces all
uncommitted writes to be buffered in memory, until a commit, then what is
the advantage over just doing the same thing with the RecordCache and not
introducing the WriteBatch at all?

3. ALOS
You mentioned that a transactional store can help reduce duplication in the
case of ALOS. We might want to be careful about claims like that.
Duplication isn't the way that repeated processing manifests in state
stores. Rather, it is in the form of dirty reads during reprocessing. This
feature may reduce the incidence of dirty reads during reprocessing, but
not in a predictable way. During regular processing today, we will send
some records through to the changelog in between commit intervals. Under
ALOS, if any of those dirty writes gets committed to the changelog topic,
then upon failure, we have to roll the store forward to them anyway,
regardless of this new transactional mechanism. That's a fixable problem,
by the way, but this KIP doesn't seem to fix it. I wonder if we should make
any claims about the relationship of this feature to ALOS if the real-world
behavior is so complex.

4. IQ
As a reminder, we have a new IQv2 mechanism now. Should we propose any
changes to IQv1 to support this transactional mechanism, versus just
proposing it for IQv2? Certainly, it seems strange only to propose a change
for IQv1 and not v2.

Regarding your proposal for IQv1, I'm unsure what the behavior should be
for readCommitted, since the current behavior also reads out of the
RecordCache. I guess if readCommitted==false, then we will continue to read
from the cache first, then the Batch, then the store; and if
readCommitted==true, we would skip the cache and the Batch and only read
from the persistent RocksDB store?
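
The read path guessed at above can be sketched in Java as a three-layer
lookup. `LayeredReadSketch` and its three maps are hypothetical stand-ins
for the record cache, the uncommitted batch, and the persistent store; this
illustrates the question being asked, not the KIP's actual IQ behavior.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical layered read path; for illustration only.
class LayeredReadSketch {
    final Map<String, String> cache = new HashMap<>(); // stand-in for the record cache
    final Map<String, String> batch = new HashMap<>(); // stand-in for the uncommitted batch
    final Map<String, String> store = new HashMap<>(); // stand-in for the persistent store

    Optional<String> get(String key, boolean readCommitted) {
        if (!readCommitted) {
            // Uncommitted reads consult the newest layers first.
            if (cache.containsKey(key)) return Optional.ofNullable(cache.get(key));
            if (batch.containsKey(key)) return Optional.ofNullable(batch.get(key));
        }
        // Committed reads skip the cache and the batch entirely.
        return Optional.ofNullable(store.get(key));
    }

    public static void main(String[] args) {
        LayeredReadSketch s = new LayeredReadSketch();
        s.store.put("k", "committed");
        s.batch.put("k", "in-batch");
        s.cache.put("k", "in-cache");
        System.out.println(s.get("k", false).get()); // in-cache
        System.out.println(s.get("k", true).get());  // committed
    }
}
```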

What should IQ do if I request to readCommitted on a non-transactional
store?

Thanks again for proposing the KIP, and my apologies for the long reply;
I'm hoping to air all my concerns in one "batch" to save time for you.


Thanks,
-John

On Tue, May 24, 2022, at 03:45, Alexander Sorokoumov wrote:

Hi all,

I've written a KIP for making Kafka Streams state stores transactional and
would like to start a discussion:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-844%3A+Transactional+State+Stores


Best,
Alex


