Re: [DISCUSS] KIP-1035: StateStore managed changelog offsets

Matthias J. Sax Mon, 22 Apr 2024 23:15:16 -0700

Thanks for splitting out this KIP. The discussion shows, that it is acomplex beast by itself, so worth to discuss by its own.


Couple of question / comment:

100 `StateStore#commit()`: The JavaDoc says "must not be called byusers" -- I would propose to put a guard in place for this, by eitherthrowing an exception (preferable) or adding a no-op implementation (atleast for our own stores, by wrapping them -- we cannot enforce it forcustom stores I assume), and document this contract explicitly.

101 adding `.position` to the store: Why do we actually need this? TheKIP says "To ensure consistency with the committed data and changelogoffsets" but I am not sure if I can follow? Can you elaborate whyleaving the `.position` file as-is won't work?

If it's possible at all, it will need to be done by
creating temporary StateManagers and StateStores during rebalance. I think
it is possible, and probably not too expensive, but the devil will be in
the detail.

This sounds like a significant overhead to me. We know that opening asingle RocksDB takes about 500ms, and thus opening RocksDB to get thisinformation might slow down rebalances significantly.

102: It's unclear to me, how `.position` information is added. The KIPonly says: "position offsets will be stored in RocksDB, in the samecolumn family as the changelog offsets". Do you intent to add thisinformation to the map passed via `commit(final Map<TopicPartition,Long> changelogOffsets)`? The KIP should describe this in more detail.Also, if my assumption is correct, we might want to rename the parameterand also have a better JavaDoc description?

103: Should we make it mandatory (long-term) that all stores (includingcustom stores) manage their offsets internally? Maintaining both optionsand thus both code paths puts a burden on everyone and make the codemessy. I would strongly prefer if we could have mid-term path to get ridof supporting both. -- For this case, we should deprecate the newlyadded `managesOffsets()` method right away, to point out that we intendto remove it. If it's mandatory to maintain offsets for stores, we won'tneed this method any longer. In memory stores can just return null from#committedOffset().

104 "downgrading": I think it might be worth to add support fordowngrading w/o the need to wipe stores? Leveraging `upgrade.from`parameter, we could build a two rolling bounce downgrade: (1) the newcode is started with `upgrade.from` set to a lower version, telling theruntime to do the cleanup on `close()` -- (ie, ensure that all data iswritten into `.checkpoint` and `.position` file, and the newly added CLis deleted). In a second, rolling bounce, the old code would be able toopen RocksDB. -- I understand that this implies much more work, butdowngrade seems to be common enough, that it might be worth it? Even ifwe did not always support this in the past, we have the face the factthat KS is getting more and more adopted and as a more mature productshould support this?





-Matthias







On 4/21/24 11:58 PM, Bruno Cadonna wrote:

Hi all,

How should we proceed here?

1. with the plain .checkpoint file

2. with a way to use the state store interface on unassigned but locallyexisting task state

While I like option 2, I think option 1 is less risky and will give usthe benefits of transactional state stores sooner. We should considerthe interface approach afterwards, though.



Best,
Bruno



On 4/17/24 3:15 PM, Bruno Cadonna wrote:

Hi Nick and Sophie,

I think the task ID is not enough to create a state store that canread the offsets of non-assigned tasks for lag computation duringrebalancing. The state store also needs the state directory so that itknows where to find the information that it needs to return fromchangelogOffsets().

In general, I think we should proceed with the plain .checkpoint filefor now and iterate back to the state store solution later since itseems it is not that straightforward. Alternatively, Nick couldtimebox an effort to better understand what would be needed for thestate store solution. Nick, let us know your decision.

Regarding your question about the state store instance. I am not toofamiliar with that part of the code, but I think the state store isbuild when the processor topology is build and the processor topologyis build per stream task. So there is one instance of processortopology and state store per stream task. Try to follow the call in [1].


Best,
Bruno

[1]https://github.com/apache/kafka/blob/f52575b17225828d2ff11996030ab7304667deab/streams/src/main/java/org/apache/kafka/streams/processor/internals/ActiveTaskCreator.java#L153




On 4/16/24 8:59 PM, Nick Telford wrote:

That does make sense. The one thing I can't figure out is how per-Task
StateStore instances are constructed.

It looks like we construct one StateStore instance for the wholeTopology(in InternalTopologyBuilder), and pass that intoProcessorStateManager (via

StateManagerUtil) for each Task, which then initializes it.

This can't be the case though, otherwise multiple partitions of the same
sub-topology (aka Tasks) would share the same StateStore instance, which
they don't.

What am I missing?

On Tue, 16 Apr 2024 at 16:22, Sophie Blee-Goldman<sop...@responsive.dev>

wrote:

I don't think we need to *require* a constructor accept the TaskId,but we

would definitely make sure that the RocksDB state store changes its
constructor to one that accepts the TaskID (which we can do without

deprecation since its an internal API), and custom state stores canjust

decide for themselves whether they want to opt-in/use the TaskId param
or not. I mean custom state stores would have to opt-in anyways by
implementing the new StoreSupplier#get(TaskId) API and the only
reason to do that would be to have created a constructor that accepts
a TaskId

Just to be super clear about the proposal, this is what I had in mind.
It's actually fairly simple and wouldn't add much to the scope of the

KIP (I think -- if it turns out to be more complicated than I'massuming,

we should definitely do whatever has the smallest LOE to get this done

Anyways, the (only) public API changes would be to add this new
method to the StoreSupplier API:

default T get(final TaskId taskId) {
     return get();
}

We can decide whether or not to deprecate the old #get but it's not
really necessary and might cause a lot of turmoil, so I'd personally
say we just leave both APIs in place.

And that's it for public API changes! Internally, we would just adapt
each of the rocksdb StoreSupplier classes to implement this new
API. So for example with the RocksDBKeyValueBytesStoreSupplier,
we just add

@Override
public KeyValueStore<Bytes, byte[]> get(final TaskId taskId) {
     return returnTimestampedStore ?
         new RocksDBTimestampedStore(name, metricsScope(), taskId) :
         new RocksDBStore(name, metricsScope(), taskId);
}

And of course add the TaskId parameter to each of the actual
state store constructors returned here.

Does that make sense? It's entirely possible I'm missing something
important here, but I think this would be a pretty small addition that
would solve the problem you mentioned earlier while also being
useful to anyone who uses custom state stores.

On Mon, Apr 15, 2024 at 10:21 AM Nick Telford <nick.telf...@gmail.com>
wrote:

Hi Sophie,

Interesting idea! Although what would that mean for the StateStore
interface? Obviously we can't require that the constructor take the

TaskId.

Is it enough to add the parameter to the StoreSupplier?

Would doing this be in-scope for this KIP, or are we over-complicating

it?

Nick
On Fri, 12 Apr 2024 at 21:30, Sophie Blee-Goldman<sop...@responsive.dev
wrote:
Somewhat minor point overall, but it actually drives me crazy thatyou
can't get access to the taskId of a StateStore until #init is called.
This
has caused me a huge headache personally (since the same is true for
processors and I was trying to do something that's probably too hacky

to

actually complain about here lol)
Can we just change the StateStoreSupplier to receive and passalong thetaskId when creating a new store? Presumably by adding a newversion of
the
#get method that takes in a taskId parameter? We can have itdefault to
invoking the old one for compatibility reasons and it should be
completely
safe to tack on.

Would also prefer the same for a ProcessorSupplier, but that's

definitely

outside the scope of this KIP

On Fri, Apr 12, 2024 at 3:31 AM Nick Telford <nick.telf...@gmail.com>
wrote:

On further thought, it's clear that this can't work for one simple

reason:

StateStores don't know their associated TaskId (and hence, their
StateDirectory) until the init() call. Therefore, committedOffset()

can't

be called before init(), unless we also added a StateStoreContext

argument

to committedOffset(), which I think might be trying to shoehorn too

much

into committedOffset().

I still don't like the idea of the Streams engine maintaining the

cache

of

changelog offsets independently of stores, mostly because of the
maintenance burden of the code duplication, but it looks like we'll

have

to

live with it.

Unless you have any better ideas?

Regards,
Nick

On Wed, 10 Apr 2024 at 14:12, Nick Telford <nick.telf...@gmail.com>

wrote:

Hi Bruno,

Immediately after I sent my response, I looked at the codebase and

came

to

the same conclusion. If it's possible at all, it will need to be

done

by

creating temporary StateManagers and StateStores during rebalance.

think

it is possible, and probably not too expensive, but the devil will

be

in

the detail.

I'll try to find some time to explore the idea to see if it's

possible

and

report back, because we'll need to determine this before we can

vote

on

the

KIP.

Regards,
Nick

On Wed, 10 Apr 2024 at 11:36, Bruno Cadonna <cado...@apache.org>

wrote:

Hi Nick,

Thanks for reacting on my comments so quickly!


2.
Some thoughts on your proposal.
State managers (and state stores) are parts of tasks. If the task

is

not

assigned locally, we do not create those tasks. To get the offsets

with

your approach, we would need to either create kind of inactive

tasks

besides active and standby tasks or store and manage state

managers

of

non-assigned tasks differently than the state managers of assigned
tasks. Additionally, the cleanup thread that removes unassigned

task

directories needs to concurrently delete those inactive tasks or
task-less state managers of unassigned tasks. This seems all quite

messy

to me.
Could we create those state managers (or state stores) for locally
existing but unassigned tasks on demand when
TaskManager#getTaskOffsetSums() is executed? Or have a different
encapsulation for the unused task directories?


Best,
Bruno



On 4/10/24 11:31 AM, Nick Telford wrote:

Hi Bruno,

Thanks for the review!

1, 4, 5.
Done

3.
You're right. I've removed the offending paragraph. I had

originally

adapted this from the guarantees outlined in KIP-892. But it's

difficult to

provide these guarantees without the KIP-892 transaction

buffers.

Instead,

we'll add the guarantees back into the JavaDoc when KIP-892

lands.


2.
Good point! This is the only part of the KIP that was

(significantly)

changed when I extracted it from KIP-892. My prototype currently

maintains

this "cache" of changelog offsets in .checkpoint, but doing so

becomes

very

messy. My intent with this change was to try to better

encapsulate

this

offset "caching", especially for StateStores that can cheaply

provide

the

offsets stored directly in them without needing to duplicate

them

in

this

cache.

It's clear some more work is needed here to better encapsulate

this.

My

immediate thought is: what if we construct *but don't

initialize*

the

StateManager and StateStores for every Task directory on-disk?

That

should

still be quite cheap to do, and would enable us to query the

offsets

for

all on-disk stores, even if they're not open. If the

StateManager

(aka.

ProcessorStateManager/GlobalStateManager) proves too expensive

to

hold

open

for closed stores, we could always have a "StubStateManager" in

its

place,

that enables the querying of offsets, but nothing else?

IDK, what do you think?

Regards,

Nick

On Tue, 9 Apr 2024 at 15:00, Bruno Cadonna <cado...@apache.org>

wrote:

Hi Nick,

Thanks for breaking out the KIP from KIP-892!

Here a couple of comments/questions:

1.
In Kafka Streams, we have a design guideline which says to not

use

the

"get"-prefix for getters on the public API. Could you please

change

getCommittedOffsets() to committedOffsets()?


2.
It is not clear to me how TaskManager#getTaskOffsetSums()

should

read

offsets of tasks the stream thread does not own but that have a

state

directory on the Streams client by calling
StateStore#getCommittedOffsets(). If the thread does not own a

task

it

does also not create any state stores for the task, which means

there

is

no state store on which to call getCommittedOffsets().
I would have rather expected that a checkpoint file is written

for

all

state stores on close -- not only for the RocksDBStore -- and

that

this

checkpoint file is read in TaskManager#getTaskOffsetSums() for

the

tasks

that have a state directory on the client but are not currently

assigned

to any stream thread of the Streams client.


3.
In the javadocs for commit() you write

"... all writes since the last commit(Map), or since

init(StateStore)

*MUST* be available to readers, even after a restart."

This is only true for a clean close before the restart, isn't

it?

If the task fails with a dirty close, Kafka Streams cannot

guarantee

that the in-memory structures of the state store (e.g. memtable

in

the

case of RocksDB) are flushed so that the records and the

committed

offsets are persisted.


4.
The wrapper that provides the legacy checkpointing behavior is

actually

an implementation detail. I would remove it from the KIP, but

still

state that the legacy checkpointing behavior will be supported

when

the

state store does not manage the checkpoints.


5.
Regarding the metrics, could you please add the tags, and the

recording

level (DEBUG or INFO) as done in KIP-607 or KIP-444.


Best,
Bruno

On 4/7/24 5:35 PM, Nick Telford wrote:

Hi everyone,

Based on some offline discussion, I've split out the "Atomic

Checkpointing"

section from KIP-892: Transactional Semantics for StateStores,

into

its

own

KIP

KIP-1035: StateStore managed changelog offsets

https://cwiki.apache.org/confluence/display/KAFKA/KIP-1035%3A+StateStore+managed+changelog+offsets


While KIP-892 was adopted *with* the changes outlined in

KIP-1035,

these

changes were always the most contentious part, and continued

to

spur

discussion even after KIP-892 was adopted.

All the changes introduced in KIP-1035 have been removed from

KIP-892,

and

a hard dependency on KIP-1035 has been added to KIP-892 in

their

place.


I'm hopeful that with some more focus on this set of changes,

we

can

deliver something that we're all happy with.

Regards,
Nick

Re: [DISCUSS] KIP-1035: StateStore managed changelog offsets

Reply via email to