Thanks, Sagar and Bruno, for your insights and comments!
Sagar: Can we name according to the semantics that you want to
support like `getAsOf` or something like that? I am not sure if we do that
in our codebase though. Maybe the experts can chime in.
Because it is a new method that will be added, we should be able to name it
whatever we like. I agree `getAsOf` is more clear, albeit wordier.
Introducing `getAsOf(key, timestamp)` means we could leave open `get(key,
timeFrom, timeTo)` to have an exclusive `timeTo` without introducing a
collision. (We could introduce `getBetween(key, timeFrom, timeTo)` instead
to delineate even more clearly, though this is better left for a future
KIP.)
I don't think there's any existing precedent in codebase to follow here but
I'll leave that to the experts. Curious to hear what others prefer as well.
Sagar: With delete, we would stlll keep the older versions of the key
right?
We could certainly choose this for the semantics of delete(...) -- and it
sounds like we should too, based on Bruno's confirmation below that this
feels more natural to him as well -- but as Bruno noted in his message
below I think we'll want the method signature to be `delete(key,
timestamp)` then, so that there is an explicit timestamp to associate with
the deletion. In other words, `delete(key, timestamp)` has the same effect
as `put(key, null, timestamp)`. The only difference is that the `put(...)`
method has a `void` return type, while `delete(key, timestamp)` can have
`ValueAndTimestamp` as return type in order to return the record which is
replaced (if any). In other words, `delete(key, timestamp)` is equivalent
to `put(key, null, timestamp)` followed by `get(key, timestamp)`.
Bruno: I would also not change the semantics so that it deletes all
versions of
a key. I would rather add a new method purge(key) or
deleteAllVersions(key) or similar if we want to have such a method in
this first KIP.
Makes sense; I'm convinced. Let's defer
`purge(key)`/`deleteAllVersions(key)` to a future KIP. If there's agreement
that `delete(key, timestamp)` (as described above) is valuable, we can keep
it in this first KIP even though it is syntactic sugar. If this turns into
a larger discussion, we can defer this to a future KIP as well.
Bruno: I would treat the history retention as a strict limit. [...] You
could also add historyRetentionMs() to the VersionedKeyValueStore<K, V>
interface to make the concept of the history retention part of the
interface.
OK. That's the second vote for rewording the javadoc for
`VersionedKeyValueStore#get(key, timestampTo)` to remove the parenthetical
and clarify that history retention should be used to dictate this case, so
I'll go ahead and do that. I'll leave out adding `historyRetentionMs()` to
the interface for now, though, for the sake of consistency with other
stores (e.g., window stores) which don't expose similar types of
configurations from their interfaces.
Bruno: exclusive vs inclusive regarding validTo timestamp in get().
Doesn't this decision depend on the semantics of the join for which this
state store should be used?
Yes, you are correct. As a user I would expect that a stream-side record
with the same timestamp as a table-side record _would_ produce a join
result, which is consistent with the proposal for timestampTo to be
inclusive. (FWIW I tried this out with a Flink temporal join just now and
observed this result as well. Not sure where to look for other standards to
validate this expectation.)
Bruno: If Streams does not update min.compaction.lag.ms during
rebalances,
users have to do it each time they change history retention in the code,
right? That seems odd to me. What is the actual reason for not updating
the config? How does Streams handle updates to windowed stores?
Yes, users will have to update min.compaction.lag.ms for the changelog
topic themselves if they update history retention in their code. This is
consistent with what happens for window stores today: e.g., if a user
updates grace period for a windowed aggregation, then they are responsible
for updating retention.ms on their windowed changelog topic as well.
I'm not familiar with the historical context around why this is the case --
Matthias, do you know?
My best guess is that Streams does not want to interfere with any potential
out-of-band changes by the user between application restarts, though I'm
not sure why a user would want to change this specific config to a value
which does not accord with the specified history retention. I notice that
there is code for validating topic configs and collecting validation errors
(
https://github.com/apache/kafka/blob/be032735b39360df1a6de1a7feea8b4336e5bcc0/streams/src/main/java/org/apache/kafka/streams/processor/internals/InternalTopicManager.java#L318-L319
)
but this method is not called from anywhere, even though there are unit
tests for it. I was unable to find history of this validation after a quick
search. Hopefully Matthias (or others) has context, otherwise I will have a
closer look.
- Victoria
On Wed, Nov 23, 2022 at 8:52 AM Bruno Cadonna <cado...@apache.org> wrote:
Hi all,
Thanks for the KIP, Victoria!
I have a couple of comments.
1. delete(key)
I think delete(key) should not remove all versions of a key. We should
use it to close the validity interval of the last version.
Assuming we have records of different versions for key A:
(A, e, 0, 2),
(A, f, 2, 3),
(A, g, 3, MAX)
delete(A) would update them to
(A, e, 0, 2),
(A, f, 2, 3),
(A, g, 3, 5)
(A, null, 5, MAX)
But then the question arises where does timestamp 5 that closes the
interval in (A, g, 3, 5) and opens the interval in (A, null, 5, MAX)
come from. We could use the timestamp at which delete(A) is called, but
actually I do not like that because it seems to me it opens the doors to
non-determinism. If we use event time for put() we should also use it
for delete(). Actually, put(A, null, 5) would have the same effect as
delete(A) in the example above. As a syntactical sugar, we could add
delete(key, validFrom). (I just realized now that I just repeated what
Victoria said in her previous e-mail.)
I agree with Victoria that delete(A) as defined for other state stores
is hard to re-use in the versioned key-value store.
I would also not change the semantics so that it deletes all versions of
a key. I would rather add a new method purge(key) or
deleteAllVersions(key) or similar if we want to have such a method in
this first KIP.
2. history retention
I would remove "(up to store implementation discretion when this is the
case)". I would treat the history retention as a strict limit. If users
want to implement a less strict behavior, they can still do it. Maybe
mention in the javadocs the implications of not adhering strictly to the
history retention. That is, the DSL might become non-deterministic. You
could also add historyRetentionMs() to the VersionedKeyValueStore<K, V>
interface to make the concept of the history retention part of the
interface.
3. null vs. exception for out-of-bound queries
I am in favor of null. The record version is not there anymore because
it expired. This seems to me normal and nothing exceptional. That would
also consistent with the behavior of other APIs as already mentioned.
4. Exposing segmentInterval
Since we have evidence that the segment interval affects performance, I
would expose it. But I find it also OK to expose it once we have a
corresponding metric.
5. exclusive vs inclusive regarding validTo timestamp in get()
Doesn't this decision depend on the semantics of the join for which this
state store should be used? Should a record on the table side that has
the same timestamp as the record on the stream side join? Or should only
records in the table that are strictly before the record on the stream
side join?
6. Not setting min.compaction.lag.ms during rebalances
If Streams does not update min.compaction.lag.ms during rebalances,
users have to do it each time they change history retention in the code,
right? That seems odd to me. What is the actual reason for not updating
the config? How does Streams handle updates to windowed stores? That
should be a similar situation for the retention time config of the
changelog topic.
Best,
Bruno
On 23.11.22 09:11, Sagar wrote:
Hi Vicky,
Thanks for your response!
I would just use numbers to refer to your comments.
1) Thanks for your response. Even I am not totally sure whether these
should be supported via IQv2 or via store interface. That said, I
wouldn't
definitely qualify this as blocking the KIP for sure so we can live
without it :)
2) Yeah if the 2 APIs for get have different semantics for timestampTo,
then it could be confusing. I went through the link for temporal tables
(TFS!) and I now get why the AS OF semantics would have it inclusive. I
think part of the problem is that the name get on it's own is not as
expressive as SQL. Can we name according to the semantics that you want
to
support like `getAsOf` or something like that? I am not sure if we do
that
in our codebase though. Maybe the experts can chime in.
3) hmm I would have named it `validUpto` But again not very picky about
it.
After going through the link and your KIP, it's a lot clearer to me.
4) I think delete(key) should be sufficient. With delete, we would
stlll keep the older versions of the key right?
Thanks!
Sagar.
On Wed, Nov 23, 2022 at 12:17 AM Victoria Xia
<victoria....@confluent.io.invalid> wrote:
Thanks, Matthias and Sagar, for your comments! I've responded here for
now,
and will update the KIP afterwards with the outcome of our discussions
as
they resolve.
----------- Matthias's comments -----------
(1) Why does the new store not extend KeyValueStore, but StateStore?
In the end, it's a KeyValueStore?
A `VersionedKeyValueStore<K, V>` is not a `KeyValueStore<K, V>`
because
many of the KeyValueStore methods would not make sense for a versioned
store. For example, `put(K key, V value)` is not meaningful for a
versioned
store because the record needs a timestamp associated with it.
A `VersionedKeyValueStore<K, V>` is more similar to a
`KeyValueStore<K,
ValueAndTimestamp<V>>` (i.e., `TimestampedKeyValueStore<K, V>`), but
some
of the TimestampedKeyValueStore methods are still problematic. For
example,
what does it mean for `delete(K key)` to have return type
`ValueAndTimestamp<V>`? Does this mean that `delete(K key)` only
deletes
(and returns) the latest record version for the key? Probably we want
a
versioned store to have `delete(K key)` delete all record versions for
the
given key, in which case the return type is better suited as an
iterator/collection of KeyValueTimestamp. `putIfAbsent(K key,
ValueAndTimestamp value)` also has ambiguous semantics for versioned
stores
(i.e., what does it mean for the key/record to be "absent").
I agree that conceptually a versioned key-value store is just a
key-value
store, though. In the future if we redesign the store interfaces, it'd
be
great to unify them by having a more generic KeyValueStore interface
that
allows for extra flexibility to support different types of key-value
stores, including versioned stores. (Or, if you can think of a way to
achieve this with the existing interfaces today, I'm all ears!)
(2) Should we have a ReadOnlyVersionedKeyValueStore? Even if we don't
want to support IQ in this KIP, it might be good to add this interface
right away to avoid complications for follow up KIPs? Or won't there
by
any complications anyway?
I don't think there will be complications for refactoring to add this
interface in the future. Refactoring out
ReadOnlyVersionedKeyValueStore
from VersionedKeyValueStore would leave VersionedKeyValueStore
unchanged
from the outside.
Also, is it true that the ReadOnlyKeyValueStore interface is only used
for
IQv1 and not IQv2? I think it's an open question as to whether we
should
support IQv1 for versioned stores or only IQv2. If the latter, then
maybe
we won't need the extra interface at all.
(3) Why do we not have a `delete(key)` method? I am ok with not
supporting all methods from existing KV-store, but a `delete(key)`
seems
to be fundamentally to have?
What do you think the semantics of `delete(key)` should be for
versioned
stores? Should `delete(key)` delete (and return) all record versions
for
the key? Or should we have `delete(key, timestamp)` which is
equivalent
to
`put(key, null, timestamp)` except with a return type to return
ValueAndTimestamp representing the record it replaced?
If we have ready alignment on what the interface and semantics for
`delete(key)` should be, then adding it in this KIP sounds good. I
just
didn't want the rest of the KIP to be hung up over additional
interfaces,
given that we can always add extra interfaces in the future.
(4a) Do we need `get(key)`? It seems to be the same as `get(key,
MAX_VALUE)`? Maybe is good to have as syntactic sugar though? Just for
my own clarification (should we add something to the JavaDocs?).
Correct, it is just syntactic sugar. I will add a clarification into
the
Javadocs as you've suggested.
(4b) Should we throw an exception if a user queries out-of-bound
instead of returning `null` (in `get(key,ts)`)?
-> You put it into "rejected alternatives", and I understand your
argument. Would love to get input from others about this question
though. -- It seems we also return `null` for windowed stores, so
maybe
the strongest argument is to align to existing behavior? Or do we have
case for which the current behavior is problematic?
Sure; curious to hear what others think as well.
(4c) JavaDoc on `get(key,ts)` says: "(up to store implementation
discretion when this is the case)" -> Should we make it a stricter
contract such that the user can reason about it better (there is WIP
to
make retention time a strict bound for windowed stores atm)
-> JavaDocs on `persistentVersionedKeyValueStore` seems to
suggest a
strict bound, too.
Ah, great question. I think the question boils down to: do we want to
require that all versioned stores (including custom user
implementations)
use "history retention" to determine when to expire old record
versions?
Because the `persistentVersionedKeyValueStore(...)` method returns
instances of the provided RocksDB-based versioned store
implementation,
which does use history retention for this purpose, that's why we can
very
clearly say that for this store, `get(key, ts)` will return null if
the
provided timestamp bound has fallen out of history retention. The
reason I
left the `VersionedKeyValueStore#get(key, ts)` Javadoc more generic
(i.e.,
does not mention history retention) is because maybe a user
implementing
their own custom store will choose a different expiry mechanism, e.g.,
keep
the three latest versions for each key regardless of how old the
timestamps
are.
If we want to require that all versioned stores use history retention
in
order to determine when to expire old records, then I will certainly
update
the Javadoc to clarify. This is already a requirement for DSL users
because
the VersionedBytesStoreSupplier interface requires history retention
to
be
provided (in order for changelog topic configs to be properly set), so
it's
just a question of whether we also want to require PAPI users to use
history retention too. I had a look at the existing window stores and
didn't see precedent for requiring all window stores have a standard
"retention time" concept for how long to keep windows, but if we want
to
have a standard "history retention" concept for versioned stores we
certainly can. WDYT?
(5a) Do we need to expose `segmentInterval`? For windowed-stores, we
also use segments but hard-code it to two (it was exposed in earlier
versions but it seems not useful, even if we would be open to expose
it
again if there is user demand).
If we want to leave it out of this first KIP (and potentially expose
it
in
the future), that works for me. The performance benchmarks I ran
suggest
that this parameter greatly impacts store performance though and is
very
workload dependent. If a user reported poor performance using
versioned
stores for their workload, this is the first parameter I would want to
tune. That said, metrics/observability for versioned stores (which
would be
helpful for determining how this parameter should be adjusted) have
been
deferred to a follow-up KIP, so perhaps that's reason to defer
exposing
this parameter as well.
(5b) JavaDocs says: "Performance degrades as more record versions for
the same key are collected in a single segment. On the other hand,
out-of-order writes and reads which access older segments may slow
down
if there are too many segments." -- Wondering if JavaDocs should make
any statements about expected performance? Seems to be an
implementation
detail?
I included this sentence to explain why a user might want to tune this
value / help guide how to think about the parameter, but if we want to
remove it entirely (per the discussion point above) then this Javadoc
will
be removed with it.
(6) validTo timestamp is "exclusive", right? Ie, if I query
`get(key,ts[=validToV1])` I would get `null` or the "next" record v2
with validFromV2=ts?
I actually intended for it to be inclusive (will update the KIP). Do
you
think exclusive is more intuitive? The reason I had inclusive in my
mind is
because it's like a "AS OF <time>" query, which treats the time bound
as
inclusive.
(7) The KIP says, that segments are stores in the same RocksDB -- for
this case, how are efficient deletes handled? For windowed-store, we
can
just delete a full RocksDB.
The way that multiple segments are represented in the same RocksDB is
that
the RocksDB keys are prefixed with segment ID. An entire segment is
deleted
with a single `deleteRange()` call to RocksDB.
(8) Rejected alternatives: you propose to not return the validTo
timestamp -- if we find it useful in the future to return it, would
there be a clean path to change it accordingly?
With the current proposal, there's no clean path. If we think there's
a
good chance we might want to do this in the future, then we should
update
the proposed interfaces.
The current proposed return type from `VersionedKeyValueStore<K,
V>#get(key, tsTo)` is `ValueAndTimestamp<V>`. There's no way to add a
second timestamp into `ValueAndTimestamp<V>`, which is why there's no
clean
path to include validTo timestamp in the future under the existing
proposal.
If we wanted to allow for including validTo timestamp in the future,
we'd
instead update the return type to be a new `VersionedRecord<V>`
object.
Today a `VersionedRecord<V>` could just include `value` and
`timestamp`,
and in the future we could add `validTo` (names subject to change)
into
the
`VersionedRecord` as well. (It'd look a little strange for now since
VersionedRecord is the same as ValueAndTimestamp, but that seems
fine.)
If we choose to do this, I think we should also update the return type
of
`VersionedKeyValueStore#get(key)` to be VersionedRecord as well,
rather
than having one return TimestampAndValue while the other returns
VersionedRecord.
----------- Sagar's comments -----------
1) Did you consider adding a method similar to :
List<ValueAndTimeStamp<V>> get(K key, long from, long to)?
I think this could be useful considering that this
versioning scheme unlocks time travel at a key basis. WDYT?
Yes, I do think this method is valuable. I think we will definitely
want to
support time-range based queries at some point (hopefully soon), and
likely
also key-range based queries (to achieve feature parity with existing
key-value stores).
It's not immediately clear to me whether these types of queries should
be
supported as part of the store interface or if they should only be
supported via the `query(...)` method for IQv2. (It's an open question
as
to whether we should support IQv1 for versioned stores or only IQv2. A
benefit of IQv2 over IQv1 is that we won't need to add individual
store
methods for each type of query, including for all wrapped store
layers.)
If we have clear non-IQ use cases for these methods (e.g., use cases
within
processors), then they'll need to be added as part of the store
interface
for sure. I'm leaning towards adding them as part of the store
interface
but given the ambiguity here, it may be preferrable to defer to a
follow-up
KIP. OTOH, if you think the versioned store interface as proposed in
this
KIP is too bare bones to be useful, I'm open to adding it in now as
well.
2) I have a similar question as Matthias, about the timestampTo
argument
when doing a get. Is it inclusive or exclusive?
Same answer (and follow-up question) as above. Do you think it will be
confusing for `get(key, tsTo)` to use an inclusive time bound, while
`get(key, tsFrom, tsTo)` would use an exclusive tsTo time bound? Maybe
we
should rename `get(key, tsFrom, tsTo)` to `getVersions(...)` or
`getRange(...)` in order to avoid confusion.
3) validFrom sounds slightly confusing to me. It is essentially the
timestamp at which the record was inserted. validFrom makes it sound
like
validTo which can keep changing based on new records while *from* is
fixed.
WDYT?
"It is essentially the timestamp at which the record was inserted" <--
Yes,
that's correct.
I borrowed the "validFrom/validTo" terminology from temporal tables,
e.g.,
https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver16
.
I don't believe the terms "validFrom" or "validTo" are currently
exposed
anywhere in any of the user-facing interfaces (or Javadocs); I just
needed
a way to refer to the concepts in the KIP. Hopefully this is a
non-issue
(at least for now) as a result. Do you have a suggestion for
terminology
that would've been less confusing?
4) Even I think delete api should be supported.
Makes sense. It'd be to get your input on the same follow-up
questions I
asked Matthias above as well :)
On Tue, Nov 22, 2022 at 4:25 AM Sagar <sagarmeansoc...@gmail.com>
wrote:
Hi Victoria,
Thanks for the KIP. Seems like a very interesting idea!
I have a couple of questions:
1) Did you consider adding a method similar to :
List<ValueAndTimeStamp<V>> get(K key, long from, long to)?
I think this could be useful considering that this
versioning scheme unlocks time travel at a key basis. WDYT?
2) I have a similar question as Matthias, about the timestampTo
argument
when doing a get. Is it inclusive or exclusive?
3) validFrom sounds slightly confusing to me. It is essentially the
timestamp at which the record was inserted. validFrom makes it sound
like
validTo which can keep changing based on new records while *from* is
fixed.
WDYT?
4) Even I think delete api should be supported.
Thanks!
Sagar.
On Tue, Nov 22, 2022 at 8:02 AM Matthias J. Sax <mj...@apache.org>
wrote:
Thanks for the KIP Victoria. Very well written!
Couple of questions (many might just require to add some more
details
to
the KIP):
(1) Why does the new store not extend KeyValueStore, but
StateStore?
In the end, it's a KeyValueStore?
(2) Should we have a ReadOnlyVersionedKeyValueStore? Even if we
don't
want to support IQ in this KIP, it might be good to add this
interface
right away to avoid complications for follow up KIPs? Or won't there
by
any complications anyway?
(3) Why do we not have a `delete(key)` method? I am ok with not
supporting all methods from existing KV-store, but a `delete(key)`
seems
to be fundamentally to have?
(4a) Do we need `get(key)`? It seems to be the same as `get(key,
MAX_VALUE)`? Maybe is good to have as syntactic sugar though? Just
for
my own clarification (should we add something to the JavaDocs?).
(4b) Should we throw an exception if a user queries out-of-bound
instead of returning `null` (in `get(key,ts)`)?
-> You put it into "rejected alternatives", and I understand
your
argument. Would love to get input from others about this question
though. -- It seems we also return `null` for windowed stores, so
maybe
the strongest argument is to align to existing behavior? Or do we
have
case for which the current behavior is problematic?
(4c) JavaDoc on `get(key,ts)` says: "(up to store implementation
discretion when this is the case)" -> Should we make it a stricter
contract such that the user can reason about it better (there is WIP
to
make retention time a strict bound for windowed stores atm)
-> JavaDocs on `persistentVersionedKeyValueStore` seems to
suggest a
strict bound, too.
(5a) Do we need to expose `segmentInterval`? For windowed-stores,
we
also use segments but hard-code it to two (it was exposed in earlier
versions but it seems not useful, even if we would be open to expose
it
again if there is user demand).
(5b) JavaDocs says: "Performance degrades as more record versions
for
the same key are collected in a single segment. On the other hand,
out-of-order writes and reads which access older segments may slow
down
if there are too many segments." -- Wondering if JavaDocs should
make
any statements about expected performance? Seems to be an
implementation
detail?
(6) validTo timestamp is "exclusive", right? Ie, if I query
`get(key,ts[=validToV1])` I would get `null` or the "next" record v2
with validFromV2=ts?
(7) The KIP says, that segments are stores in the same RocksDB --
for
this case, how are efficient deletes handled? For windowed-store, we
can
just delete a full RocksDB.
(8) Rejected alternatives: you propose to not return the validTo
timestamp -- if we find it useful in the future to return it, would
there be a clean path to change it accordingly?
-Matthias
On 11/16/22 9:57 PM, Victoria Xia wrote:
Hi everyone,
I have a proposal for introducing versioned state stores in Kafka
Streams.
Versioned state stores are similar to key-value stores except they
can
store multiple record versions for a single key. This KIP focuses
on
interfaces only in order to limit the scope of the KIP.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-889%3A+Versioned+State+Stores
Thanks,
Victoria