Hi all,

The [current documentation](https://kafka.apache.org/35/javadoc/org/apache/kafka/streams/state/ReadOnlyKeyValueStore.html#range(K,K))
for ReadOnlyKeyValueStore#range() states that:

> Order is not guaranteed as bytes lexicographical ordering might not
> represent key order.

That makes sense: the ordering of two keys inserted via `store.put()`, as
determined by their `compareTo()` method, is not what determines the
ordering in the store; rather, it's the lexicographical comparison of the
serialized byte[] arrays that matters.
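
To make the difference concrete, here is a minimal sketch (not taken from
Kafka Streams itself). It assumes a hypothetical 4-byte big-endian integer
encoding and an unsigned bytewise comparison, which is how I understand
RocksDB's default comparator to behave. The class and method names are
made up for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class ByteOrderDemo {
    // Hypothetical key serializer: 4-byte big-endian encoding of an int.
    static byte[] serialize(int key) {
        return ByteBuffer.allocate(4).putInt(key).array();
    }

    // Compares two keys by their serialized bytes, treating each byte as
    // unsigned (assumption: this mirrors a bytewise store comparator).
    static int compareSerialized(int a, int b) {
        return Arrays.compareUnsigned(serialize(a), serialize(b));
    }

    public static void main(String[] args) {
        // Natural order: -1 sorts before 1, so this is negative.
        System.out.println(Integer.compare(-1, 1));
        // Byte order: -1 serializes to FF FF FF FF, which sorts after
        // 00 00 00 01, so this is positive.
        System.out.println(compareSerialized(-1, 1));
    }
}
```

With such an encoding, a range scan would return all non-negative keys
before any negative key, even though `compareTo()` says the opposite.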

Some observations after playing with it for over a year:

A) When you open a store for IQ without specifying a partition, a store is
opened (behind the scenes) for one partition, and once that store is
exhausted, the next partition is opened. There are no guarantees about the
order in which partitions are opened. As such, if you just
System.out.println() all the keys from the iterator, they are not globally
ordered.

B) WITHIN a partition, for example when you use .withPartition() while
requesting the ReadOnlyKeyValueStore, keys are indeed ordered according to
the bytes produced by the key serializer.
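
Observations A and B can be illustrated with a small simulation. This is
only a sketch: each partition is modeled as a TreeMap (a stand-in for a
sorted per-partition store, not the real Kafka Streams API), and the keys
are plain ASCII strings, for which String order and byte order coincide.
All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class PartitionScanDemo {
    // Scans each partition fully before moving on to the next one,
    // mimicking an IQ scan that does not specify a partition.
    static List<String> scanAll(List<TreeMap<String, String>> partitions) {
        List<String> keys = new ArrayList<>();
        for (TreeMap<String, String> partition : partitions) {
            keys.addAll(partition.keySet()); // sorted within the partition
        }
        return keys;
    }

    // True if every key is <= the key that follows it.
    static boolean isSorted(List<String> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i - 1).compareTo(keys.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    // Two hypothetical partitions holding interleaved keys.
    static List<String> demoScan() {
        TreeMap<String, String> p0 = new TreeMap<>();
        p0.put("a", "v");
        p0.put("c", "v");
        TreeMap<String, String> p1 = new TreeMap<>();
        p1.put("b", "v");
        p1.put("d", "v");
        List<TreeMap<String, String>> partitions = new ArrayList<>();
        partitions.add(p0);
        partitions.add(p1);
        return scanAll(partitions);
    }

    public static void main(String[] args) {
        List<String> keys = demoScan();
        System.out.println(keys + " sorted=" + isSorted(keys));
    }
}
```

Each partition's keys come out in order, but the concatenation does not,
which matches what I see when iterating without a partition.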

We at LittleHorse rely upon that behavior for API pagination, and if that
behavior were to change it would break some things.
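
To show why pagination depends on this ordering, here is a rough sketch of
bookmark-based paging. It is not our actual implementation: a TreeMap
stands in for a single-partition store, and `page`, `bookmark`, and
`demoStore` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class PaginationDemo {
    // Returns up to `limit` keys strictly after `bookmark` (or from the
    // start when bookmark is null). This is only correct if the scan
    // order is stable and sorted.
    static List<String> page(NavigableMap<String, String> store,
                             String bookmark, int limit) {
        NavigableMap<String, String> rest =
            bookmark == null ? store : store.tailMap(bookmark, false);
        List<String> keys = new ArrayList<>();
        for (String key : rest.keySet()) {
            if (keys.size() == limit) {
                break;
            }
            keys.add(key);
        }
        return keys;
    }

    // Hypothetical store contents: k1 through k5.
    static NavigableMap<String, String> demoStore() {
        NavigableMap<String, String> store = new TreeMap<>();
        for (int i = 1; i <= 5; i++) {
            store.put("k" + i, "value" + i);
        }
        return store;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> store = demoStore();
        List<String> first = page(store, null, 2);
        // The last key of a page becomes the bookmark for the next one.
        List<String> second = page(store, first.get(first.size() - 1), 2);
        System.out.println(first + " then " + second);
    }
}
```

If the store returned keys in an unspecified order, resuming from the last
key seen could skip or repeat entries.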

After some digging, it turns out that we *do* indeed get results in
lexicographical order of the keys' byte[] representation because that is
a public contract exposed by RocksDB.

I had asked Matthias offline if it would be possible to open a PR to
clarify in the documentation that all results *within a partition of the
Store* are ordered by the byte[] representation of the key, since I would
feel more comfortable relying upon a publicly documented API.

However, there are a few counterpoints to this:

- ReadOnlyKeyValueStore is an *interface*, not an implementation. The
lexicographical ordering is something we observe from the RocksDB
implementation. If the store were implemented with, for example, a HashMap,
this would not work.

- The semantics of ordering thus seem to be more associated with the
*implementation* rather than with the *interface*.

- Is it possible at all to add a clarification on the RocksDB store that
this behavior is a guarantee? Would that require a KIP?

I'd be super-happy if I could open a PR to put a public documentation note
somewhere on some implementation of a State Store that documents that this
ordering by byte[] representation is guaranteed for range scans, but I do
recognize that making a public documentation note is a contract, and as
such may require a KIP and/or not be accepted.

Any thoughts?

Thanks for reading,
Colt McNealy

*Founder, LittleHorse.dev*
