Hi John, I think you're referring to the "record cache" that's provided by the ThreadCache class?
1-3. I was hoping to (eventually) remove the "flush-on-commit" behaviour from RocksDbStore, so that RocksDB can choose when to flush memtables, enabling users to tailor RocksDB performance to their workload. Explicitly flushing the Record Cache to files instead would entail either flushing on every commit, or the current behaviour of flushing on every commit provided at least 10K records have been processed. Compared with RocksDB-managed memtable flushing, this is very inflexible. If we pursue this design, I highly recommend replacing the hard-coded 10K limit with something configurable, so that users can tune flush behaviour for their workloads.

4. Tracking the changelog offsets in another column family (CF) and updating it atomically with the main CFs is orthogonal, I think, as it can also be done when using memtables, provided RocksDB's "Atomic Flush" feature is enabled. This is something I'd originally planned for this KIP, but we're trying to pull it out into a later KIP to keep things more manageable.

> * we don't fragment memory between the RecordCache and the memtables

By memory fragmentation, I think you mean duplication: we're caching the records both in the (on-heap) Record Cache and in the RocksDB memtables? That's a good point I hadn't considered before. Wouldn't a simpler solution be to just disable the record cache for RocksDB stores (by default) and let the memtables do the caching? Although I guess that would reduce read performance, which could be especially important for joins.

> * RecordCache gives far higher performance than memtable for reads and writes

I'll concede this point. The JNI boundary plus RocksDB record encoding will likely make it impossible to ever match the Record Cache on throughput.

> * we don't need any new "transaction" concepts or memory bound configs

Maybe.
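To make the suggestion in 1-3 concrete, here's a minimal sketch of what a configurable flush threshold could look like in place of the hard-coded 10K record limit. The config key "statestore.flush.threshold.records" and the FlushPolicy class are hypothetical names for illustration, not existing Kafka Streams APIs:

```java
import java.util.Map;

/**
 * Hypothetical sketch: a commit-time flush policy whose record threshold
 * is read from config instead of being hard-coded to 10K.
 * "statestore.flush.threshold.records" is an invented config key.
 */
class FlushPolicy {
    static final String THRESHOLD_CONFIG = "statestore.flush.threshold.records";
    static final int DEFAULT_THRESHOLD = 10_000; // today's hard-coded limit

    private final int threshold;
    private int recordsSinceFlush = 0;

    FlushPolicy(final Map<String, Object> configs) {
        final Object configured = configs.get(THRESHOLD_CONFIG);
        this.threshold = configured == null
                ? DEFAULT_THRESHOLD
                : Integer.parseInt(configured.toString());
    }

    /** Called on commit; returns true when enough records have accumulated to flush. */
    boolean shouldFlushOnCommit(final int recordsProcessed) {
        recordsSinceFlush += recordsProcessed;
        if (recordsSinceFlush >= threshold) {
            recordsSinceFlush = 0;
            return true;
        }
        return false;
    }
}
```

The default preserves today's behaviour, so existing deployments would be unaffected unless they opt in to a different threshold.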
Unless I'm mistaken, the Record Cache only retains the most recently written value for a key, which means Interactive Queries would always observe new record values *before* they're committed to the changelog. While this is the current behaviour, it's also a violation of consistency, because successive queries could observe a value regress if there's an error writing to the changelog (e.g. a changelog transaction rollback or a timeout). This is something that KIP-892 aims to improve on, as its current design would ensure that records are only observed by IQ *after* they have been committed to the Kafka changelog. That said, it definitely sounds *feasible*.

Regards,
Nick
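P.S. To illustrate the read semantics I'm describing, here's a deliberately simplified, hypothetical sketch of "only visible after commit". It is not the actual KIP-892 implementation (which buffers writes in RocksDB rather than a HashMap); it just shows why a rollback can't cause a query to observe a regression:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of commit-gated reads: writes are buffered and
 * only become visible to queries once the changelog write is committed.
 * A rollback discards the buffer, so the visible view never regresses.
 */
class CommittedView<K, V> {
    private final Map<K, V> committed = new HashMap<>();
    private final Map<K, V> uncommitted = new HashMap<>();

    void put(final K key, final V value) {
        uncommitted.put(key, value);   // buffered; not yet visible to queries
    }

    V get(final K key) {
        return committed.get(key);     // queries see committed state only
    }

    void commit() {                    // called once the changelog write succeeds
        committed.putAll(uncommitted);
        uncommitted.clear();
    }

    void rollback() {                  // e.g. changelog transaction failure or timeout
        uncommitted.clear();
    }
}
```

Contrast this with the Record Cache behaviour above, where a query served from the cache can see a value that is later rolled back.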