[
https://issues.apache.org/jira/browse/SAMZA-428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156731#comment-14156731
]
Jay Kreps commented on SAMZA-428:
---------------------------------
Let me give the rationale here.
I agree that tuning caching in the setup we have is quite complex because there
are effectively three levels:
1. Our in-heap row cache
2. LevelDB/RocksDB uncompressed block cache
3. LevelDB/RocksDB compressed blocks cached in the filesystem
How to allocate memory among these three optimally is pretty
workload-specific.
The row cache (a) avoids serialization overhead, (b) avoids writes to Kafka and
disk I/O entirely, but (c) is extremely wasteful of memory. The memory waste is
worth considering: because of the number of Java objects that end up cached, it
is very unlikely you can get more than 30% useful data versus object, heap, and
data-structure overhead. So for big chunks of memory I suspect the filesystem
or RocksDB cache is better.
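As a back-of-envelope illustration of that 30% figure (the per-object sizes
below are assumptions for a typical 64-bit JVM with compressed oops, not
measurements of Samza's store):
{code:java}
// Illustrative only: estimates the on-heap cost of caching one small entry,
// e.g. "user-12345" -> 42L, as plain Java objects in a HashMap.
public class RowCacheOverheadSketch {
    public static void main(String[] args) {
        long usefulBytes = 10 + 8;    // ~10 key bytes + one 8-byte long value
        long stringKey   = 24 + 40;   // String object + backing char[] for 10 chars (assumed)
        long boxedValue  = 24;        // java.lang.Long wrapper (assumed)
        long mapNode     = 32;        // HashMap node: header, hash, key/value/next refs (assumed)
        long tableSlot   = 4;         // compressed reference in the hash table array (assumed)
        long heapBytes   = stringKey + boxedValue + mapNode + tableSlot;
        System.out.printf("useful/heap = %d/%d = ~%.0f%%%n",
            usefulBytes, heapBytes, 100.0 * usefulBytes / heapBytes);
        // ~15% useful data here, comfortably under the 30% ceiling mentioned above.
    }
}
{code}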
So why have an in-process cache at all? The rationale was that there are
actually lots of simple cases that can be vastly improved with even a very
small in-process cache. These are cases where you are incrementing a small
number of counters over and over again. Logging out each change is very
expensive and the serialization overhead is really high since each increment
requires a deserialization and a reserialization. By defaulting to just a small
in-process cache, I think we can make the small-data-set case pretty efficient
out of the box at the cost of just a little bit of memory.
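To make that mechanism concrete, here is a minimal sketch of the idea: a tiny
write-back cache of deserialized rows in front of a store, so repeated
increments of a hot key only touch a heap object, and serialization plus
changelog writes happen once per flush. The SmallRowCache class, the
BackingStore interface, and the batching policy are hypothetical illustrations,
not Samza's actual CachedStore code.
{code:java}
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical write-back row cache in front of a generic store.
public class SmallRowCache<K, V> {

    // Stands in for the serde + RocksDB + changelog path underneath.
    public interface BackingStore<K, V> {
        V get(K key);                 // deserializes on every call
        void put(K key, V value);     // serializes and logs on every call
    }

    private final BackingStore<K, V> store;
    private final int writeBatchSize;
    private final Set<K> dirty = new LinkedHashSet<>();
    private final LinkedHashMap<K, V> cache;   // access-order LRU of deserialized rows

    public SmallRowCache(final BackingStore<K, V> store, final int maxEntries,
                         int writeBatchSize) {
        this.store = store;
        this.writeBatchSize = writeBatchSize;
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() <= maxEntries) {
                    return false;
                }
                if (dirty.remove(eldest.getKey())) {
                    store.put(eldest.getKey(), eldest.getValue()); // don't lose a dirty row
                }
                return true;
            }
        };
    }

    public V get(K key) {
        V value = cache.get(key);
        if (value == null) {                  // miss: pay the deserialization cost once
            value = store.get(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    public void put(K key, V value) {
        cache.put(key, value);                // repeated updates only touch the heap object
        dirty.add(key);
        if (dirty.size() >= writeBatchSize) {
            flush();
        }
    }

    public void flush() {                     // one serialize + changelog write per dirty key
        for (K key : dirty) {
            store.put(key, cache.get(key));
        }
        dirty.clear();
    }
}
{code}
With maxEntries and the write batch size both small (say, a few hundred
entries), a hot counter is read and incremented entirely in the heap and only
periodically serialized and logged out.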
> Investigate: how to tune down caching in the KeyValueStore implementations
> --------------------------------------------------------------------------
>
> Key: SAMZA-428
> URL: https://issues.apache.org/jira/browse/SAMZA-428
> Project: Samza
> Issue Type: Improvement
> Components: kv
> Affects Versions: 0.8.0
> Reporter: Chinmay Soman
> Fix For: 0.8.0
>
>
> Currently, we have a 'CachedStore' layer on top of the KeyValueStore
> implementation that we use. This might lead to double caching:
> i) Once at the CachedStore layer
> ii) Possibly cached again in the specific K-V store that we use (e.g.,
> RocksDB / BDB)
> We need the CachedStore layer so that writes to the LoggedStore (if
> configured) are done efficiently.
> We can then potentially do some config tuning for the K-V store to reduce its
> memory footprint and simply write to disk.