[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196296#comment-14196296 ]
Ariel Weisberg edited comment on CASSANDRA-7438 at 11/4/14 4:13 PM:
--------------------------------------------------------------------

bq. No we don't. We have locks per Segment, this is very similar to lock stripping/Java's concurrent hash map.

Thanks for clearing that up.

bq. Not really we lock globally when we reach 100% of the space and we freeup to 80% of the space and we spread the overhead to other threads based on who ever has the item partition lock. It won't be hard to make this part of the queue thread and will try it for the next release of lruc.

OK, that makes sense. 20% of the cache could be many milliseconds of work if you are using many gigabytes of cache. That's not a great thing to foist on a random victim thread. If you handed that to the queue thread, I think you run into another issue: the ring buffer doesn't appear to check for queue full, so the queue thread could go out to lunch for a while. Not a big deal, but finer-grained scheduling will probably be necessary.

bq. If you look at the code closer to memcached. Actually I started of stripping memcached code so we can run it in process instead of running as a separate process and removing the global locks in queue reallocation etc and eventually diverged too much from it. The other reason it doesn't use slab allocators is because we wanted the memory allocators to do the right thing we already have tested Cassandra with Jemalloc.

Ah, very cool. jemalloc is not a moving allocator, whereas it looks like memcached's slabs implement rebalancing to accommodate changes in the size distribution. That would actually be one of the really nice things to keep, IMO. On large-memory systems with a cache that scales and performs, you would end up dedicating as much RAM as possible to the row cache/key cache and not the page cache, since the page cache is not as granular (correct me if the story for C* is different). If you dedicate 80% of RAM to the cache, that doesn't leave a lot of space for fragmentation. By using a heap allocator you also lose the ability to enforce hard, predictable limits on memory used by the cache, since you didn't map it yourself. I could be totally off base, and jemalloc might be good enough.

bq. There is some comments above which has the reasoning for it (please see the above comments). PS: I believe there was some tickets on Current RowCache complaining about the overhead.

I don't have a performance beef with JNI, especially the way you have done it, which I think is pretty efficient. I think the overhead of JNI (one or two slightly more expensive function calls) would be eclipsed by things like the cache misses, coherence traffic, and pipeline stalls that are part of accessing and maintaining a concurrent cache (Java or C++). It's all just intuition without comparative microbenchmarks of the two caches. Java might look a little faster just due to allocator performance, but we know you pay for that in other ways. I think what you have made scratches the itch for a large cache quite well, and beats the status quo.

I don't agree that Unsafe couldn't do the exact same thing with no on-heap references. The hash table, ring buffer, and individual item entries are all being malloced, and you can do that from Java using Unsafe. You don't need to implement a ring buffer because you can use the Disruptor. I also wonder whether splitting the cache into several instances, each with a coarse lock per instance, wouldn't result in simpler and (I know performance is not an issue) fast enough code.
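To make the Unsafe point concrete, here is a minimal, purely illustrative sketch of allocating and freeing an entry off heap with sun.misc.Unsafe. The class name (OffHeapEntry) and the header/key/value layout are invented for the example and are not taken from the attached patch:

{code:java}
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Illustrative only: an off-heap entry laid out as an 8-byte key length,
// an 8-byte value length, then the key bytes and the value bytes. Only the
// raw address (a long) ever lives on the Java heap.
public final class OffHeapEntry
{
    private static final Unsafe unsafe;
    static
    {
        try
        {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            unsafe = (Unsafe) f.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }

    private static final int HEADER = 16; // key length (8) + value length (8)

    // Allocates the entry off heap and returns its address; the caller owns it
    // and must eventually pass it to free().
    static long allocate(byte[] key, byte[] value)
    {
        long address = unsafe.allocateMemory(HEADER + key.length + value.length);
        unsafe.putLong(address, key.length);
        unsafe.putLong(address + 8, value.length);
        copyIn(key, address + HEADER);
        copyIn(value, address + HEADER + key.length);
        return address;
    }

    static long valueLength(long address)
    {
        return unsafe.getLong(address + 8);
    }

    static void free(long address)
    {
        unsafe.freeMemory(address);
    }

    private static void copyIn(byte[] src, long destAddress)
    {
        unsafe.copyMemory(src, Unsafe.ARRAY_BYTE_BASE_OFFSET, null, destAddress, src.length);
    }
}
{code}

The hash table buckets and ring buffer slots could be carved out of memory obtained the same way, so the only on-heap state is a handful of long addresses.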
I don't want to advocate doing something different for performance, but rather to point out that there is the possibility of a relatively simple implementation via Unsafe. You could coalesce all the contended fields for each instance (stats, lock field, LRU head) into a single cache line, and then rely on a single barrier when releasing a coarse-grained lock. The fine-grained locking and CASing result in several pipeline stalls, because the memory barriers implicit in each one require the store buffers to drain. There may even be a suitable off-heap map implementation out there already.
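For the "several instances, each with a coarse lock" idea, here is a rough sketch of the shape I have in mind. An on-heap, access-ordered LinkedHashMap stands in for the off-heap hash table and LRU list, and entry addresses come from the hypothetical OffHeapEntry sketch above; this is not the patch's design, just an illustration:

{code:java}
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: the cache split into a few independent segments, each
// guarded by one coarse lock (the segment's monitor). Stats and LRU order are
// mutated only while that lock is held, so the monitor release is the single
// barrier per operation.
public final class SegmentedCache
{
    private final Segment[] segments;

    public SegmentedCache(int segmentCount, int maxEntriesPerSegment)
    {
        segments = new Segment[segmentCount];
        for (int i = 0; i < segmentCount; i++)
            segments[i] = new Segment(maxEntriesPerSegment);
    }

    public Long get(byte[] key)
    {
        return segmentFor(key).get(new Key(key));
    }

    public void put(byte[] key, long entryAddress)
    {
        segmentFor(key).put(new Key(key), entryAddress);
    }

    private Segment segmentFor(byte[] key)
    {
        return segments[(Arrays.hashCode(key) & 0x7fffffff) % segments.length];
    }

    private static final class Segment
    {
        private final int maxEntries;
        private long hits, misses;
        // Access-ordered map: iteration order is least recently used first.
        private final LinkedHashMap<Key, Long> lru = new LinkedHashMap<>(16, 0.75f, true);

        Segment(int maxEntries)
        {
            this.maxEntries = maxEntries;
        }

        synchronized Long get(Key key)
        {
            Long address = lru.get(key);
            if (address == null) misses++; else hits++;
            return address;
        }

        synchronized void put(Key key, long address)
        {
            Long replaced = lru.put(key, address);
            if (replaced != null)
                OffHeapEntry.free(replaced);     // overwrote an existing entry
            while (lru.size() > maxEntries)      // evict least recently used
            {
                Map.Entry<Key, Long> eldest = lru.entrySet().iterator().next();
                lru.remove(eldest.getKey());
                OffHeapEntry.free(eldest.getValue());
            }
        }
    }

    // byte[] keys need value-based equals/hashCode to serve as map keys.
    private static final class Key
    {
        final byte[] bytes;
        Key(byte[] bytes) { this.bytes = bytes; }
        @Override public int hashCode() { return Arrays.hashCode(bytes); }
        @Override public boolean equals(Object o) { return o instanceof Key && Arrays.equals(bytes, ((Key) o).bytes); }
    }
}
{code}

Since the hit/miss counters and the LRU order are touched only while the segment's monitor is held, the monitor release at the end of each operation is the one barrier a thread pays per access, instead of one per CAS on a finer-grained structure.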
> Serializing Row cache alternative (Fully off heap)
> --------------------------------------------------
>
>                 Key: CASSANDRA-7438
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>         Environment: Linux
>            Reporter: Vijay
>            Assignee: Vijay
>              Labels: performance
>             Fix For: 3.0
>
>         Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is only partially off heap; keys are still stored on the JVM heap as ByteBuffers.
> * GC costs are higher for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better results, but this requires careful tuning.
> * The per-entry memory overhead of the cache is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off heap and use JNI to interact with the cache. We might want to ensure that the new implementation matches the existing APIs (ICache), and the implementation needs to have safe memory access, low memory overhead, and as few memcpys as possible.
> We might also want to make this cache configurable.