[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196296#comment-14196296 ]
Ariel Weisberg edited comment on CASSANDRA-7438 at 11/4/14 4:13 PM:
--------------------------------------------------------------------

bq. No we don't. We have locks per Segment, this is very similar to lock stripping/Java's concurrent hash map.

Thanks for clearing that up.

bq. Not really we lock globally when we reach 100% of the space and we freeup to 80% of the space and we spread the overhead to other threads based on who ever has the item partition lock. It won't be hard to make this part of the queue thread and will try it for the next release of lruc.

OK, that makes sense. 20% of the cache could be many milliseconds of work if you are using many gigabytes of cache. That's not a great thing to foist on a random victim thread. If you handed that to the queue thread, I think you run into another issue: the ring buffer doesn't appear to check for queue full, so the queue thread could go out to lunch for a while. Not a big deal, but finer-grained scheduling will probably be necessary.

bq. If you look at the code closer to memcached. Actually I started of stripping memcached code so we can run it in process instead of running as a separate process and removing the global locks in queue reallocation etc and eventually diverged too much from it. The other reason it doesn't use slab allocators is because we wanted the memory allocators to do the right thing we already have tested Cassandra with Jemalloc.

Ah, very cool. jemalloc is not a moving allocator, whereas it looks like memcached's slabs implement rebalancing to accommodate changes in the size distribution. That would actually be one of the really nice things to keep, IMO. On large-memory systems with a cache that scales and performs, you would end up dedicating as much RAM as possible to the row cache/key cache and not the page cache, since the page cache is not as granular (correct me if the story for C* is different). If you dedicate 80% of RAM to the cache, that doesn't leave a lot of space for fragmentation. By using a heap allocator you also lose the ability to enforce hard, predictable limits on memory used by the cache, since you didn't map it yourself. I could be totally off base, and jemalloc might be good enough.

bq. There is some comments above which has the reasoning for it (please see the above comments). PS: I believe there was some tickets on Current RowCache complaining about the overhead.

I don't have a performance beef with JNI, especially the way you have done it, which I think is pretty efficient. I think the overhead of JNI (one or two slightly more expensive function calls) would be eclipsed by things like the cache misses, coherence traffic, and pipeline stalls that are part of accessing and maintaining a concurrent cache (Java or C++). It's all just intuition without comparative microbenchmarks of the two caches. Java might look a little faster just due to allocator performance, but we know you pay for that in other ways. I think what you have made scratches the itch for a large cache quite well, and beats the status quo.

I don't agree that Unsafe couldn't do the exact same thing with no on-heap references. The hash table, ring buffer, and individual item entries are all being malloced, and you can do that from Java using Unsafe. You don't need to implement a ring buffer because you can use the Disruptor. I also wonder whether splitting the cache into several instances, each with a coarse lock per instance, wouldn't result in simpler and (I know performance is not an issue) fast enough code.
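To make the Unsafe point concrete, here is a minimal, purely illustrative sketch of allocating and freeing an entry off heap with sun.misc.Unsafe. The class name (OffHeapEntry) and the header/key/value layout are invented for the example and are not taken from the attached patch:

{code:java}
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Illustrative only: an off-heap entry laid out as an 8-byte key length,
// an 8-byte value length, then the key bytes and the value bytes. Only the
// raw address (a long) ever lives on the Java heap.
public final class OffHeapEntry
{
    private static final Unsafe unsafe;
    static
    {
        try
        {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            unsafe = (Unsafe) f.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }

    private static final int HEADER = 16; // key length (8) + value length (8)

    // Allocates the entry off heap and returns its address; the caller owns it
    // and must eventually pass it to free().
    static long allocate(byte[] key, byte[] value)
    {
        long address = unsafe.allocateMemory(HEADER + key.length + value.length);
        unsafe.putLong(address, key.length);
        unsafe.putLong(address + 8, value.length);
        copyIn(key, address + HEADER);
        copyIn(value, address + HEADER + key.length);
        return address;
    }

    static long valueLength(long address)
    {
        return unsafe.getLong(address + 8);
    }

    static void free(long address)
    {
        unsafe.freeMemory(address);
    }

    private static void copyIn(byte[] src, long destAddress)
    {
        unsafe.copyMemory(src, Unsafe.ARRAY_BYTE_BASE_OFFSET, null, destAddress, src.length);
    }
}
{code}

The hash table buckets and ring buffer slots could be carved out of memory obtained the same way, so the only on-heap state is a handful of long addresses.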
I don't want to advocate doing something different for performance, but rather to point out that there is the possibility of a relatively simple implementation via Unsafe. You could coalesce all the contended fields for each instance (stats, lock field, LRU head) into a single cache line, and then rely on a single barrier when releasing a coarse-grained lock. The fine-grained locking and CASing result in several pipeline stalls, because the memory barriers implicit in each one require the store buffers to drain. There may even be a suitable off-heap map implementation out there already.
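For the "several instances, each with a coarse lock" idea, here is a rough sketch of the shape I have in mind. An on-heap, access-ordered LinkedHashMap stands in for the off-heap hash table and LRU list, and entry addresses come from the hypothetical OffHeapEntry sketch above; this is not the patch's design, just an illustration:

{code:java}
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: the cache split into a few independent segments, each
// guarded by one coarse lock (the segment's monitor). Stats and LRU order are
// mutated only while that lock is held, so the monitor release is the single
// barrier per operation.
public final class SegmentedCache
{
    private final Segment[] segments;

    public SegmentedCache(int segmentCount, int maxEntriesPerSegment)
    {
        segments = new Segment[segmentCount];
        for (int i = 0; i < segmentCount; i++)
            segments[i] = new Segment(maxEntriesPerSegment);
    }

    public Long get(byte[] key)
    {
        return segmentFor(key).get(new Key(key));
    }

    public void put(byte[] key, long entryAddress)
    {
        segmentFor(key).put(new Key(key), entryAddress);
    }

    private Segment segmentFor(byte[] key)
    {
        return segments[(Arrays.hashCode(key) & 0x7fffffff) % segments.length];
    }

    private static final class Segment
    {
        private final int maxEntries;
        private long hits, misses;
        // Access-ordered map: iteration order is least recently used first.
        private final LinkedHashMap<Key, Long> lru = new LinkedHashMap<>(16, 0.75f, true);

        Segment(int maxEntries)
        {
            this.maxEntries = maxEntries;
        }

        synchronized Long get(Key key)
        {
            Long address = lru.get(key);
            if (address == null) misses++; else hits++;
            return address;
        }

        synchronized void put(Key key, long address)
        {
            Long replaced = lru.put(key, address);
            if (replaced != null)
                OffHeapEntry.free(replaced);     // overwrote an existing entry
            while (lru.size() > maxEntries)      // evict least recently used
            {
                Map.Entry<Key, Long> eldest = lru.entrySet().iterator().next();
                lru.remove(eldest.getKey());
                OffHeapEntry.free(eldest.getValue());
            }
        }
    }

    // byte[] keys need value-based equals/hashCode to serve as map keys.
    private static final class Key
    {
        final byte[] bytes;
        Key(byte[] bytes) { this.bytes = bytes; }
        @Override public int hashCode() { return Arrays.hashCode(bytes); }
        @Override public boolean equals(Object o) { return o instanceof Key && Arrays.equals(bytes, ((Key) o).bytes); }
    }
}
{code}

Since the hit/miss counters and the LRU order are touched only while the segment's monitor is held, the monitor release at the end of each operation is the one barrier a thread pays per access, instead of one per CAS on a finer-grained structure.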
> Serializing Row cache alternative (Fully off heap)
> --------------------------------------------------
>
>                 Key: CASSANDRA-7438
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>         Environment: Linux
>            Reporter: Vijay
>            Assignee: Vijay
>              Labels: performance
>             Fix For: 3.0
>
>         Attachments: 0001-CASSANDRA-7438.patch
>
>
> Currently SerializingCache is only partially off heap; keys are still stored on the JVM heap as ByteBuffers.
> * GC costs are higher for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better results, but this requires careful tuning.
> * The per-entry memory overhead of the cache is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off heap and use JNI to interact with the cache. We might want to ensure that the new implementation matches the existing APIs (ICache), and the implementation needs to have safe memory access, low memory overhead, and as few memcpys as possible.
> We might also want to make this cache configurable.