[ https://issues.apache.org/jira/browse/CASSANDRA-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072895#comment-17072895 ]

Stefania Alborghetti commented on CASSANDRA-15229:
--------------------------------------------------

We hit this buffer pool regression in our DSE fork a while ago. Because our 
chunk cache became much larger once it replaced the OS page cache, off-heap 
memory grew significantly beyond the configured limits. This was partly due to 
some leaks, but fragmentation in the current design of the buffer pool was a 
big part of it.

This is how we solved it:

 - a bump-the-pointer slab approach for the transient pool, not too dissimilar 
from the current implementation. We then exploit our thread-per-core 
architecture: core threads each get a dedicated slab, while other threads share 
a global slab (a rough sketch of both slab types appears after this list).

 - a bitmap-based slab approach for the permanent pool, which is only used by 
the chunk cache. These slabs can only issue buffers of a single size; one bit 
is flipped in the bitmap for each buffer issued. When multiple buffers are 
requested, the slab tries to issue consecutive addresses, but this is not 
guaranteed, since we want to avoid memory fragmentation. We keep global lists 
of these slabs, sorted by buffer size, where each size is a power of two. Slabs 
are taken out of these lists when they are full and put back into circulation 
when they have space available. The lists are global, but core threads get a 
thread-local stash of buffers, i.e. they request multiple buffers at a time in 
order to reduce contention on the global lists.
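
To make the slab mechanics concrete, here is a rough, illustrative sketch of 
both slab types. This is not our actual code: synchronization, the per-core 
stash, native allocation and the address-to-slab bookkeeping are all omitted, 
and the class and field names are made up for illustration.

{code:java}
import java.nio.ByteBuffer;

// Transient pool: a bump-the-pointer slab, reduced to an offset that only moves forward.
final class BumpSlab
{
    private final ByteBuffer memory = ByteBuffer.allocateDirect(1 << 20);
    private int offset;

    ByteBuffer allocate(int size)
    {
        if (offset + size > memory.capacity())
            return null;                              // slab exhausted; caller grabs a new slab
        ByteBuffer dup = memory.duplicate();
        dup.position(offset).limit(offset + size);
        offset += size;
        return dup.slice();
    }
}

// Permanent pool: a bitmap-based slab that only issues buffers of a single,
// power-of-two size, flipping one bit per buffer issued (at most 64 buffers here).
final class BitmapSlab
{
    private final ByteBuffer memory;
    private final int bufferSize;
    private long freeBits;                            // bit i set => buffer i is free

    BitmapSlab(int bufferSize, int bufferCount)
    {
        assert Integer.bitCount(bufferSize) == 1 && bufferCount <= 64;
        this.memory = ByteBuffer.allocateDirect(bufferSize * bufferCount);
        this.bufferSize = bufferSize;
        this.freeBits = bufferCount == 64 ? -1L : (1L << bufferCount) - 1;
    }

    // Full slabs are taken off the global per-size list until a buffer is freed.
    boolean isFull()
    {
        return freeBits == 0;
    }

    // Issue one buffer by clearing the lowest free bit; null if the slab is full.
    ByteBuffer allocate()
    {
        if (freeBits == 0)
            return null;
        int index = Long.numberOfTrailingZeros(freeBits);
        freeBits &= freeBits - 1;
        ByteBuffer dup = memory.duplicate();
        dup.position(index * bufferSize).limit((index + 1) * bufferSize);
        return dup.slice();
    }

    // Return buffer 'index' to the slab; a previously full slab re-enters circulation.
    void free(int index)
    {
        freeBits |= 1L << index;
    }
}
{code}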

We changed the chunk cache to always store chunks of the same size. If we need 
to read chunks of a different size, we store an array of buffers in the cache 
and request multiple buffers at the same time. If we happen to get consecutive 
addresses, we optimize for this case by building a single byte buffer over the 
first address. We also optimized the chunk cache to store memory addresses 
rather than byte buffers, which significantly reduced heap usage; the byte 
buffers are materialized on the fly.
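
As an illustration of that last point, a cache entry might look roughly like 
the sketch below. This is an assumption-laden outline, not the real 
implementation: the fixed BUFFER_SIZE, the class name and the 
MemoryUtil.getByteBuffer(address, length) helper (Cassandra has a similar 
utility in org.apache.cassandra.utils.memory.MemoryUtil, but treat the exact 
signature as approximate) are stand-ins.

{code:java}
import java.nio.ByteBuffer;

import org.apache.cassandra.utils.memory.MemoryUtil;   // assumed helper; signature approximate

// Sketch of a chunk cache entry that stores raw addresses instead of ByteBuffers
// and materializes the buffers on demand.
final class CachedChunk
{
    static final int BUFFER_SIZE = 64 * 1024;     // every pooled buffer has this fixed size

    private final long[] addresses;               // one native address per pooled buffer
    private final boolean contiguous;             // true if the pool handed out consecutive addresses

    CachedChunk(long[] addresses)
    {
        this.addresses = addresses;
        boolean consecutive = true;
        for (int i = 1; i < addresses.length; i++)
            consecutive &= addresses[i] == addresses[i - 1] + BUFFER_SIZE;
        this.contiguous = consecutive;
    }

    // Materialize ByteBuffers on the fly: a single buffer over the first address
    // when the memory is contiguous, an array of per-buffer views otherwise.
    ByteBuffer[] buffers()
    {
        if (contiguous)
            return new ByteBuffer[]{ MemoryUtil.getByteBuffer(addresses[0], addresses.length * BUFFER_SIZE) };

        ByteBuffer[] result = new ByteBuffer[addresses.length];
        for (int i = 0; i < addresses.length; i++)
            result[i] = MemoryUtil.getByteBuffer(addresses[i], BUFFER_SIZE);
        return result;
    }
}
{code}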

For the permanent case, we chose to constrain the size of the buffers in the 
cache so that memory in the pool could be fully used. This may or may not be 
what people prefer; our choice was driven by the large size of the cache, 20+ 
GB. An approach that tolerates some memory fragmentation may be sufficient for 
smaller cache sizes.

Please let me know if there is interest in porting this solution to 4.0 or 4.x. 
I can share the code if needed.



> BufferPool Regression
> ---------------------
>
>                 Key: CASSANDRA-15229
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15229
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Caching
>            Reporter: Benedict Elliott Smith
>            Assignee: ZhaoYang
>            Priority: Normal
>             Fix For: 4.0, 4.0-beta
>
>
> The BufferPool was never intended to be used for a {{ChunkCache}}, and we 
> need to either change our behaviour to handle uncorrelated lifetimes or use 
> something else.  This is particularly important with the default chunk size 
> for compressed sstables being reduced.  If we address the problem, we should 
> also utilise the BufferPool for native transport connections like we do for 
> internode messaging, and reduce the number of pooling solutions we employ.
> Probably the best thing to do is to improve BufferPool’s behaviour when used 
> for things with uncorrelated lifetimes, which essentially boils down to 
> tracking those chunks that have not been freed and re-circulating them when 
> we run out of completely free blocks.  We should probably also permit 
> instantiating separate {{BufferPool}}, so that we can insulate internode 
> messaging from the {{ChunkCache}}, or at least have separate memory bounds 
> for each, and only share fully-freed chunks.
> With these improvements we can also safely increase the {{BufferPool}} chunk 
> size to 128KiB or 256KiB, to guarantee we can fit compressed pages and reduce 
> the amount of global coordination and per-allocation overhead.  We don’t need 
> 1KiB granularity for allocations, nor 16 byte granularity for tiny 
> allocations.


