[ 
https://issues.apache.org/jira/browse/CASSANDRA-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076347#comment-17076347
 ] 

Stefania Alborghetti commented on CASSANDRA-15229:
--------------------------------------------------

bq. The current implementation isn't really a bump the pointer allocator? It's 
bitmap based, though with a very tiny bitmap. 

Sorry, it's been a while. Of course the current implementation is also bitmap 
based. The point is that, just like our bump-the-pointer strategy, it is not 
suitable for long-lived buffers. The transient case is easy to solve; either 
approach would work.

bq. Could you elaborate on how these work, as my intuition is that anything 
designed for a thread-per-core architecture probably won't translate so well to 
the present state of the world. Though, either way, I suppose this is probably 
orthogonal to this ticket as we only need to address the {{ChunkCache}} part.

The thread-per-core architecture makes it easy to identify the threads that do 
most of the work and cause most of the contention. However, thread identification 
can also be achieved with thread pools, or we can simply give every thread a 
local stash of buffers, provided that the stash is returned when the thread dies. 
I don't think there is any other dependency on TPC beyond this. 
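
A minimal sketch of the local-stash idea just mentioned, assuming each thread 
drains its stash back into the shared pool when it terminates. LocalStash and 
the global queue are illustrative stand-ins, not the actual BufferPool API.

{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.concurrent.ConcurrentLinkedQueue;

final class LocalStash
{
    // Stand-in for the shared pool that outlives individual threads.
    private static final ConcurrentLinkedQueue<ByteBuffer> GLOBAL = new ConcurrentLinkedQueue<>();

    // Per-thread stash: taking from and returning to it is uncontended.
    private static final ThreadLocal<ArrayDeque<ByteBuffer>> STASH =
        ThreadLocal.withInitial(ArrayDeque::new);

    static ByteBuffer take(int size)
    {
        ByteBuffer b = STASH.get().poll();
        if (b == null)
            b = GLOBAL.poll();
        // For simplicity assume a single buffer size; undersized buffers fall back to a fresh allocation.
        return (b != null && b.capacity() >= size) ? b : ByteBuffer.allocateDirect(size);
    }

    static void recycle(ByteBuffer buffer)
    {
        buffer.clear();
        STASH.get().offer(buffer);
    }

    /** Must run when the owning thread dies, so the stash is returned rather than stranded. */
    static void onThreadDeath()
    {
        ArrayDeque<ByteBuffer> stash = STASH.get();
        GLOBAL.addAll(stash);
        stash.clear();
        STASH.remove();
    }
}
{code}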

The design choice was mostly dictated by the size of the cache: with AIO reads 
the OS page cache is bypassed, and the chunk cache therefore needs to be very 
large. That is not the case if we use Java NIO reads, or if we eventually 
implement asynchronous reads with the new io_uring API, bypassing AIO completely 
(which I do recommend). 

bq. We also optimized the chunk cache to store memory addresses rather than 
byte buffers, which significantly reduced heap usage. The byte buffers are 
materialized on the fly.

bq. This would be a huge improvement, and a welcome backport if it is easy - 
though it might (I would guess) depend on Unsafe, which may be going away soon. 
It's orthogonal to this ticket, though, I think

Yes, it's based on Unsafe. The addresses come from the slabs, and then we use 
Unsafe to create hollow buffers and to set the address. This is an optimization 
and it clearly belongs in a separate ticket.
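
Roughly, the hollow-buffer trick looks like this. This is a sketch, not the 
actual code: the class and method names are illustrative, and the slab remains 
responsible for the lifetime of the memory the buffer points at.

{code:java}
import java.lang.reflect.Field;
import java.nio.Buffer;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

final class HollowBuffers
{
    private static final Unsafe UNSAFE;
    private static final Class<?> DIRECT_BUFFER_CLASS = ByteBuffer.allocateDirect(0).getClass();
    private static final long ADDRESS_OFFSET;
    private static final long CAPACITY_OFFSET;

    static
    {
        try
        {
            Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
            theUnsafe.setAccessible(true);
            UNSAFE = (Unsafe) theUnsafe.get(null);
            // Buffer.address / Buffer.capacity hold the raw pointer and size of a direct buffer.
            ADDRESS_OFFSET = UNSAFE.objectFieldOffset(Buffer.class.getDeclaredField("address"));
            CAPACITY_OFFSET = UNSAFE.objectFieldOffset(Buffer.class.getDeclaredField("capacity"));
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }

    /**
     * Materialize a ByteBuffer over an address owned by a slab without allocating memory:
     * the "hollow" instance skips the constructor, then we point it at the slab's memory.
     */
    static ByteBuffer atAddress(long address, int capacity)
    {
        try
        {
            ByteBuffer buffer = (ByteBuffer) UNSAFE.allocateInstance(DIRECT_BUFFER_CLASS);
            UNSAFE.putLong(buffer, ADDRESS_OFFSET, address);
            UNSAFE.putInt(buffer, CAPACITY_OFFSET, capacity);
            buffer.clear(); // position = 0, limit = capacity
            return buffer;
        }
        catch (InstantiationException e)
        {
            throw new AssertionError(e);
        }
    }
}
{code}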

{quote}
    We changed the chunk cache to always store buffers of the same size.

    We have global lists of these slabs, sorted by buffer size where each size 
is a power-of-two.

How do these two statements reconcile?
{quote}

So let's assume the current workload is mostly on a table with 4k chunks, which 
translates to 4k buffers in the cache. Let's also assume that the workload is 
shifting towards another table with 8k chunks. Alternatively, let's assume 
compression is ON and an ALTER TABLE changes the chunk size. Now the chunk 
cache is slowly evicting 4k buffers and retaining 8k buffers. These buffers 
come from two different lists: the list of slabs serving 4k and the list 
serving 8k. Even if we collect all unused 4k slabs, a slab cannot be released 
until every single buffer has been returned to it, so there will be wasted 
memory, and we do not control how long that will take. To be fair, it's an 
extreme case, and we were perhaps overly cautious in addressing this 
possibility by fixing the size of buffers in the cache. So it's possible that 
the redesigned buffer pool may work even with the current chunk cache 
implementation. 
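
To make the structure concrete, here is a rough sketch of the per-size slab 
lists (illustrative names, not the actual API). When the workload shifts from 
4k to 8k chunks, any 4k slab with even one outstanding buffer stays pinned; 
that is the wasted memory described above.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class SlabLists
{
    static final class Slab
    {
        final int bufferSize;  // every buffer carved from this slab has this power-of-two size
        int outstanding;       // buffers handed out but not yet returned

        Slab(int bufferSize)
        {
            this.bufferSize = bufferSize;
        }

        // The slab's memory can only be released (or repurposed for another size)
        // once every buffer it served has come back.
        boolean canRelease()
        {
            return outstanding == 0;
        }
    }

    // One global list of slabs per power-of-two buffer size: 4k, 8k, 16k, ...
    private final Map<Integer, Deque<Slab>> bySize = new HashMap<>();

    Deque<Slab> listFor(int bufferSize)
    {
        return bySize.computeIfAbsent(bufferSize, k -> new ArrayDeque<>());
    }
}
{code}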

bq. Is it your opinion that your entire ChunkCache implementation can be 
dropped wholesale into 4.0? I would assume it is still primarily 
multi-threaded. If so, it might be preferable to trying to fix the existing 
ChunkCache

The changes to the chunk cache are not trivial and should be left as a 
follow-up for 4.x or later, in my opinion. 

The changes to the buffer pool can be dropped in 4.0 if you think that:

- they are safe even in the presence of the case described above; 
- they are justified: memory wasted due to fragmentation is perhaps not an 
issue with a cache as small as 512 MB

I'll try to share some code so you can have a clearer picture. 


> BufferPool Regression
> ---------------------
>
>                 Key: CASSANDRA-15229
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15229
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Caching
>            Reporter: Benedict Elliott Smith
>            Assignee: ZhaoYang
>            Priority: Normal
>             Fix For: 4.0, 4.0-beta
>
>
> The BufferPool was never intended to be used for a {{ChunkCache}}, and we 
> need to either change our behaviour to handle uncorrelated lifetimes or use 
> something else.  This is particularly important with the default chunk size 
> for compressed sstables being reduced.  If we address the problem, we should 
> also utilise the BufferPool for native transport connections like we do for 
> internode messaging, and reduce the number of pooling solutions we employ.
> Probably the best thing to do is to improve BufferPool’s behaviour when used 
> for things with uncorrelated lifetimes, which essentially boils down to 
> tracking those chunks that have not been freed and re-circulating them when 
> we run out of completely free blocks.  We should probably also permit 
> instantiating separate {{BufferPool}}, so that we can insulate internode 
> messaging from the {{ChunkCache}}, or at least have separate memory bounds 
> for each, and only share fully-freed chunks.
> With these improvements we can also safely increase the {{BufferPool}} chunk 
> size to 128KiB or 256KiB, to guarantee we can fit compressed pages and reduce 
> the amount of global coordination and per-allocation overhead.  We don’t need 
> 1KiB granularity for allocations, nor 16 byte granularity for tiny 
> allocations.


