[ https://issues.apache.org/jira/browse/CASSANDRA-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076347#comment-17076347 ]
Stefania Alborghetti commented on CASSANDRA-15229: -------------------------------------------------- bq. The current implementation isn't really a bump the pointer allocator? It's bitmap based, though with a very tiny bitmap. Sorry it's been a while. Of course the current implementation is also bitmap based. The point is that it is not suitable for long lived buffers, similarly to our bump the pointer strategy. The transient case is easy to solve, either approach would work. bq. Could you elaborate on how these work, as my intuition is that anything designed for a thread-per-core architecture probably won't translate so well to the present state of the world. Though, either way, I suppose this is probably orthogonal to this ticket as we only need to address the {{ChunkCache}} part. The thread-per-core architecture makes it easy to identify threads that do most of the work and cause most of the contention. However, thread identification can be achieved also with thread pools or we can simply give all threads a local stash of buffers, provided that we return it when the thread dies. I don't think there is any other dependency on TPC beyond this. The design choice was mostly dictated by the size of the cache: with AIO reads the OS page cache is bypassed, and the chunk cache needs therefore to be very large, which is not the case if we use Java NIO reads or if we eventually implement asynchronous reads with the new uring API, bypassing AIO completely (which I do recommend). bq. We also optimized the chunk cache to store memory addresses rather than byte buffers, which significantly reduced heap usage. The byte buffers are materialized on the fly. bq. This would be a huge improvement, and a welcome backport if it is easy - though it might (I would guess) depend on Unsafe, which may be going away soon. It's orthogonal to this ticket, though, I think Yes it's based on the Unsafe. The addresses come from the slabs, and then we use the Unsafe to create hollow buffers and to set the address. This is an optimization and it clearly belongs to a separate ticket. {quote} We changed the chunk cache to always store buffers of the same size. We have global lists of these slabs, sorted by buffer size where each size is a power-of-two. How do these two statements reconcile? {quote} So let's assume the current workload is mostly on a table with 4k chunks, which translate to 4k buffers in the cache. Let's also assume that the workload is shifting towards another table, with 8k chunks. Alternatively, let's assume compression is ON, and an ALTER TABLE changes the chunk size. So now the chunk cache is slowly evicting 4k buffers and retaining 8k buffers. These buffers come from two different lists: the list of slabs serving 4k and the list serving 8k. Even if we collect all unused 4k slabs, until each slab has every single buffer returned, there will be wasted memory and we do not control how long that will take. To be fair, it's an extreme case, and we were perhaps over cautions in addressing this possibility by fixing the size of buffers in the cache. So it's possible that the redesigned buffer pool may work even with the current chunk cache implementation. bq. Is it your opinion that your entire ChunkCache implementation can be dropped wholesale into 4.0? I would assume it is still primarily multi-threaded. If so, it might be preferable to trying to fix the existing ChunkCache The changes to the chunk cache are not trivial and should be left as a follow up for 4.x or later in my opinion. The changes to the buffer pool can be dropped in 4.0 if you think that: - they are safe even in the presence of the case described above. - they are justified: memory wasted due to fragmentation is perhaps not an issue with a cache as little as 512 MB I'll try to share some code so you can have a clearer picture. > BufferPool Regression > --------------------- > > Key: CASSANDRA-15229 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15229 > Project: Cassandra > Issue Type: Bug > Components: Local/Caching > Reporter: Benedict Elliott Smith > Assignee: ZhaoYang > Priority: Normal > Fix For: 4.0, 4.0-beta > > > The BufferPool was never intended to be used for a {{ChunkCache}}, and we > need to either change our behaviour to handle uncorrelated lifetimes or use > something else. This is particularly important with the default chunk size > for compressed sstables being reduced. If we address the problem, we should > also utilise the BufferPool for native transport connections like we do for > internode messaging, and reduce the number of pooling solutions we employ. > Probably the best thing to do is to improve BufferPool’s behaviour when used > for things with uncorrelated lifetimes, which essentially boils down to > tracking those chunks that have not been freed and re-circulating them when > we run out of completely free blocks. We should probably also permit > instantiating separate {{BufferPool}}, so that we can insulate internode > messaging from the {{ChunkCache}}, or at least have separate memory bounds > for each, and only share fully-freed chunks. > With these improvements we can also safely increase the {{BufferPool}} chunk > size to 128KiB or 256KiB, to guarantee we can fit compressed pages and reduce > the amount of global coordination and per-allocation overhead. We don’t need > 1KiB granularity for allocations, nor 16 byte granularity for tiny > allocations. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org