[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920720#comment-13920720 ]
Benedict commented on CASSANDRA-6689:
-------------------------------------

bq. sort of RCU (I'm looking at you, OpOrder)

What do you mean here? If you mean read-copy-update, OpOrder is nothing like that.

bq. I'm not sure what is to retain here if we do that copy when we send to the wire

Ultimately, copying before sending to the wire is something I would like to avoid. Layering RefAction.allocateOnHeap() on top of this copying drops wire transfer speeds for Thrift by about 10% in my fairly rough-and-ready benchmarks, so copying clearly has a cost. Possibly the cost comes from unavoidably copying data you don't necessarily want to serialise, but it is there. If we want to get in-memory read operations to 10x their current performance, we can't go cutting any corners.

bq. introducing separate gc

I've stated clearly what this introduces as a benefit: overwrite workloads no longer cause excessive flushes.

bq. things but as we have a fixed number of threads it is going to work out the same way as for buffering open files in the steady system state

Your next sentence states that this is a large cause of memory consumption, so surely we should be putting that memory to other uses where possible (returning it to the buffer cache, or using it internally for more caching)?

bq. Temporary memory allocated by readers is exactly what we should be managing in the first place, because they allocate the most and it is always the biggest concern for us

I agree we should be moving to managing this as well; however, I disagree about how we should be managing it.
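To make the copying cost above concrete: a hypothetical sketch (the Cell class and the bulkCopy/deepCopy names are illustrative only, not Cassandra's actual classes) contrasting a raw bulk transfer between pre-allocated buffers with the per-record copy a result set actually requires. The latter allocates fresh objects for every cell, which is where the young-gen churn comes from.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Toy model of a row as a list of on-heap cells (illustrative names only).
final class Cell {
    final ByteBuffer name;
    final ByteBuffer value;
    Cell(ByteBuffer name, ByteBuffer value) { this.name = name; this.value = value; }
}

public class CopyChurn {
    // A raw memcpy-style transfer: one bulk put between pre-allocated
    // pools, allocating nothing on the heap.
    static void bulkCopy(ByteBuffer src, ByteBuffer dst) {
        dst.clear();
        src.rewind();
        dst.put(src);
    }

    // What copying a result set actually looks like: every cell yields
    // fresh ByteBuffer and Cell objects, i.e. young-gen garbage
    // proportional to the size of the result.
    static List<Cell> deepCopy(List<Cell> row) {
        List<Cell> copy = new ArrayList<>(row.size());
        for (Cell c : row) {
            ByteBuffer n = ByteBuffer.allocate(c.name.remaining());
            n.put(c.name.duplicate()).flip();
            ByteBuffer v = ByteBuffer.allocate(c.value.remaining());
            v.put(c.value.duplicate()).flip();
            copy.add(new Cell(n, v)); // three new objects per cell
        }
        return copy;
    }

    public static void main(String[] args) {
        ByteBuffer src = ByteBuffer.allocate(1024);
        ByteBuffer dst = ByteBuffer.allocate(1024);
        bulkCopy(src, dst); // zero allocations after setup

        List<Cell> row = new ArrayList<>();
        for (int i = 0; i < 100; i++)
            row.add(new Cell(ByteBuffer.wrap(("c" + i).getBytes()),
                             ByteBuffer.wrap(new byte[8])));
        List<Cell> copy = deepCopy(row); // ~300 new objects for 100 cells
        System.out.println(copy.size());
    }
}
```

This is why a standalone memcpy benchmark understates the real cost: the bulk path measures bandwidth only, while the object-tree path also charges the garbage collector.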
In the medium term we should bring the buffer cache in process, so that we can answer some queries without handing off to the mutation stage (anything known to be non-blocking and fast should be answered immediately by the thread that processed the connection). At that point we will benefit from shared use of the memory pool, concrete control over how much memory readers are using, and zero-copy reads from the buffer cache. I hope we may be able to do this for 3.0.

bq. do a simple memcpy test and see how many MB/s you can get from copying from one pre-allocated pool to another

Are you performing a full object tree copy, and doing so on a running system, to see how it affects the performance of other system components? If not, it doesn't seem to be a useful comparison. Note that this will still create a tremendous amount of heap churn, as most of the memory used by objects right now is on-heap. So copying the records is almost certainly no better for young-gen pressure than what we currently do - in fact, *it probably makes the situation worse*.

bq. it's not the memtable which creates most of the noise and memory pressure in the system (even though it uses a big chunk of heap)

It may not be causing the young-gen pressure you're seeing, but it certainly offers some benefit here: keeping more rows in memory makes recent queries more likely to be answered with zero allocation, which reduces young-gen pressure. It is also a foundation for improving the row cache and introducing a shared page cache, which could bring us closer to zero-allocation reads. It's also not clear to me how you would manage the reclaim of the off-heap allocations without OpOrder - or do you mean to use off-heap buffers only for readers, or to ref-count any memory as you're reading it? Not using off-heap memory for the memtables would negate the main original point of this ticket: to support larger memtables, thus reducing write amplification.
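To make the reclaim trade-off concrete, here is a hedged toy sketch (OpOrderish is a deliberately simplified epoch scheme, not Cassandra's actual OpOrder implementation; RefCounted is likewise illustrative). A ref-counted read pays two atomic operations per cell touched; an epoch-style read pays two atomic operations total, with the barrier cost paid once by the reclaimer.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: per-item reference counting.
final class RefCounted {
    final AtomicInteger refs = new AtomicInteger();
    void ref()   { refs.incrementAndGet(); } // one CAS per item touched
    void unref() { refs.decrementAndGet(); } // and another on release
}

// Illustrative only: a toy epoch scheme in the spirit of OpOrder,
// not its real implementation.
final class OpOrderish {
    private final AtomicLong started  = new AtomicLong();
    private final AtomicLong finished = new AtomicLong();
    void start()  { started.incrementAndGet(); }  // once per read op
    void finish() { finished.incrementAndGet(); } // once per read op
    // The reclaimer pays the cost: wait until every op that might still
    // see the old memory has finished (one barrier per reclaim, not per item).
    void awaitQuiescence() {
        long target = started.get();
        while (finished.get() < target) Thread.onSpinWait();
    }
}

public class ReclaimCost {
    public static void main(String[] args) {
        int resultSize = 10_000;

        // Ref-counting: 2 * resultSize atomic ops for a single read.
        RefCounted[] cells = new RefCounted[resultSize];
        for (int i = 0; i < resultSize; i++) cells[i] = new RefCounted();
        for (RefCounted c : cells) c.ref();
        for (RefCounted c : cells) c.unref();

        // Epoch-style: 2 atomic ops for the same read, regardless of size.
        OpOrderish order = new OpOrderish();
        order.start();
        // ... read all 10,000 cells with no per-cell bookkeeping ...
        order.finish();
        order.awaitQuiescence(); // the reclaimer's one-off cost
        System.out.println("done");
    }
}
```

Under these assumptions the reader-side cost of the epoch scheme is constant per operation, which is the asymmetry the next point turns on.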
Ref-counting incurs overhead linear in the size of the result set, much like copying, and is also fiddly to get right (I'm not convinced it's cleaner or neater), whereas OpOrder incurs overhead proportional to the number of times you reclaim. So if you're using OpOrder, all you're really talking about is a new RefAction: copyToAllocator() or something. It doesn't notably reduce complexity; it just reduces the quality of the end result.

> Partially Off Heap Memtables
> ----------------------------
>
>                 Key: CASSANDRA-6689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 2.1 beta2
>
>         Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)

--
This message was sent by Atlassian JIRA
(v6.2#6252)