[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922983#comment-13922983 ]
Pavel Yaskevich commented on CASSANDRA-6689:
--------------------------------------------

bq. These are already addressed in CASSANDRA-6694.

Is there a branch/patch to see all of the changes involved?

bq. The more we rely on staying within ParNew, the more often we are going to exceed it; and reducing the number of ParNew runs is also a good thing. You said you have 300ms ParNew pauses, occurring every second? So reducing the max latency and total latency is surely a good thing?

I'm not trying to imply that we should rely on ParNew; I'm just saying that all of the read/write requests are short-lived enough to stay inside the young generation region, and even if some slip through, the effect is masked by all of the other long-term allocations we make which do get promoted.

bq. How does this work without knowing the maximum size of a result set? We can't have a client block forever because we didn't provide enough room in the pools. Potentially we could have it error, but this seems inelegant to me, when it can be avoided. It also seems a suboptimal way to introduce back pressure, since it only affects concurrent reads / large reads. We should raise a ticket specifically to address back pressure, IMO, and try to come up with a good all-round solution to the problem.

Let the users specify the pool sizes directly or, if not specified, just take a guess based on total system memory; plus we can add an option to extend them at runtime. For any product that uses a database there is a capacity planning stage and a use-case spec, or at least experimentation, which would allow users to size the pools correctly.

bq. I did not mean to imply pauseless globally, but the memory reclaim operations introduced here are pauseless, thus reducing pauses overall, as whenever we would have had a pause from ParNew/FullGC to reclaim, we would not here.

Sorry, but I still don't get it: do you mean lock-free/non-blocking, or that it does no syscalls, or something similar?
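The pool-sizing idea above could be sketched roughly as follows. This is a minimal illustration only, not Cassandra code; the class and method names are hypothetical, and the "quarter of system memory" guess is an assumed placeholder:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the sizing idea discussed above: use an explicit
// user-provided size when present, otherwise guess from total system
// memory, and allow the limit to be raised (never shrunk) at runtime.
public class OffHeapPoolSize {
    private final AtomicLong limitBytes;

    public OffHeapPoolSize(long explicitBytes, long totalSystemMemoryBytes) {
        // If the user gave an explicit size, trust it; otherwise take a
        // conservative guess (here, arbitrarily, a quarter of system memory).
        long initial = explicitBytes > 0
                ? explicitBytes
                : totalSystemMemoryBytes / 4;
        this.limitBytes = new AtomicLong(initial);
    }

    public long limit() {
        return limitBytes.get();
    }

    // Runtime extension: only ever grow the pool, so in-flight reservations
    // made against the old limit remain valid.
    public void extendTo(long newLimitBytes) {
        limitBytes.updateAndGet(cur -> Math.max(cur, newLimitBytes));
    }

    public static void main(String[] args) {
        OffHeapPoolSize guessed = new OffHeapPoolSize(0, 8L << 30); // 8 GB box
        assert guessed.limit() == 2L << 30;  // guessed: 2 GB

        OffHeapPoolSize explicit = new OffHeapPoolSize(1L << 30, 8L << 30);
        explicit.extendTo(3L << 30);
        assert explicit.limit() == 3L << 30; // extended at runtime
        System.out.println("ok");
    }
}
```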
But that doesn't matter for pauses as much as the allocation throughput and fragmentation of the Java GC do.

bq. I'm not sure why you think this would be a bad thing. It would only help for CL=1, but we are often benchmarked using this, so it's an important thing to be fast on if possible, and there are definitely a number of our users who are okay with CL=1 for whom faster responses would be great. Faster query answering should reduce over-utilisation, assuming some back-pressure built in to MessagingService or the co-ordinator managing its outstanding proxied requests to ensure it isn't overwhelmed by the responses.

The fact is that we have SEDA, at least as a first line of defense against over-utilization, so local reads are scheduled directly to a different stage; we shouldn't be trying to do anything directly in the messaging stage, as that adds other complications not related to this very ticket.

bq. Do you mean you would use jemalloc for every allocation? In which case there are further costs incurred for crossing the JNA barrier so frequently, almost certainly outweighing any benefit to using jemalloc. Otherwise we would need to maintain free-lists ourselves, or perform compacting GC. Personally I think compacting GC is actually much simpler.

As I mentioned, there is already a jemalloc implementation in the Netty project which is pure Java, so we should at least consider it before trying to re-invent it.

bq. It would be great to be more NUMA aware, but this is not about traffic over the interconnect, but simply with the arrays/memory banks themselves, and doesn't address any of the other negative consequences.
bq. You'll struggle to get more than a few GB/s bandwidth out of a modern CPU given that we are copying object trees (even shallow ones - they're still randomly distributed), and we don't want to waste any of that if we can avoid it.

I'm still not sure how much worse it would make things; Java is the worst at cache locality with its object placement anyway, and we are not going to be copying deep trees. Let me outline the steps that I want to see taken to make this incremental, which is how we usually do things in the Cassandra project:

# Code an off-heap allocator, or use an existing one such as one of the ByteBufAllocator implementations (evaluate new vs. existing);
# Change memtables to use the allocator from step #1, and copy data to heap buffers when it's read from the memtable, so it's easy to track the lifetime of buffers;
# Do extensive testing to check how harmful the copy really is for performance, and find ways to optimize it;
# If the results are bad, switch from copying to reference tracking (in all of the commands, native protocol, etc.);
# Do extensive testing to check if that improves the situation;
# Change serialization/deserialization to use the new allocator (a pooled buffer instead of always allocating on heap);
# Same as #5.

> Partially Off Heap Memtables
> ----------------------------
>
>                 Key: CASSANDRA-6689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 2.1 beta2
>
>         Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)

--
This message was sent by Atlassian JIRA
(v6.2#6252)
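The copy-on-read idea in step #2 of the plan above might look roughly like this. It is a sketch only, with hypothetical names, using a plain direct ByteBuffer as a stand-in for the allocator of step #1: values live off-heap, and every read copies the bytes into a fresh on-heap buffer, so readers never hold references into the memtable's off-heap region and its lifetime is trivial to track.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of step #2: values are appended into an off-heap
// (direct) buffer, and reads copy the stored bytes into a new on-heap
// buffer owned by the caller, so the off-heap region can be released
// when the memtable is flushed, regardless of in-flight reads.
public class CopyOnReadRegion {
    private final ByteBuffer offHeap;

    public CopyOnReadRegion(int capacityBytes) {
        // Stand-in for the allocator of step #1.
        this.offHeap = ByteBuffer.allocateDirect(capacityBytes);
    }

    // Appends the value and returns its offset within the region.
    public int write(byte[] value) {
        int offset = offHeap.position();
        offHeap.put(value);
        return offset;
    }

    // Copies the stored bytes into a new heap buffer; the caller owns the
    // copy and keeps no reference into the off-heap region.
    public ByteBuffer read(int offset, int length) {
        ByteBuffer view = offHeap.duplicate();
        view.position(offset).limit(offset + length);
        ByteBuffer onHeap = ByteBuffer.allocate(length);
        onHeap.put(view);
        onHeap.flip();
        return onHeap;
    }

    public static void main(String[] args) {
        CopyOnReadRegion region = new CopyOnReadRegion(1024);
        int off = region.write("hello".getBytes());
        ByteBuffer copy = region.read(off, 5);
        assert !copy.isDirect();                 // the copy lives on heap
        byte[] out = new byte[copy.remaining()];
        copy.get(out);
        assert new String(out).equals("hello");
        System.out.println("ok");
    }
}
```

Whether this copy is cheap enough is exactly what steps #3-#5 are meant to measure before reference tracking is attempted.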