[ 
https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922983#comment-13922983
 ] 

Pavel Yaskevich commented on CASSANDRA-6689:
--------------------------------------------

bq. These are already addressed in CASSANDRA-6694.

Is there a branch/patch to see all of the changes involved?

bq. The more we rely on staying within ParNew, the more often we are going to 
exceed it; and reducing the number of ParNew runs is also a good thing. You 
said you have 300ms ParNew pauses, occurring every second? So reducing the max 
latency and total latency is surely a good thing?

I'm not trying to imply that we should rely on ParNew; I'm just saying that all 
of the read/write requests are short-lived enough to stay inside the young 
generation region, and even if we slip, the effect is masked by all of the other 
long-term allocations we make, which do get promoted.
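
Just to make the tuning levers concrete: these are standard HotSpot flags, but 
the values below are made up for illustration, not project defaults or a 
recommendation.

{code}
-XX:+UseParNewGC              # the young-generation collector under discussion
-XX:+UseConcMarkSweepGC       # the old-generation collector it pairs with
-Xmn800M                      # young generation size; bigger = fewer, larger ParNew runs
-XX:MaxTenuringThreshold=4    # survivor-space copies before an object is promoted
-XX:SurvivorRatio=8           # eden : survivor sizing
{code}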

bq. How does this work without knowing the maximum size of a result set? We 
can't have a client block forever because we didn't provide enough room in the 
pools. Potentially we could have it error, but this seems inelegant to me, when 
it can be avoided. It also seems a suboptimal way to introduce back pressure, 
since it only affects concurrent reads / large reads. We should raise a ticket 
specifically to address back pressure, IMO, and try to come up with a good all 
round solution to the problem.

Let the users specify it directly or, if not specified, just take a guess based 
on total system memory; plus we can add an option to extend it at runtime. Any 
project that uses a database has a capacity planning stage and a use-case spec, 
or at least experimentation, which would allow the pools to be sized correctly.
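
To make the idea concrete, here is a minimal sketch of that sizing policy. The 
option, the class name and the 1/4 fraction are all made up for illustration, 
and the cast to the com.sun MXBean is HotSpot-specific.

{code:java}
import java.lang.management.ManagementFactory;

public final class PoolSizer
{
    /**
     * Pool size in bytes: an operator-specified value wins; otherwise guess
     * from total system memory. The 1/4 fallback is an illustrative
     * assumption, not a proposed default.
     */
    public static long poolSizeBytes(Long configuredMb)
    {
        if (configuredMb != null)
            return configuredMb * 1024L * 1024L;

        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        return os.getTotalPhysicalMemorySize() / 4;
    }
}
{code}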

bq. I did not mean to imply pauseless globally, but the memory reclaim 
operations introduced here are pauseless, thus reducing pauses overall, as 
whenever we would have had a pause from ParNew/FullGC to reclaim, we would not 
here.

Sorry, but I still don't get it: do you mean lock-free/non-blocking, or that it 
does no syscalls, or something similar? That doesn't matter for pauses as much 
as allocation throughput and fragmentation do for the Java GC.

bq. I'm not sure why you think this would be a bad thing. It would only help 
for CL=1, but we are often benchmarked using this, so it's an important thing 
to be fast on if possible, and there are definitely a number of our users who 
are okay with CL=1 for whom faster responses would be great. Faster query 
answering should reduce over-utilisation, assuming some back-pressure built in 
to MessagingService or the co-ordinator managing its outstanding proxied 
requests to ensure it isn't overwhelmed by the responses.

The fact is that we have SEDA as at least a first line of defense against 
over-utilization, so local reads are scheduled directly onto a different stage; 
we shouldn't be trying to do anything in the messaging stage itself, as it adds 
further complications not related to this ticket.
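
To show what I mean by the stage being the natural place for back-pressure, a 
rough sketch of a bounded read stage follows; the names, sizes and rejection 
policy are illustrative, not what Cassandra actually configures.

{code:java}
import java.util.concurrent.*;

public final class ReadStage
{
    // Bounded queue + fixed pool: when the queue fills up, CallerRunsPolicy
    // makes the submitting thread do the work itself, which is the back-pressure.
    private static final ExecutorService STAGE = new ThreadPoolExecutor(
            32, 32, 60L, TimeUnit.SECONDS,
            new ArrayBlockingQueue<Runnable>(4096),
            new ThreadPoolExecutor.CallerRunsPolicy());

    public static <T> Future<T> submit(Callable<T> localRead)
    {
        return STAGE.submit(localRead);
    }
}
{code}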

bq. Do you mean you would use jemalloc for every allocation? In which case 
there are further costs incurred for crossing the JNA barrier so frequently, 
almost certainly outweighing any benefit to using jemalloc. Otherwise we would 
need to maintain free-lists ourselves, or perform compacting GC. Personally I 
think compacting GC is actually much simpler.

As I mentioned, there is already a jemalloc implementation in the Netty project 
which is pure Java, so we should at least consider it before trying to 
re-invent it.

bq. It would be great to be more NUMA aware, but this is not about traffic over 
the interconnect, but simply with the arrays/memory banks themselves, and 
doesn't address any of the other negative consequences. You'll struggle to get 
more than a few GB/s bandwidth out of a modern CPU given that we are copying 
object trees (even shallow ones - they're still randomly distributed), and we 
don't want to waste any of that if we can avoid it

I'm still not sure how much worse it would make things; Java has the worst 
cache locality with its object placement anyway, and we are not going to be 
copying deep trees. Let me outline the steps I want to see taken to make this 
incremental, which is how we usually do things for the Cassandra project (a 
rough sketch of step #2 follows the list): 

# Code an off-heap allocator or use an existing one, like one of the 
ByteBufAlloc implementations (evaluate new vs. existing);
# Change memtables to use the allocator from step #1 and copy data to heap 
buffers when it's read from the memtable, so it's easy to track the lifetime of 
buffers;
# Do extensive testing to check how horrible the copy really is for 
performance, and find ways to optimize;
# If everything is bad, switch from copying to reference tracking (in all of 
the commands, the native protocol, etc.);
# Do extensive testing to check whether it improves the situation;
# Change serialization/deserialization to use the new allocator (pooled buffers 
instead of always allocating on heap);
# Same as #5.
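
And to be clear about what I mean in step #2, here is a very rough sketch with 
hypothetical names (the allocator interface and classes are made up, not 
existing Cassandra code): the memtable owns everything its allocator hands out, 
reads copy back onto the heap, and the whole region is freed when the memtable 
is flushed, so there is no per-buffer lifetime tracking.

{code:java}
import java.nio.ByteBuffer;

interface RegionAllocator
{
    ByteBuffer allocate(int size);   // a slice of an off-heap region owned by the memtable
    void free();                     // release the whole region at once
}

final class DirectRegionAllocator implements RegionAllocator
{
    public ByteBuffer allocate(int size) { return ByteBuffer.allocateDirect(size); }
    public void free() { /* a real version would return the region to a pool */ }
}

final class MemtableSketch
{
    private final RegionAllocator allocator;

    MemtableSketch(RegionAllocator allocator) { this.allocator = allocator; }

    ByteBuffer write(ByteBuffer value)
    {
        ByteBuffer offHeap = allocator.allocate(value.remaining());
        offHeap.put(value.duplicate());     // copy the cell off-heap
        offHeap.flip();
        return offHeap;                     // what the memtable's map stores
    }

    ByteBuffer read(ByteBuffer stored)
    {
        ByteBuffer onHeap = ByteBuffer.allocate(stored.remaining());
        onHeap.put(stored.duplicate());     // copy back on-heap for the caller
        onHeap.flip();
        return onHeap;                      // callers never touch off-heap memory
    }

    void onFlush()
    {
        allocator.free();                   // memtable lifetime == buffer lifetime
    }
}
{code}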


> Partially Off Heap Memtables
> ----------------------------
>
>                 Key: CASSANDRA-6689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 2.1 beta2
>
>         Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
