[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922830#comment-13922830 ]
Pavel Yaskevich commented on CASSANDRA-6689:
--------------------------------------------

You probably still don't understand my point, so let me clarify. I only care about three things: maintainability, consistency, and performance. This is a big chunk of code that somebody has to maintain; it allows inconsistent style (you can set referrer = null, or pass RefAction as "null" in the argument, or perhaps use RefAction.impossible(), and one really has to look through all of it to make sure it checks for "null" everywhere, and so on); it brings its own conventions (e.g. the "_" prefix); and it also performs poorly. Until that is addressed, I'm -1 on this. vnodes were a big chunk of work too, but people were able to split them into a roadmap and finish successfully, so I don't see any reason why we can't do the same here.

bq. Any scheme that copies data will inherently incur larger GC pressure, as we then copy for memtable reads as well as disk reads. Object overhead is in fact larger than the payload for many workloads, so even if we have arenas this effect is not eliminated or even appreciably ameliorated.

For disk reads we have to copy even with mmap, so that we keep no references past deletion time and files can be safely deallocated. So why not copy directly into the memory allocated by the pool? Object overhead would stay inside ParNew bounds (below p999), so object allocation is relatively cheap compared to everything else; that is the goal of the JVM as a whole.

bq. Temporary reader space (and hence your approach) is not predictable: it is not proportional to the number of readers, but to the number and size of columns the readers read. In fact it is larger than this, as we probably have to copy anything we might want to use (given the way the code is encapsulated, this is what I do currently when copying on-heap - anything else would introduce notable complexity), not just columns that end up in the result set.
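The copy-to-pool idea above can be sketched roughly as follows. The single pre-allocated slab standing in for the pool, and the `copyToPool` helper, are hypothetical illustrations, not Cassandra code; the point is only that once the bytes are copied into pool-owned memory, no reference into the mmap'd file survives, so the file can be deallocated safely:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: instead of retaining a reference into an mmap'd
// sstable region, copy the value into memory owned by a pool, so the
// underlying file can be closed and deallocated safely.
public class OffHeapCopy {
    // "poolSlab" is a direct buffer standing in for an arena slab;
    // its position acts as a bump-pointer allocation cursor.
    public static ByteBuffer copyToPool(ByteBuffer diskRegion, ByteBuffer poolSlab) {
        ByteBuffer src = diskRegion.duplicate();        // don't disturb the reader's position
        ByteBuffer dst = poolSlab.duplicate();
        dst.limit(dst.position() + src.remaining());    // carve out exactly the bytes needed
        ByteBuffer slice = dst.slice();
        slice.put(src);                                 // the actual copy off the file mapping
        slice.flip();
        poolSlab.position(poolSlab.position() + slice.capacity()); // bump the cursor
        return slice;                                   // caller now owns a stable copy
    }

    public static void main(String[] args) {
        ByteBuffer fake = ByteBuffer.wrap("value".getBytes());
        ByteBuffer slab = ByteBuffer.allocateDirect(1024);
        ByteBuffer copy = copyToPool(fake, slab);
        byte[] out = new byte[copy.remaining()];
        copy.get(out);
        System.out.println(new String(out)); // prints "value"
    }
}
```

A real arena would hand out slices with proper concurrency control and reclamation; this sketch shows only the bump-pointer copy itself.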
It doesn't matter how much emphasis you put here; it won't make this argument any stronger. The main idea is to have pools of a fixed size, which create back-pressure toward the client under heavy load. That is exactly what operators want: the system slows down gradually, without extreme latency disturbance.

bq. We appear to be in agreement that your approach has higher costs associated with it. Further, copying potentially GB/s of (randomly located) data around destroys the CPU cache, reduces peak memory bandwidth by inducing strobes, consumes bandwidth directly, wastes CPU cycles waiting for the random lookups; all to no good purpose. We should be reducing these costs, not introducing more.

Let's say we live in the modern NUMA world, so we do the following: pin the worker-group threads to CPU cores, so that we have a fixed scope for allocation of the different things. That is why there is no significant bus pressure from copying, relative to everything else the JVM/Cassandra does with memory (not even significant cache-coherency traffic).

bq. It is simply not clear, despite your assertion of clarity, how you would reclaim any freed memory without "separate GC" (what else is GC but this reclamation?), however you want to call it, when it will be interspersed with non-freed memory, nor how you would guard the non-atomic copying (ref-counting, OpOrder, Lock: what?). Without this information it is not clear to me that it would be any simpler either.

The same way jemalloc or any other allocator does it; at least that is not reinventing the wheel.

bq. Pauseless operation, so improved predictability

What do you mean by this? We still live on the JVM, do we not? Also, what would it do in a low-memory situation: allocate from the heap? wait? That is not pauseless operation.
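The fixed-size pool with back-pressure described above could look, in outline, like the following. `FixedSizePool` and its method names are illustrative, not Cassandra's actual API; the sketch assumes a semaphore tracks the pool's remaining bytes:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the back-pressure idea: a memtable pool of fixed
// byte capacity. When the pool is exhausted, allocating threads block,
// which propagates gradually slower service back to clients instead of a
// latency cliff.
public class FixedSizePool {
    private final Semaphore capacityBytes;

    public FixedSizePool(int totalBytes) {
        this.capacityBytes = new Semaphore(totalBytes);
    }

    public ByteBuffer allocate(int size) throws InterruptedException {
        capacityBytes.acquire(size);            // blocks under heavy load: back-pressure
        return ByteBuffer.allocateDirect(size); // stand-in for carving from an arena
    }

    public void release(ByteBuffer buf) {
        capacityBytes.release(buf.capacity()); // capacity returns to the pool on flush
    }

    public int availableBytes() {
        return capacityBytes.availablePermits();
    }
}
```

Under heavy load `allocate` simply waits until a flush returns capacity, so writers slow down gradually instead of driving allocation (and latency) off a cliff.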
bq. Lock-freedom and low overhead, so we move closer to being able to answer queries directly from the messaging threads themselves, improving latency and throughput

We won't be able to answer queries directly from the messaging threads, for a number of reasons not even indirectly related to your approach; at the very least it would break SEDA, which is also supposed to be a safeguard against over-utilization.

bq. An alternative approach needs, IMO, to demonstrate a clear superiority to the patch that is already available, especially when it will incur further work to produce. It is not clear to me that your solution is superior in any regard, nor any simpler. It also seems to be demonstrably less predictable and more costly, so I struggle to see how it could be considered preferable.

Overall, I'm not questioning the idea: being able to track what goes where would be great. I'm questioning the implementation and its trade-offs compared to other approaches.

> Partially Off Heap Memtables
> ----------------------------
>
>                 Key: CASSANDRA-6689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 2.1 beta2
>
>         Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)

--
This message was sent by Atlassian JIRA
(v6.2#6252)