[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922830#comment-13922830 ]
Pavel Yaskevich commented on CASSANDRA-6689:
--------------------------------------------

You probably still don't understand my point, so let me clarify. I only care about three things: maintainability, consistency, and performance. This is a big chunk of code that somebody has to maintain; it allows inconsistent style (you can set referrer = null, or pass RefAction as "null" in the argument, or perhaps use RefAction.impossible(), and one really has to look through all of it to make sure it checks for "null" everywhere, and so on); it brings its own conventions (e.g. the "_" prefix); and it also performs poorly. Until that is addressed, I'm -1 on this. vnodes were a big chunk of work too, but people were able to split them into a roadmap and finish successfully, so I don't see any reason why we can't do the same here.

bq. Any scheme that copies data will inherently incur larger GC pressure, as we then copy for memtable reads as well as disk reads. Object overhead is in fact larger than the payload for many workloads, so even if we have arenas this effect is not eliminated or even appreciably ameliorated.

For disk reads we have to copy even with mmap, so that we keep no references past deletion time and files can be safely deallocated. So why not copy directly into the memory allocated by the pool? Object overhead would stay inside ParNew bounds (below p999), so object allocation is relatively cheap compared to everything else; that is the goal of the JVM as a whole.

bq. Temporary reader space (and hence your approach) is not predictable: it is not proportional to the number of readers, but to the number and size of columns the readers read. In fact it is larger than this, as we probably have to copy anything we might want to use (given the way the code is encapsulated, this is what I do currently when copying on-heap - anything else would introduce notable complexity), not just columns that end up in the result set.
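The copy-to-pool idea above can be sketched roughly as follows. The single pre-allocated slab standing in for the pool, and the `copyToPool` helper, are hypothetical illustrations, not Cassandra code; the point is only that once the bytes are copied into pool-owned memory, no reference into the mmap'd file survives, so the file can be deallocated safely:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: instead of retaining a reference into an mmap'd
// sstable region, copy the value into memory owned by a pool, so the
// underlying file can be closed and deallocated safely.
public class OffHeapCopy {
    // "poolSlab" is a direct buffer standing in for an arena slab;
    // its position acts as a bump-pointer allocation cursor.
    public static ByteBuffer copyToPool(ByteBuffer diskRegion, ByteBuffer poolSlab) {
        ByteBuffer src = diskRegion.duplicate();        // don't disturb the reader's position
        ByteBuffer dst = poolSlab.duplicate();
        dst.limit(dst.position() + src.remaining());    // carve out exactly the bytes needed
        ByteBuffer slice = dst.slice();
        slice.put(src);                                 // the actual copy off the file mapping
        slice.flip();
        poolSlab.position(poolSlab.position() + slice.capacity()); // bump the cursor
        return slice;                                   // caller now owns a stable copy
    }

    public static void main(String[] args) {
        ByteBuffer fake = ByteBuffer.wrap("value".getBytes());
        ByteBuffer slab = ByteBuffer.allocateDirect(1024);
        ByteBuffer copy = copyToPool(fake, slab);
        byte[] out = new byte[copy.remaining()];
        copy.get(out);
        System.out.println(new String(out)); // prints "value"
    }
}
```

A real arena would hand out slices with proper concurrency control and reclamation; this sketch shows only the bump-pointer copy itself.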
It doesn't matter how much emphasis you put here; it won't make this argument any stronger. The main idea is to have pools of a fixed size, which create back-pressure toward the client under heavy load. That is exactly what operators want: the system slows down gradually, without extreme latency disturbance.

bq. We appear to be in agreement that your approach has higher costs associated with it. Further, copying potentially GB/s of (randomly located) data around destroys the CPU cache, reduces peak memory bandwidth by inducing strobes, consumes bandwidth directly, wastes CPU cycles waiting for the random lookups; all to no good purpose. We should be reducing these costs, not introducing more.

Let's say we live in the modern NUMA world, so we do the following: pin the worker-group threads to CPU cores, so that we have a fixed scope for allocation of the different things. That is why there is no significant bus pressure from copying, relative to everything else the JVM/Cassandra does with memory (not even significant cache-coherency traffic).

bq. It is simply not clear, despite your assertion of clarity, how you would reclaim any freed memory without "separate GC" (what else is GC but this reclamation?), however you want to call it, when it will be interspersed with non-freed memory, nor how you would guard the non-atomic copying (ref-counting, OpOrder, Lock: what?). Without this information it is not clear to me that it would be any simpler either.

The same way jemalloc or any other allocator does it; at least that is not reinventing the wheel.

bq. Pauseless operation, so improved predictability

What do you mean by this? We still live on the JVM, do we not? Also, what would it do in a low-memory situation: allocate from the heap? wait? That is not pauseless operation.
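The fixed-size pool with back-pressure described above could look, in outline, like the following. `FixedSizePool` and its method names are illustrative, not Cassandra's actual API; the sketch assumes a semaphore tracks the pool's remaining bytes:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the back-pressure idea: a memtable pool of fixed
// byte capacity. When the pool is exhausted, allocating threads block,
// which propagates gradually slower service back to clients instead of a
// latency cliff.
public class FixedSizePool {
    private final Semaphore capacityBytes;

    public FixedSizePool(int totalBytes) {
        this.capacityBytes = new Semaphore(totalBytes);
    }

    public ByteBuffer allocate(int size) throws InterruptedException {
        capacityBytes.acquire(size);            // blocks under heavy load: back-pressure
        return ByteBuffer.allocateDirect(size); // stand-in for carving from an arena
    }

    public void release(ByteBuffer buf) {
        capacityBytes.release(buf.capacity()); // capacity returns to the pool on flush
    }

    public int availableBytes() {
        return capacityBytes.availablePermits();
    }
}
```

Under heavy load `allocate` simply waits until a flush returns capacity, so writers slow down gradually instead of driving allocation (and latency) off a cliff.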
bq. Lock-freedom and low overhead, so we move closer to being able to answer queries directly from the messaging threads themselves, improving latency and throughput

We won't be able to answer queries directly from the messaging threads, for a number of reasons not even indirectly related to your approach; at the very least it would break SEDA, which is also supposed to be a safeguard against over-utilization.

bq. An alternative approach needs, IMO, to demonstrate a clear superiority to the patch that is already available, especially when it will incur further work to produce. It is not clear to me that your solution is superior in any regard, nor any simpler. It also seems to be demonstrably less predictable and more costly, so I struggle to see how it could be considered preferable.

Overall, I'm not questioning the idea: being able to track what goes where would be great. I'm questioning the implementation and its trade-offs compared to other approaches.

> Partially Off Heap Memtables
> ----------------------------
>
>                 Key: CASSANDRA-6689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 2.1 beta2
>
>         Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)

--
This message was sent by Atlassian JIRA
(v6.2#6252)