[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921110#comment-13921110 ]
Pavel Yaskevich edited comment on CASSANDRA-6689 at 3/5/14 6:55 PM:
--------------------------------------------------------------------

bq. I've stated clearly what this introduces as a benefit: overwrite workloads no longer cause excessive flushes

If you make a copy of the memtable buffer beforehand, you can clearly return it to the allocator once it is overwritten or otherwise becomes useless, in the process of merging columns with the previous row contents.

bq. Your next sentence states how this is a large cause of memory consumption, so surely we should be using that memory if possible for other uses (returning it to the buffer cache, or using it internally for more caching)?

It doesn't state that this is a *large cause of memory consumption*; it states that it has an additional cost, but in the steady state it won't allocate over the limit because of the properties of the system we have, namely the fixed number of threads.

bq. Are you performing a full object tree copy, and doing this with a running system to see how it affects the performance of other system components? If not, it doesn't seem to be a useful comparison. Note that this will still create a tremendous amount of heap churn, as most of the memory used by objects right now is on-heap. So copying the records is almost certainly no better for young gen pressure than what we currently do - in fact, it probably makes the situation worse.

Do you mean this? Let's say we copy a Cell (or Column object), which is one level deep, so we just allocate additional space for the object headers and do a copy; most of the work would be spent copying the data (name/value) anyway. Since we want to live inside of ParNew (so we can just discard already dead objects), see how many such allocations you can do in e.g. 1 second, then wipe the whole thing (the equivalent of ParNew, which rejects dead objects and compacts) and do it again.
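The "copy and live inside ParNew" experiment described above can be sketched roughly as follows. This is a hypothetical micro-benchmark, not Cassandra code: the `Cell` class, its field sizes, and the batch size are all illustrative stand-ins for the real types.

```java
import java.util.ArrayList;
import java.util.List;

public class CopyChurnSketch {
    // Illustrative stand-in for a one-level-deep Cell: object header plus name/value payload.
    static final class Cell {
        final byte[] name;
        final byte[] value;
        Cell(byte[] name, byte[] value) { this.name = name; this.value = value; }
        // Shallow structure, deep data: the cost of a copy is dominated by the array copies.
        Cell copy() { return new Cell(name.clone(), value.clone()); }
    }

    public static void main(String[] args) {
        Cell template = new Cell(new byte[16], new byte[64]);
        List<Cell> live = new ArrayList<>();
        long copies = 0;
        long deadline = System.nanoTime() + 1_000_000_000L; // run for roughly 1 second
        while (System.nanoTime() < deadline) {
            live.add(template.copy());
            if (live.size() == 10_000) {
                copies += live.size();
                live.clear(); // "wipe the whole thing" -- analogous to a young-gen pass discarding dead objects
            }
        }
        copies += live.size();
        System.out.println("copies made in ~1s: " + copies);
    }
}
```

Counting how many such copies fit in a second, then discarding the batch, approximates the allocate-then-collect cycle the comment proposes measuring.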
We are doing mlockall too, which should make that even faster, as we are sure the heap is pre-faulted already.

bq. It may not be causing the young gen pressure you're seeing, but it certainly offers some benefit here by keeping more rows in memory so recent queries are more likely to be answered with zero allocation, so reducing young gen pressure; it is also a foundation for improving the row cache and introducing a shared page cache which could bring us closer to zero allocation reads. _And so on...._

I'm not sure how this would help in the case of the row cache: once a reference is added to the row cache, the memtable would hang around until that row is purged. So if there is a long-lived row (written once, read multiple times) in each of the regions (and we reclaim based on regions), wouldn't that keep the memtable around longer than expected?

bq. It's also not clear to me how you would be managing the reclaim of the off-heap allocations without OpOrder, or do you mean to only use off-heap buffers for readers, or to ref-count any memory as you're reading it? Not using off-heap memory for the memtables would negate the main original point of this ticket: to support larger memtables, thus reducing write amplification. Ref-counting incurs overhead linear to the size of the result set, much like copying, and is also fiddly to get right (not convinced it's cleaner or neater), whereas OpOrder incurs overhead proportional to the number of times you reclaim. So if you're using OpOrder, all you're really talking about is a new RefAction: copyToAllocator() or something. So it doesn't notably reduce complexity, it just reduces the quality of the end result.
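The ref-counting overhead being argued about in the quote above can be sketched like this. It is a toy illustration, not Cassandra's API: `Region`, `readCells`, and the retain/release names are invented. The point it shows is that every cell a read touches pays an increment/decrement, i.e. bookkeeping linear in the result-set size, whereas a barrier scheme such as OpOrder pays only per reclamation.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountSketch {
    // Invented stand-in for an off-heap region holding cell data.
    static final class Region {
        final AtomicInteger refs = new AtomicInteger(1); // 1 = owned by the memtable
        volatile boolean freed = false;

        void retain() { refs.incrementAndGet(); }

        void release() {
            if (refs.decrementAndGet() == 0) {
                freed = true; // in reality: return the off-heap buffer to the allocator
            }
        }
    }

    // A read must retain/release every region it touches: O(result size) bookkeeping.
    static int readCells(List<Region> resultSet) {
        int touched = 0;
        for (Region r : resultSet) {
            r.retain();
            try {
                touched++; // ... actually read the cell data here ...
            } finally {
                r.release();
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        Region r = new Region();
        readCells(List.of(r, r, r)); // three touches, three retain/release pairs
        r.release();                 // the memtable drops its ownership, e.g. on flush
        System.out.println("freed: " + r.freed);
    }
}
```

With OpOrder the per-cell retain/release pairs disappear; reads run inside an ordered group and reclamation waits for the group to drain, so the cost scales with reclamations rather than with reads.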
In terms of memory usage, copying adds an additional linear cost, yes, but at the same time it makes the system's behavior more controllable/predictable, which is what ops usually care about. Even with the artificial stress test there seems to be a slowdown once the off-heap feature is enabled, which is no surprise once you look at how much complexity it actually adds.

bq. Also, I'd love to see some evidence for this (particularly the latter). I'm not disputing it, just would like to see what caused you to reach these conclusions. These definitely warrant separate tickets IMO, but if you have evidence for it, it would help direct any work.

Well, it seems like you have never operated a real Cassandra cluster, have you? All of the problems I have listed here are well known; you can even simulate this with Docker VMs by making the internal network gradually slower. There is *no* built-in back-pressure mechanism, so right now Cassandra would accept a bunch of operations at normal speed (if the outgoing link is physically different from the internal one, which should always be the case) but would suddenly just stop accepting anything and fail internally because of a GC storm caused by all of the internode buffers hanging around, and that would spread across the cluster very quickly.
> Partially Off Heap Memtables
> ----------------------------
>
>                 Key: CASSANDRA-6689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 2.1 beta2
>
>         Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)

--
This message was sent by Atlassian JIRA
(v6.2#6252)