[ https://issues.apache.org/jira/browse/CASSANDRA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921110#comment-13921110 ]

Pavel Yaskevich edited comment on CASSANDRA-6689 at 3/5/14 6:55 PM:
--------------------------------------------------------------------

bq. I've stated clearly what this introduces as a benefit: overwrite workloads 
no longer cause excessive flushes

If you copy out of the memtable buffer beforehand, you can clearly return that 
memory to the allocator once it's overwritten or becomes otherwise useless in 
the process of merging columns with the previous row contents.
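
A minimal sketch of the idea, with hypothetical NativeAllocator/Cell types 
standing in for the real memtable machinery (illustrative only, not 
Cassandra's actual API):

{code:java}
import java.nio.ByteBuffer;

// Illustrative sketch only: NativeAllocator and Cell are hypothetical
// stand-ins, not Cassandra's actual memtable classes.
final class NativeAllocator
{
    ByteBuffer allocate(int size) { return ByteBuffer.allocateDirect(size); }
    void free(ByteBuffer buffer) { /* return the buffer to the pool */ }
}

final class Cell
{
    final ByteBuffer name, value;
    Cell(ByteBuffer name, ByteBuffer value) { this.name = name; this.value = value; }
}

final class RowMerge
{
    // When an incoming cell supersedes an existing one during a row merge,
    // the old cell's buffers can be handed straight back to the allocator
    // instead of lingering until a region-wide reclaim.
    static Cell resolve(Cell existing, Cell incoming, NativeAllocator allocator)
    {
        if (existing != null)
        {
            allocator.free(existing.name);
            allocator.free(existing.value);
        }
        return incoming;
    }
}
{code}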

bq. Your next sentence states how this is a large cause of memory consumption, 
so surely we should be using that memory if possible for other uses (returning 
it to the buffer cache, or using it internally for more caching)?

It doesn't state that it is a *large cause of memory consumption*; it states 
that it has an additional cost, but in the steady state it won't be allocating 
over the limit because of the properties of the system that we have, namely 
the fixed number of threads.

bq. Are you performing a full object tree copy, and doing this with a running 
system to see how it affects the performance of other system components? If 
not, it doesn't seem to be a useful comparison. Note that this will still 
create a tremendous amount of heap churn, as most of the memory used by objects 
right now is on-heap. So copying the records is almost certainly no better for 
young gen pressure than what we currently do - in fact, it probably makes the 
situation worse.

Do you mean this? Let's say we copy a Cell (or Column object), which is 1 
level deep, so we just allocate additional space for the object headers and do 
a copy; most of the work would be spent copying the data (name/value) anyway. 
Since we want to live inside of ParNew (so we can just discard already dead 
objects), see how many such allocations you can do in e.g. 1 second, then wipe 
the whole thing (the equivalent of ParNew, which rejects dead objects and 
compacts) and do it again. We are doing mlockall too, which should make that 
even faster, as we are sure the heap is pre-faulted already.
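
For reference, a rough version of that measurement could look like the 
following sketch (class and field names are illustrative; this shows the 
methodology, not a rigorous benchmark):

{code:java}
import java.nio.ByteBuffer;

// Sketch: count how many 1-level-deep Cell copies complete in one second
// while keeping at most one copy reachable, so the garbage dies young
// and ParNew can discard it wholesale.
public final class CopyThroughput
{
    static final class Cell
    {
        final ByteBuffer name, value;
        Cell(ByteBuffer name, ByteBuffer value) { this.name = name; this.value = value; }

        // New object header plus a copy of the name/value bytes.
        Cell copy()
        {
            ByteBuffer n = ByteBuffer.allocate(name.remaining()).put(name.duplicate());
            ByteBuffer v = ByteBuffer.allocate(value.remaining()).put(value.duplicate());
            n.flip();
            v.flip();
            return new Cell(n, v);
        }
    }

    public static void main(String[] args)
    {
        Cell template = new Cell(ByteBuffer.wrap(new byte[32]), ByteBuffer.wrap(new byte[256]));
        long deadline = System.nanoTime() + 1_000_000_000L;
        long copies = 0;
        Cell live = null;
        while (System.nanoTime() < deadline)
        {
            live = template.copy(); // previous copy is now dead
            copies++;
        }
        System.out.println(copies + " copies/sec, last value size: " + live.value.remaining());
    }
}
{code}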

bq. It may not be causing the young gen pressure you're seeing, but it 
certainly offers some benefit here by keeping more rows in memory so recent 
queries are more likely to be answered with zero allocation, so reducing young 
gen pressure; it is also a foundation for improving the row cache and 
introducing a shared page cache which could bring us closer to zero allocation 
reads. _And so on...._

I'm not sure how this would help in the case of the row cache: once a 
reference is added to the row cache, the memtable would hang around until that 
row is purged. So if there is a long-lived row (written once, read multiple 
times) in each of the regions (and we reclaim based on regions), would that 
keep the memtable around longer than expected?
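
To make the concern concrete, here is a hedged illustration with hypothetical 
Region/CachedRow types: a single long-lived cached row pins its whole region, 
and with it the memtable memory behind it, until the row is purged.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of region-based reclaim being pinned by the row
// cache; Region, CachedRow and the ref-count scheme are hypothetical.
final class Region
{
    final AtomicInteger refs = new AtomicInteger();
    boolean reclaimable() { return refs.get() == 0; }
}

final class CachedRow
{
    private final Region region;

    CachedRow(Region region)
    {
        this.region = region;
        region.refs.incrementAndGet(); // pins the whole region
    }

    void purge()
    {
        region.refs.decrementAndGet(); // only now can the region go away
    }
}

final class RowCachePinning
{
    public static void main(String[] args)
    {
        Region region = new Region();
        CachedRow hotRow = new CachedRow(region); // write once, read many

        // The region (and the memtable memory it backs) cannot be
        // reclaimed while the row cache still references it.
        System.out.println("reclaimable: " + region.reclaimable()); // false

        hotRow.purge();
        System.out.println("reclaimable: " + region.reclaimable()); // true
    }
}
{code}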

bq. It's also not clear to me how you would be managing the reclaim of the 
off-heap allocations without OpOrder, or do you mean to only use off-heap 
buffers for readers, or to ref-count any memory as you're reading it? Not using 
off-heap memory for the memtables would negate the main original point of this 
ticket: to support larger memtables, thus reducing write amplification. 
Ref-counting incurs overhead linear to the size of the result set, much like 
copying, and is also fiddly to get right (not convinced it's cleaner or 
neater), whereas OpOrder incurs overhead proportional to the number of times 
you reclaim. So if you're using OpOrder, all you're really talking about is a 
new RefAction: copyToAllocator() or something. So it doesn't notably reduce 
complexity, it just reduces the quality of the end result.

In terms of memory usage, copying adds an additional linear cost, yes, but at 
the same time it makes the system behavior more controllable/predictable, 
which is what ops usually care about. Even with the artificial stress test, 
there seems to be a performance hit once the off-heap feature is enabled, 
which is no surprise once you look at how much complexity it actually adds.

bq. Also, I'd love to see some evidence for this (particularly the latter). I'm 
not disputing it, just would like to see what caused you to reach these 
conclusions. These definitely warrant separate tickets IMO, but if you have 
evidence for it, it would help direct any work.

Well, it seems like you have never operated a real Cassandra cluster, have 
you? All of the problems that I have listed here are well known; you can even 
simulate this with Docker VMs by making the internal network gradually slower. 
There is *no* back pressure mechanism built in, so right now Cassandra would 
accept a bunch of operations at normal speed (if the outgoing link is 
physically different from the internal one, which should always be the case) 
but would suddenly just stop accepting anything and fail internally because of 
a GC storm caused by all of the internode buffers hanging around, and that 
would spread across the cluster very quickly.





> Partially Off Heap Memtables
> ----------------------------
>
>                 Key: CASSANDRA-6689
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6689
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 2.1 beta2
>
>         Attachments: CASSANDRA-6689-small-changes.patch
>
>
> Move the contents of ByteBuffers off-heap for records written to a memtable.
> (See comments for details)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
