[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336749#comment-15336749 ]

Chris Lohfink commented on CASSANDRA-11327:
---
I looked at it a few days ago; seeing the metric name change now, I'm +1.

> Maintain a histogram of times when writes are blocked due to no available memory
>
> Key: CASSANDRA-11327
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11327
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Ariel Weisberg
> Assignee: Ariel Weisberg
>
> I have a theory that part of the reason C* is so sensitive to timeouts during saturating write load is that throughput is basically a sawtooth with valleys at zero. This is something I have observed, and it gets worse as you add 2i to a table or do anything that decreases the throughput of flushing.
>
> I think the fix for this is to incrementally release memory pinned by memtables and 2i during flushing instead of releasing it all at once. I know that's not really possible, but we can fake it with memory accounting that tracks how close to completion flushing is and releases permits for additional memory. This will lead to a bit of a sawtooth in real memory usage, but we can account for that so the peak footprint is the same.
>
> I think the end result of this change will be a sawtooth, but the valley of the sawtooth will not be zero; it will be the rate at which flushing progresses. Optimizing the rate at which flushing progresses and its fairness with other work can then be tackled separately.
>
> Before we do this I think we should demonstrate that pinned memory due to flushing is actually the issue by getting better visibility into the distribution of instances of not having any memory, by maintaining a histogram of spans of time where no memory is available and a thread is blocked. [MemtableAllocator$SubPool.allocate(long)|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/utils/memory/MemtableAllocator.java#L186] should be a relatively straightforward entry point for this. The first thread to block can mark the start of memory starvation and the last thread out can mark the end. Have a periodic task that tracks the amount of time spent blocked per interval of time, and if it is greater than some threshold, log with more details, possibly at debug.
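For illustration, a minimal sketch of the first-in/last-out span tracking the description proposes, using a Dropwizard-style histogram as Cassandra's metrics do; the class and field names are hypothetical, not the committed patch:

{code:java}
// Hedged sketch, not the committed patch: track spans of memory starvation.
// The first thread to block opens a span; the last thread out closes it and
// records its duration. All names here are hypothetical.
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

import com.codahale.metrics.ExponentiallyDecayingReservoir;
import com.codahale.metrics.Histogram;

public final class StarvationTracker
{
    private final AtomicInteger blockedThreads = new AtomicInteger();
    private final AtomicLong spanStartNanos = new AtomicLong();
    private final Histogram starvationSpans = new Histogram(new ExponentiallyDecayingReservoir());

    public void onBlock()
    {
        // First thread in marks the start of memory starvation
        if (blockedThreads.incrementAndGet() == 1)
            spanStartNanos.set(System.nanoTime());
    }

    public void onUnblock()
    {
        // Last thread out marks the end and records the span's length.
        // (A sketch: it ignores the narrow race with a new span opening.)
        if (blockedThreads.decrementAndGet() == 0)
            starvationSpans.update(System.nanoTime() - spanStartNanos.get());
    }
}
{code}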
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336276#comment-15336276 ]

Joshua McKenzie commented on CASSANDRA-11327:
---
[~cnlwsu]: you good taking review on this?
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328832#comment-15328832 ]

Chris Lohfink commented on CASSANDRA-11327:
---
Minor nitpick/bikeshedding: all other metric names are camel case; having one with spaces may mess up some tools (i.e. command-line JMX readers).
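For illustration, a hedged example of the nitpick, against the Dropwizard {{MetricRegistry}} API that Cassandra's metrics are built on; the metric names here are hypothetical:

{code:java}
// Hedged illustration of the naming nitpick; metric names are hypothetical,
// registry usage follows the Dropwizard metrics API Cassandra builds on.
import com.codahale.metrics.MetricRegistry;

public final class MetricNameExample
{
    public static void main(String[] args)
    {
        MetricRegistry registry = new MetricRegistry();

        // A name with spaces registers fine, but command-line JMX readers
        // that split arguments on whitespace can choke on it:
        registry.histogram("Blocked On Allocation");

        // Camel case, consistent with every other Cassandra metric name:
        registry.histogram("BlockedOnAllocation");
    }
}
{code}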
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328789#comment-15328789 ]

Ariel Weisberg commented on CASSANDRA-11327:
---
||Code||utests||dtests||
|[3.0 code|https://github.com/apache/cassandra/compare/cassandra-3.0...aweisberg:CASSANDRA-11327-3.0?expand=1]|[utests|https://cassci.datastax.com/view/Dev/view/aweisberg/job/aweisberg-CASSANDRA-11327-3.0-testall/]|[dtests|http://cassci.datastax.com/view/Dev/view/aweisberg/job/aweisberg-CASSANDRA-11327-3.0-dtest/]|
|[trunk code|https://github.com/apache/cassandra/compare/trunk...aweisberg:CASSANDRA-11327-trunk?expand=1]|[utests|https://cassci.datastax.com/view/Dev/view/aweisberg/job/aweisberg-CASSANDRA-11327-trunk-testall/]|[dtests|http://cassci.datastax.com/view/Dev/view/aweisberg/job/aweisberg-CASSANDRA-11327-trunk-dtest/]|
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192380#comment-15192380 ]

Ariel Weisberg commented on CASSANDRA-11327:
---
I think that even if we only provided a mechanism, and the default policy was to maximize memtable utilization before backpressure, that would be a big improvement. We could get feedback on what behavior people prefer.

It seems like the commit log should not be any more of a bottleneck than it is now. If the CL was able to go fast enough that it could fill up the memtables, then it should have enough capacity to do that indefinitely, since it doesn't really defer work like flushing or compaction.

Yes, it's off topic, but I'll make sure it's copied over to an implementation ticket if we get there.
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190843#comment-15190843 ]

Benedict commented on CASSANDRA-11327:
---
This is all of course massively off topic. This particular ticket could have been closed in a tiny fraction of the time of this discussion :)
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190841#comment-15190841 ]

Benedict commented on CASSANDRA-11327:
---
bq. When timeouts do occur don't those also introduce additional workload amplification in the form of retries, hinted handoff, and repair?

I already agree that this change would help in situations of sustained overload, over the level provided for by the cluster capacity. Just that it lowers the level at which that occurs in order to achieve it (and increases the frequency of occurrence). Perhaps I was indeed being too absolutist, though, as it is sort of ridiculous how badly we cope with bulk loading.

Still, we have to be careful here, at least as far as defaults are concerned, as any change harms SLAs for existing clusters - although admittedly the increase in usable memtable space in each release is helping clusters, so that anyone upgrading from the 2.0 era will have plenty of headroom to introduce some behaviour like this (as will 2.1 and 2.2 for many workloads).

There are lots of issues around timeouts and backpressure, and it's worth noting that this by itself is probably not sufficient. There are the following concerns:
* We depend on TCP backpressure to the clients, but we test with only one client; the TCP backpressure mechanism is massively undercut when there are many such clients, as with enough send and receive buffers there will be too many messages already in flight to accommodate, no matter what we do
* Most clients are by default fully asynchronous, meaning if they can exceed the rate of work provision to the cluster, there's little backpressure can do anyway
* The commit log is often as much (or more) of a cause of these spikes, when it becomes saturated; making it permit trickle progress when recovering from overload is essential for this having any impact

Assuming each of these is sufficiently mitigated, the goal of simply ensuring _some_ progress is made to prevent client timeouts should presumably be possible with much lower requirements than 50% of memory allocated to this contingency. I would say that, at the very least, it should be configurable how much. Say, by default, 10% of memory is kept as contingency, so that for every 9 bytes flushed 1 byte is made available. This permits a trickle of acknowledgements to prevent timeouts, while minimising harm to SLAs. Or 35%, but with a logarithmic scaling, so that the first bytes written provide more free bytes than the last, so that latency is only gradually introduced to cope with overload.
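For concreteness, a minimal sketch of those two contingency policies; {{ContingencyPolicy}} and its methods are hypothetical names, and the logarithmic curve is one plausible reading of the scaling described above, not a committed design:

{code:java}
// Hedged sketch of the configurable-contingency idea above. All names are
// hypothetical, and the logarithmic curve is one plausible interpretation.
public final class ContingencyPolicy
{
    /** Fraction of memtable memory withheld as contingency, e.g. 0.10 or 0.35. */
    private final double fraction;

    public ContingencyPolicy(double fraction)
    {
        this.fraction = fraction;
    }

    /**
     * Fixed-fraction policy: at 10%, every 9 bytes of flush progress release
     * 1 withheld byte, so permits trickle back linearly and the contingency
     * is exhausted exactly when the flush completes.
     */
    public long linearRelease(long flushedBytes)
    {
        return (long) (flushedBytes * fraction / (1 - fraction)); // 1/9 at 10%
    }

    /**
     * Logarithmic policy: the same total contingency, released on a concave
     * curve so early flush progress frees proportionally more, and latency
     * is introduced only gradually as overload is approached.
     */
    public long logarithmicRelease(long flushedBytes, long flushTotalBytes, long contingencyBytes)
    {
        double progress = (double) flushedBytes / flushTotalBytes;              // in [0, 1]
        return (long) (contingencyBytes * Math.log1p(progress) / Math.log(2));  // all of it at 100%
    }
}
{code}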
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189628#comment-15189628 ]

Ariel Weisberg commented on CASSANDRA-11327:
---
bq. Perhaps you should outline precisely the algorithm you propose, since there's a whole class of similar algorithms and it would narrow the discussion?

There is probably some tuning that could be done to make this smarter, but basically: if right now 1/4 of the heap is the memtable memory limit, change it to 1/8th (cut it in half). Let's ignore 2i and look at just a memtable flushing. Let's say we know what the expected on-disk size is, as well as the number of partitions or rows, so we can guess at the average weight of each partition or row. Every N partitions or rows we can update the amount of free memory to reflect the weight of what was flushed. Or we could be more precise, if tracking the weight of what is flushed isn't difficult. Peak footprint remains the same since we have cut the limit in half, but actual footprint will vary between the limit and double the limit, as flushing releases memory to writers while the memory is still committed.

bq. By reducing their size, transient overload becomes more frequent, and SLAs are not met or the cluster capacity must be increased.

I agree this is the biggest problem. I think you are right that in terms of dealing with variance, in the worst case it reduces memory utilization by half, but in the average or real case maybe it's not so bad? Maybe flushing isn't super far behind; it's just a little behind?

bq. So I don't personally see the rationale for making transient overload (Cassandra's strong suit) worse, in exchange for a really temporary reprieve on sustained overload.

I don't think we should dismiss this out of hand. I think there are users who do care about saturating load and who care about the difficulty of determining exactly how fast they can write to the database. Spark and bulk loading are both pain points. Right now it's very difficult because the database doesn't provide any notice that you are about to saturate; you just start getting mass timeouts instead of backpressure. When timeouts do occur, don't those also introduce additional workload amplification in the form of retries, hinted handoff, and repair?

I am not completely sold that this kind of thing would cripple the ability of memtables to handle variance in arrival distribution. It reduces the window and magnitude of variance that can be tolerated, certainly, but for capacity planning purposes peak throughput isn't the only factor.
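A minimal sketch of that accounting scheme, under stated assumptions: all names here are hypothetical, and Cassandra's real accounting lives in {{MemtablePool}}/{{MemtableAllocator}} and differs in detail:

{code:java}
// Hedged sketch of the incremental-release accounting described above. All
// names are hypothetical. Permits are bytes here for simplicity; a real
// version would use coarser units to avoid int overflow.
import java.util.concurrent.Semaphore;

public final class FlushAccounting
{
    private final Semaphore memoryPermits;   // one permit per byte of memtable budget
    private final long bytesPerRow;          // estimated average weight of a row
    private final int rowsPerRelease;        // release permits every N rows flushed
    private int rowsSinceRelease;

    public FlushAccounting(int memtableLimitBytes, long bytesPerRow, int rowsPerRelease)
    {
        // Halve the old limit so "memory in use" plus "memory still committed
        // by an in-progress flush" never exceeds the old peak footprint.
        this.memoryPermits = new Semaphore(memtableLimitBytes / 2, true);
        this.bytesPerRow = bytesPerRow;
        this.rowsPerRelease = rowsPerRelease;
    }

    /** Writers block here rather than on a hard out-of-memory condition. */
    public void acquireForWrite(int bytes) throws InterruptedException
    {
        memoryPermits.acquire(bytes);
    }

    /** Called by the flush writer after each row it writes out. */
    public void onRowFlushed()
    {
        if (++rowsSinceRelease >= rowsPerRelease)
        {
            // Credit writers with the estimated weight of what was flushed,
            // even though that memory stays committed until the flush ends.
            memoryPermits.release((int) (rowsPerRelease * bytesPerRow));
            rowsSinceRelease = 0;
        }
    }
}
{code}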
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189583#comment-15189583 ]

Benedict commented on CASSANDRA-11327:
---
Perhaps you should outline precisely the algorithm you propose, since there's a whole class of similar algorithms and it would narrow the discussion?

But if you are reducing the total memory available for memtables, that must by definition increase latency for those writes that would have been fully accommodated by the full buffer capacity (and no longer can be, due to the artificial reduction). The only way this does not affect latency is when the cluster is overloaded - which admittedly all of our performance tests induce, despite this being completely not what Cassandra is designed for.

Memtables are there to smooth out the natural variance in the message arrival distribution. A properly tuned cluster would ensure that overload occurs only at some SLA frequency, say a 3-sigma chance. By reducing their size, transient overload becomes more frequent, and SLAs are not met or the cluster capacity must be increased.

Now, a Cassandra cluster simply _cannot_ cope with sustained overload, no matter what we do here; LSMTs seal our fate very rapidly in that situation. So I don't personally see the rationale for making transient overload (Cassandra's strong suit) worse, in exchange for a really temporary reprieve on sustained overload.

bq. I wasn't aware the partially off heap and off heap memtables were able to reclaim memory incrementally during flushing.

They aren't, but the patch I linked introduced this against a pre-2.1 branch. It wasn't exactly trivial to do, though (it introduced a constrained pauseless compacting GC), and it is probably better to wait until TPC to think about reattempting this.
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189372#comment-15189372 ]

Ariel Weisberg commented on CASSANDRA-11327:
---
Benedict, I don't follow how it adds latency? The threads are already blocked on the lack of memory. What it could be construed to do is reduce the total memory available for memtables, since it's faking it via memory accounting instead of actually reclaiming memory. During saturating load all available memtable memory will be filled pretty quickly, and then it will stay that way forever.

From the perspective of the user, a sawtooth that doesn't go to zero is better than a sawtooth that goes to zero for extended periods. If you are saying we should actually reclaim the memory instead of doing it via accounting, well, yeah, I agree. I wasn't aware the partially off heap and off heap memtables were able to reclaim memory incrementally during flushing.
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188351#comment-15188351 ]

Benedict commented on CASSANDRA-11327:
---
No; they're about actually freeing the memory.

The point of memtables is that they completely mask latency until you exceed write throughput by total system buffer capacity. The idea being that the cluster should always be provisioned above that level, since it's for real-time service provision. Any rate limit of the kind you describe would artificially introduce latency at all other times, i.e. when a healthy cluster would have none.

Certainly there are schemes that are better than others, such as calculating the difference between allocation rate and flush rate and applying a rate limit when one exceeds the other, by an amount inversely proportional to the amount of free space (i.e. so that the latency adulteration only occurs as you approach overload). Actually reclaiming space as flush progresses has the advantage of introducing latency only when absolutely necessary, but also ensures progress to queries at the disk throughput limit of the cluster.
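For illustration, a minimal sketch of such a scheme, assuming Guava's {{RateLimiter}} (Guava is already a Cassandra dependency); the names and the exact scaling function are hypothetical:

{code:java}
// Hedged sketch of the adaptive rate-limit scheme above. The names and the
// scaling function are hypothetical; Guava's RateLimiter is assumed since
// Cassandra already depends on Guava.
import com.google.common.util.concurrent.RateLimiter;

public final class AdaptiveWriteLimiter
{
    // Effectively unthrottled until overload is approached
    private final RateLimiter limiter = RateLimiter.create(1_000_000_000d);

    /**
     * Re-tune from measured rates: when allocation outpaces flushing, cap
     * allocations near the flush rate, tightening as free space shrinks so
     * latency is introduced only as overload is approached.
     */
    public void retune(double allocBytesPerSec, double flushBytesPerSec,
                       long freeBytes, long totalBytes)
    {
        if (allocBytesPerSec <= flushBytesPerSec)
        {
            limiter.setRate(1_000_000_000d); // healthy: no artificial latency
            return;
        }
        double freeFraction = (double) freeBytes / totalBytes;
        // Allow overshoot while plenty of buffer remains; converge on the
        // flush rate as the buffer empties.
        limiter.setRate(flushBytesPerSec * (1 + freeFraction));
    }

    /** Writers call this per allocation; blocks only when overloaded. */
    public void acquire(int bytes)
    {
        limiter.acquire(bytes);
    }
}
{code}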
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188316#comment-15188316 ]

Ariel Weisberg commented on CASSANDRA-11327:
---
I'm not proposing reclaiming the space. I am just proposing easing backpressure as flushing progresses. It's an accounting change. The memory will still be fully committed until the memtable is completely flushed. Or is that an idea discussed in those threads? They all seem a bit orthogonal and are also focused on changing the data structures.
[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory
[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188309#comment-15188309 ]

Benedict commented on CASSANDRA-11327:
---
That line not only already supports it, it was intended to do this at the point of writing. In fact I'm not sure why it isn't already. The call to {{parent.hasRoom().register()}} can simply be provided a {{TimerContext}}, i.e. {{parent.hasRoom().register(TimerContext)}}.

As regards the incremental release of memory, you're about two years late to that party - see the abandoned (but fully functioning at the time) branch [here|https://github.com/belliottsmith/cassandra/tree/6843-offheap.gc]. See the related discussion on CASSANDRA-6689, CASSANDRA-6694 and CASSANDRA-6843. Ultimately it was unpalatable to the project. Possibly with thread-per-core a more palatable approach will be viable.
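A hedged sketch of what that instrumentation could look like against the Dropwizard metrics API (where the older Yammer {{TimerContext}} became {{Timer.Context}}); the surrounding names are simplified, not the actual allocator code:

{code:java}
// Hedged sketch of timing the blocked wait with a metrics timer, as suggested
// above. Dropwizard's Timer.Context is assumed (the older Yammer library
// called it TimerContext); the surrounding names are simplified.
import com.codahale.metrics.Timer;

public final class BlockedAllocationTiming
{
    private final Timer blockedOnAllocation = new Timer();

    /** Wraps a blocking wait for memtable memory, recording time spent blocked. */
    public void awaitRoom(Runnable blockingWait)
    {
        // time() opens the context; stop() records the elapsed duration into
        // the timer's histogram of blocked spans.
        final Timer.Context ctx = blockedOnAllocation.time();
        try
        {
            blockingWait.run();
        }
        finally
        {
            ctx.stop();
        }
    }
}
{code}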