[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190841#comment-15190841 ]
Benedict commented on CASSANDRA-11327:
--------------------------------------

bq. When timeouts do occur, don't those also introduce additional workload amplification in the form of retries, hinted handoff, and repair?

I already agree that this change would help in situations of sustained overload, beyond the level provided for by the cluster capacity. My concern is only that, to achieve this, it lowers the level at which that overload occurs (and increases the frequency of occurrence). Perhaps I was indeed being too absolutist, though, as it is sort of ridiculous how badly we cope with bulk loading.

Still, we have to be careful here, at least as far as defaults are concerned, as any change harms SLAs for existing clusters - although admittedly the increase in usable memtable space in each release is helping, so anyone upgrading from the 2.0 era will have plenty of headroom to absorb behaviour like this (as will users of 2.1 and 2.2 for many workloads).

There are lots of issues around timeouts and backpressure, and it's worth noting that this change by itself is probably not sufficient. The concerns are:

* We depend on TCP back pressure to the clients, but we test with only one client. The TCP back pressure mechanism is massively undercut when there are many clients: with enough send and receive buffers, there will be too many messages already in flight to accommodate, no matter what we do.
* Most clients are fully asynchronous by default, meaning that if they can exceed the rate at which the cluster can absorb work, there is little that back pressure can do anyway.
* The commit log is often as much (or more) of a cause of these spikes when it becomes saturated; making it permit trickle progress while recovering from overload is essential for this change to have any impact.

Assuming each of these is sufficiently mitigated, the goal of simply ensuring _some_ progress is made to prevent client timeouts should be achievable with much less than 50% of memory allocated to this contingency. I would say that, at the very least, the amount should be configurable. Say, by default, 10% of memory is kept as contingency, so that for every 9 bytes flushed 1 byte is made available; this permits a trickle of acknowledgements to prevent timeouts while minimising harm to SLAs. Or 35%, but with logarithmic scaling, so that the first bytes flushed free more bytes than the last, and latency is only gradually introduced to cope with overload.
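To make that accounting concrete, here is a minimal sketch of how such a contingency release could work, under the 10%-linear and logarithmic policies described above. It is illustrative only: the class and method names are invented, and this is not actual Cassandra code.

{code:java}
/**
 * Sketch only, not Cassandra code; all names are hypothetical. Keeps a
 * configurable fraction of memtable memory in reserve and releases it to
 * blocked writers in proportion to flush progress.
 */
public final class ContingencyAccounting
{
    private final long flushableBytes;   // pool size minus the contingency reserve
    private final long contingencyBytes; // e.g. 10% of the pool
    private final boolean logarithmic;   // linear vs. front-loaded release

    private long flushedBytes;  // bytes reclaimed so far by in-progress flushes
    private long releasedBytes; // contingency bytes already handed back out

    public ContingencyAccounting(long poolBytes, double contingencyFraction, boolean logarithmic)
    {
        this.contingencyBytes = (long) (poolBytes * contingencyFraction);
        this.flushableBytes = poolBytes - contingencyBytes;
        this.logarithmic = logarithmic;
    }

    /**
     * Called as a flush makes progress; returns how many contingency bytes may
     * now be granted to writers blocked waiting for memtable memory.
     */
    public synchronized long onBytesFlushed(long bytes)
    {
        flushedBytes = Math.min(flushableBytes, flushedBytes + bytes);
        double progress = (double) flushedBytes / flushableBytes;

        // Linear: with a 10% reserve, every 9 bytes flushed frees 1 byte.
        // Logarithmic: the first bytes flushed free more than the last, so
        // latency is introduced only gradually as overload persists.
        double fraction = logarithmic
                        ? Math.log1p(9 * progress) / Math.log1p(9)
                        : progress;

        long target = (long) (fraction * contingencyBytes);
        long release = target - releasedBytes;
        releasedBytes = target;
        return release;
    }
}
{code}

With a 100-byte pool and a 10% linear reserve, flushing 9 bytes releases 1 byte to blocked writers, matching the ratio above; the logarithmic variant releases the same total but front-loads it.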
> Maintain a histogram of times when writes are blocked due to no available memory
> --------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11327
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11327
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Ariel Weisberg
>
> I have a theory that part of the reason C* is so sensitive to timeouts during saturating write load is that throughput is basically a sawtooth with valleys at zero. This is something I have observed, and it gets worse as you add 2i to a table or do anything that decreases the throughput of flushing.
> I think the fix for this is to incrementally release memory pinned by memtables and 2i during flushing instead of releasing it all at once. I know that's not really possible, but we can fake it with memory accounting that tracks how close to completion flushing is and releases permits for additional memory. This will lead to a bit of a sawtooth in real memory usage, but we can account for that so the peak footprint is the same.
> I think the end result of this change will be a sawtooth, but the valley of the sawtooth will not be zero; it will be the rate at which flushing progresses. Optimizing the rate at which flushing progresses and its fairness with other work can then be tackled separately.
> Before we do this I think we should demonstrate that pinned memory due to flushing is actually the issue, by getting better visibility into the distribution of instances of not having any memory: maintain a histogram of spans of time where no memory is available and a thread is blocked. [MemtableAllocator$SubPool.allocate(long)|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/utils/memory/MemtableAllocator.java#L186] should be a relatively straightforward entry point for this. The first thread to block can mark the start of memory starvation and the last thread out can mark the end. Have a periodic task that tracks the amount of time spent blocked per interval of time, and if it is greater than some threshold, log with more details, possibly at debug.
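As a companion illustration, here is a minimal sketch of the blocked-span tracking the description proposes, assuming hooks placed around the blocking wait in MemtableAllocator$SubPool.allocate. All names are hypothetical and not part of Cassandra.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch only; names are invented. Tracks spans during which at least one
 * writer is blocked for lack of memtable memory: the first thread to block
 * opens the span, the last thread out closes it, as the description proposes.
 */
public final class MemoryStarvationTracker
{
    private final AtomicInteger blockedThreads = new AtomicInteger();
    private final AtomicLong spanStartNanos = new AtomicLong();
    private final AtomicLong blockedNanosThisInterval = new AtomicLong();

    /** Call just before blocking, e.g. in MemtableAllocator$SubPool.allocate. */
    public void onBlock()
    {
        if (blockedThreads.getAndIncrement() == 0)
            spanStartNanos.set(System.nanoTime());
    }

    /** Call once memory has been obtained and the thread unblocks. */
    public void onUnblock()
    {
        if (blockedThreads.decrementAndGet() == 0)
            blockedNanosThisInterval.addAndGet(System.nanoTime() - spanStartNanos.get());
    }

    /** For a periodic task: time spent starved this interval, then reset. */
    public long harvestBlockedNanos()
    {
        return blockedNanosThisInterval.getAndSet(0);
    }
}
{code}

A periodic task would call harvestBlockedNanos(), record the value in a histogram, and log details (possibly at debug) whenever the interval's blocked time exceeds some threshold; in this simple sketch a span straddling an interval boundary is attributed to the interval in which it ends.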