[ https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190841#comment-15190841 ]
Benedict commented on CASSANDRA-11327:
--------------------------------------

bq. When timeouts do occur, don't those also introduce additional workload amplification in the form of retries, hinted handoff, and repair?

I already agree that this change would help in situations of sustained overload, beyond the level provided for by the cluster capacity. My concern is only that, to achieve this, it lowers the level at which that overload occurs (and increases the frequency of occurrence). Perhaps I was indeed being too absolutist, though, as it is sort of ridiculous how badly we cope with bulk loading.

Still, we have to be careful here, at least as far as defaults are concerned, as any change harms SLAs for existing clusters - although admittedly the increase in usable memtable space in each release is helping, so anyone upgrading from the 2.0 era will have plenty of headroom to absorb behaviour like this (as will users of 2.1 and 2.2 for many workloads).

There are lots of issues around timeouts and backpressure, and it's worth noting that this change by itself is probably not sufficient. The concerns are:

* We depend on TCP back pressure to the clients, but we test with only one client. The TCP back pressure mechanism is massively undercut when there are many clients: with enough send and receive buffers, there will be too many messages already in flight to accommodate, no matter what we do.
* Most clients are fully asynchronous by default, meaning that if they can exceed the rate at which the cluster can absorb work, there is little that back pressure can do anyway.
* The commit log is often as much (or more) of a cause of these spikes when it becomes saturated; making it permit trickle progress while recovering from overload is essential for this change to have any impact.

Assuming each of these is sufficiently mitigated, the goal of simply ensuring _some_ progress is made to prevent client timeouts should be achievable with much less than 50% of memory allocated to this contingency. I would say that, at the very least, the amount should be configurable. Say, by default, 10% of memory is kept as contingency, so that for every 9 bytes flushed 1 byte is made available; this permits a trickle of acknowledgements to prevent timeouts while minimising harm to SLAs. Or 35%, but with logarithmic scaling, so that the first bytes flushed free more bytes than the last, and latency is only gradually introduced to cope with overload.
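To make that accounting concrete, here is a minimal sketch of how such a contingency release could work, under the 10%-linear and logarithmic policies described above. It is illustrative only: the class and method names are invented, and this is not actual Cassandra code.

{code:java}
/**
 * Sketch only, not Cassandra code; all names are hypothetical. Keeps a
 * configurable fraction of memtable memory in reserve and releases it to
 * blocked writers in proportion to flush progress.
 */
public final class ContingencyAccounting
{
    private final long flushableBytes;   // pool size minus the contingency reserve
    private final long contingencyBytes; // e.g. 10% of the pool
    private final boolean logarithmic;   // linear vs. front-loaded release

    private long flushedBytes;  // bytes reclaimed so far by in-progress flushes
    private long releasedBytes; // contingency bytes already handed back out

    public ContingencyAccounting(long poolBytes, double contingencyFraction, boolean logarithmic)
    {
        this.contingencyBytes = (long) (poolBytes * contingencyFraction);
        this.flushableBytes = poolBytes - contingencyBytes;
        this.logarithmic = logarithmic;
    }

    /**
     * Called as a flush makes progress; returns how many contingency bytes may
     * now be granted to writers blocked waiting for memtable memory.
     */
    public synchronized long onBytesFlushed(long bytes)
    {
        flushedBytes = Math.min(flushableBytes, flushedBytes + bytes);
        double progress = (double) flushedBytes / flushableBytes;

        // Linear: with a 10% reserve, every 9 bytes flushed frees 1 byte.
        // Logarithmic: the first bytes flushed free more than the last, so
        // latency is introduced only gradually as overload persists.
        double fraction = logarithmic
                        ? Math.log1p(9 * progress) / Math.log1p(9)
                        : progress;

        long target = (long) (fraction * contingencyBytes);
        long release = target - releasedBytes;
        releasedBytes = target;
        return release;
    }
}
{code}

With a 100-byte pool and a 10% linear reserve, flushing 9 bytes releases 1 byte to blocked writers, matching the ratio above; the logarithmic variant releases the same total but front-loads it.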
> Maintain a histogram of times when writes are blocked due to no available memory
> --------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11327
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11327
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Ariel Weisberg
>
> I have a theory that part of the reason C* is so sensitive to timeouts during saturating write load is that throughput is basically a sawtooth with valleys at zero. This is something I have observed, and it gets worse as you add 2i to a table or do anything that decreases the throughput of flushing.
> I think the fix for this is to incrementally release memory pinned by memtables and 2i during flushing instead of releasing it all at once. I know that's not really possible, but we can fake it with memory accounting that tracks how close to completion flushing is and releases permits for additional memory. This will lead to a bit of a sawtooth in real memory usage, but we can account for that so the peak footprint is the same.
> I think the end result of this change will be a sawtooth, but the valley of the sawtooth will not be zero; it will be the rate at which flushing progresses. Optimizing the rate at which flushing progresses and its fairness with other work can then be tackled separately.
> Before we do this I think we should demonstrate that pinned memory due to flushing is actually the issue, by getting better visibility into the distribution of instances of not having any memory: maintain a histogram of spans of time where no memory is available and a thread is blocked. [MemtableAllocator$SubPool.allocate(long)|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/utils/memory/MemtableAllocator.java#L186] should be a relatively straightforward entry point for this. The first thread to block can mark the start of memory starvation and the last thread out can mark the end. Have a periodic task that tracks the amount of time spent blocked per interval of time, and if it is greater than some threshold, log with more details, possibly at debug.
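As a companion illustration, here is a minimal sketch of the blocked-span tracking the description proposes, assuming hooks placed around the blocking wait in MemtableAllocator$SubPool.allocate. All names are hypothetical and not part of Cassandra.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch only; names are invented. Tracks spans during which at least one
 * writer is blocked for lack of memtable memory: the first thread to block
 * opens the span, the last thread out closes it, as the description proposes.
 */
public final class MemoryStarvationTracker
{
    private final AtomicInteger blockedThreads = new AtomicInteger();
    private final AtomicLong spanStartNanos = new AtomicLong();
    private final AtomicLong blockedNanosThisInterval = new AtomicLong();

    /** Call just before blocking, e.g. in MemtableAllocator$SubPool.allocate. */
    public void onBlock()
    {
        if (blockedThreads.getAndIncrement() == 0)
            spanStartNanos.set(System.nanoTime());
    }

    /** Call once memory has been obtained and the thread unblocks. */
    public void onUnblock()
    {
        if (blockedThreads.decrementAndGet() == 0)
            blockedNanosThisInterval.addAndGet(System.nanoTime() - spanStartNanos.get());
    }

    /** For a periodic task: time spent starved this interval, then reset. */
    public long harvestBlockedNanos()
    {
        return blockedNanosThisInterval.getAndSet(0);
    }
}
{code}

A periodic task would call harvestBlockedNanos(), record the value in a histogram, and log details (possibly at debug) whenever the interval's blocked time exceeds some threshold; in this simple sketch a span straddling an interval boundary is attributed to the interval in which it ends.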