[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-06-17 Thread Chris Lohfink (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336749#comment-15336749
 ] 

Chris Lohfink commented on CASSANDRA-11327:
---

I looked at it a few days ago; seeing the metric name change now, I'm +1

> Maintain a histogram of times when writes are blocked due to no available 
> memory
> 
>
> Key: CASSANDRA-11327
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11327
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Core
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
>
> I have a theory that part of the reason C* is so sensitive to timeouts during 
> saturating write load is that throughput is basically a sawtooth with valleys 
> at zero. This is something I have observed and it gets worse as you add 2i to 
> a table or do anything that decreases the throughput of flushing.
> I think the fix for this is to incrementally release memory pinned by 
> memtables and 2i during flushing instead of releasing it all at once. I know 
> that's not really possible, but we can fake it with memory accounting that 
> tracks how close to completion flushing is and releases permits for 
> additional memory. This will lead to a bit of a sawtooth in real memory 
> usage, but we can account for that so the peak footprint is the same.
> I think the end result of this change will be a sawtooth, but the valley of 
> the sawtooth will not be zero; it will be the rate at which flushing 
> progresses. Optimizing the rate at which flushing progresses and its 
> fairness with other work can then be tackled separately.
> Before we do this I think we should demonstrate that pinned memory due to 
> flushing is actually the issue by getting better visibility into the 
> distribution of instances of not having any memory by maintaining a histogram 
> of spans of time where no memory is available and a thread is blocked.
> [MemtableAllocator$SubPool.allocate(long)|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/utils/memory/MemtableAllocator.java#L186]
>  should be a relatively straightforward entry point for this. The first 
> thread to block can mark the start of memory starvation and the last thread 
> out can mark the end. Have a periodic task that tracks the amount of time 
> spent blocked per interval of time and if it is greater than some threshold 
> log with more details, possibly at debug.
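As a rough sketch of the tracking described above (the entry point in {{MemtableAllocator$SubPool.allocate(long)}} comes from the description; the class, field, and method names below are purely illustrative):

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

import com.codahale.metrics.ExponentiallyDecayingReservoir;
import com.codahale.metrics.Histogram;

// Sketch only: tracks spans of time during which at least one writer is
// blocked because no memtable memory is available.
public class MemoryStarvationTracker
{
    private final AtomicInteger blockedThreads = new AtomicInteger();
    private final AtomicLong spanStartNanos = new AtomicLong();
    private final AtomicLong blockedNanosTotal = new AtomicLong();
    // Histogram of starvation span lengths, in microseconds
    private final Histogram starvationSpans = new Histogram(new ExponentiallyDecayingReservoir());

    // Called just before a thread parks in allocate() waiting for memory
    public void onBlock()
    {
        if (blockedThreads.getAndIncrement() == 0)
            spanStartNanos.set(System.nanoTime()); // first thread in marks the start of starvation
    }

    // Called once the thread has obtained its allocation
    public void onUnblock()
    {
        if (blockedThreads.decrementAndGet() == 0)
        {
            long span = System.nanoTime() - spanStartNanos.get(); // last thread out marks the end
            starvationSpans.update(TimeUnit.NANOSECONDS.toMicros(span));
            blockedNanosTotal.addAndGet(span);
        }
    }

    // Read by a periodic task: the delta between two reads is the blocked time in
    // that interval, which can be logged (e.g. at debug) when it exceeds a threshold.
    public long totalBlockedNanos()
    {
        return blockedNanosTotal.get();
    }
}
{code}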



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-06-17 Thread Joshua McKenzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336276#comment-15336276
 ] 

Joshua McKenzie commented on CASSANDRA-11327:
-

[~cnlwsu]: you good taking review on this?



[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-06-13 Thread Chris Lohfink (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328832#comment-15328832
 ] 

Chris Lohfink commented on CASSANDRA-11327:
---

Minor nitpick/bikeshedding: all other metric names are camelCase; having one 
with spaces may mess up some tools (i.e. command-line JMX readers).



[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-06-13 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328789#comment-15328789
 ] 

Ariel Weisberg commented on CASSANDRA-11327:


||Code|utests|dtests||
|[3.0 code|https://github.com/apache/cassandra/compare/cassandra-3.0...aweisberg:CASSANDRA-11327-3.0?expand=1]|[utests|https://cassci.datastax.com/view/Dev/view/aweisberg/job/aweisberg-CASSANDRA-11327-3.0-testall/]|[dtests|http://cassci.datastax.com/view/Dev/view/aweisberg/job/aweisberg-CASSANDRA-11327-3.0-dtest/]|
|[trunk code|https://github.com/apache/cassandra/compare/trunk...aweisberg:CASSANDRA-11327-trunk?expand=1]|[utests|https://cassci.datastax.com/view/Dev/view/aweisberg/job/aweisberg-CASSANDRA-11327-trunk-testall/]|[dtests|http://cassci.datastax.com/view/Dev/view/aweisberg/job/aweisberg-CASSANDRA-11327-trunk-dtest/]|



[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-03-13 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15192380#comment-15192380
 ] 

Ariel Weisberg commented on CASSANDRA-11327:


I think that even if we only provided a mechanism, and the default policy was to 
maximize memtable utilization before applying backpressure, that would be a big 
improvement. We could then get feedback on what behavior people prefer.

It seems like the commit log should not be any more of a bottleneck than it is 
now. If the CL was able to go fast enough to fill up the memtables, then it 
should have enough capacity to do that indefinitely, since it doesn't really 
defer work like flushing or compaction.

Yes it's off topic, but I'll make sure it's copied over to an implementation 
ticket if we get there.



[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-03-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190843#comment-15190843
 ] 

Benedict commented on CASSANDRA-11327:
--

This is all of course massively off topic.  This particular ticket could have 
been closed in a tiny fraction of the time of this discussion :)



[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-03-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190841#comment-15190841
 ] 

Benedict commented on CASSANDRA-11327:
--

bq. When timeouts do occur don't those also introduce additional workload 
amplification in the form of retries, hinted handoff, and repair?

I already agree that this change would help in situations of sustained 
overload, above the level the cluster capacity provides for. My point is just 
that it lowers the level at which that overload occurs in order to achieve this 
(and increases the frequency of occurrence).

Perhaps I was indeed being too absolutist though, as it is sort of ridiculous 
how badly we cope with bulk loading.  Still, we have to be careful here at 
least as far as defaults are concerned, as any change harms SLAs for existing 
clusters - although admittedly the increase in usable memtable space in each 
release is helping clusters, so that anyone upgrading from the 2.0 era will 
have plenty of headroom to introduce some behaviour like this (as will 2.1 and 
2.2 for many workloads).

There are lots of issues around timeouts and backpressure, and it's worth 
noting that this by itself is probably not sufficient. There are the following 
concerns:
* We depend on TCP back pressure to the clients, but we test with only one 
client; the TCP backpressure mechanism is massively undercut when there are 
many such clients, as with enough send and receive buffers there will be too 
many messages already in flight to accommodate no matter what we do
* Most clients are by default fully asynchronous, meaning that if they can exceed 
the rate of work provision to the cluster, there's little back pressure can do 
anyway
* The commit log is often as much of (or more of) a cause of these spikes when it 
becomes saturated; making it permit trickle progress when recovering from overload 
is essential for this change to have any impact

Assuming each of these is sufficiently mitigated, the goal of simply ensuring 
_some_ progress is made to prevent client timeouts should presumably be 
possible with much lower requirements than 50% of memory allocated to this 
contingency.  I would say that, at the very least, how much should be 
configurable.  Say, by default, 10% of memory is kept as contingency, so that for 
every 9 bytes flushed 1 byte is made available.  This permits a trickle of 
acknowledgements to prevent timeouts, while minimising harm to SLAs.  Or 35%, 
but with logarithmic scaling, so that the first bytes written provide more 
free bytes than the last, and latency is only gradually introduced to cope 
with overload.
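Purely as an illustration of those two shapes (the method and the particular logarithmic curve here are assumptions; only the 10%/35% figures come from the comment above):

{code:java}
// Illustration only. "progress" is the fraction of the flushing memtable's
// memory already written out, in [0, 1]; the return value is how much of the
// held-back contingency has been made available to writers so far.
public class ContingencyPolicy
{
    static long contingencyReleased(double progress, long contingencyBytes, boolean logarithmic)
    {
        if (!logarithmic)
            return (long) (contingencyBytes * progress); // e.g. 10% held back: 1 byte freed per 9 bytes flushed

        double k = 9.0;                                             // arbitrary curvature; larger k frees more early on
        double fraction = Math.log1p(k * progress) / Math.log1p(k); // concave: early flush progress frees more than late
        return (long) (contingencyBytes * fraction);
    }
}
{code}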



[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-03-10 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189628#comment-15189628
 ] 

Ariel Weisberg commented on CASSANDRA-11327:


bq. Perhaps you should outline precisely the algorithm you propose, since 
there's a whole class of similar algorithms and it would narrow the discussion?

There is probably some tuning that could be done to make this smarter, but 
basically: if right now 1/4 of the heap is the memtable memory limit, change it 
to 1/8th (cut it in half). Let's ignore 2i and look at just a memtable flushing. 
Say we know the expected on-disk size as well as the number of partitions or 
rows, so we can guess at the average weight of each partition or row. Every N 
partitions or rows we can update the amount of free memory to reflect the weight 
of what was flushed. Or we could be more precise, if tracking the weight of what 
is flushed isn't difficult.

Peak footprint remains the same since we have cut the limit in half, but actual 
footprint will vary between the limit and double the limit, as flushing releases 
memory to writers while the memory is still committed.
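A minimal sketch of that accounting (all names here are illustrative, not from an actual patch):

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: nothing is actually reclaimed; an estimate of the flushed rows'
// live weight is credited back to writers against the halved limit, while the
// memory itself stays committed until the flush completes.
public class FlushProgressAccounting
{
    private final AtomicLong availableForWrites; // shared pool writers allocate from (already halved)
    private final long memtableLiveBytes;        // live bytes pinned by the flushing memtable
    private final long estimatedBytesPerRow;     // rough per-row weight: live bytes / row count
    private long rowsFlushed;
    private long creditedBytes;

    public FlushProgressAccounting(AtomicLong availableForWrites, long memtableLiveBytes, long rowCount)
    {
        this.availableForWrites = availableForWrites;
        this.memtableLiveBytes = memtableLiveBytes;
        this.estimatedBytesPerRow = Math.max(1, memtableLiveBytes / Math.max(1, rowCount));
    }

    // Called by the flush writer every N rows; credits the estimated weight back to writers.
    public synchronized void onRowsFlushed(long rows)
    {
        rowsFlushed += rows;
        long target = Math.min(memtableLiveBytes, rowsFlushed * estimatedBytesPerRow);
        long delta = target - creditedBytes;
        if (delta > 0)
        {
            creditedBytes = target;
            availableForWrites.addAndGet(delta);
        }
    }
}
{code}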

bq. By reducing their size, transient overload becomes more frequent, and SLAs 
are not met or the cluster capacity must be increased.
I agree this is the biggest problem. I think you are right that, in terms of 
dealing with variance, in the worst case it reduces memory utilization by half, 
but in the average or real case maybe it's not so bad? Maybe flushing isn't 
super far behind, just a little behind?

bq. So I don't personally see the rationale for making transient overload 
(Cassandra's strong suit) worse, in exchange for a really temporary reprieve on 
sustained overload.
I don't think we should dismiss this out of hand. I think there are users who 
do care about saturating load and who care about the difficulty of determining 
exactly how fast they can write to the database. Spark and bulk loading are 
both pain points. Right now it's very difficult because the database doesn't 
provide any notice that you are about to saturate; you just start getting mass 
timeouts instead of backpressure.

When timeouts do occur, don't those also introduce additional workload 
amplification in the form of retries, hinted handoff, and repair? I am not 
completely sold that this kind of thing would cripple the ability of memtables 
to handle variance in the arrival distribution. It certainly reduces the window 
and magnitude of variance that can be tolerated, but for capacity planning 
purposes peak throughput isn't the only factor.


[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-03-10 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189583#comment-15189583
 ] 

Benedict commented on CASSANDRA-11327:
--

Perhaps you should outline precisely the algorithm you propose, since there's a 
whole class of similar algorithms and it would narrow the discussion?

But reducing the total memory available for memtables must by definition 
increase latency for those writes that would have been fully accommodated by 
the full buffer capacity (and no longer can be, due to the artificial 
reduction).  The only way this does not affect latency is when the cluster is 
overloaded - which admittedly all of our performance tests induce, despite that 
being completely not what Cassandra is designed for.

Memtables are there to smooth out the natural variance in the message arrival 
distribution.  A properly tuned cluster would ensure that overload occurs only 
at some SLA frequency, say a 3 sigma chance.  By reducing their size, transient 
overload becomes more frequent, and SLAs are not met or the cluster capacity 
must be increased.  Now, a Cassandra cluster simply _cannot_ cope with 
sustained overload, no matter what we do here; LSMTs seal our fate very rapidly 
in that situation.  So I don't personally see the rationale for making 
transient overload (Cassandra's strong suit) worse, in exchange for a really 
temporary reprieve on sustained overload.

bq. I wasn't aware the partially off heap and off heap memtables were able to 
reclaim memory incrementally during flushing.

They aren't, but the patch I linked introduced this against a pre-2.1 branch.  
It wasn't exactly trivial to do, though (it introduced a constrained pauseless 
compacting GC), and it is probably better to wait until TPC to think about 
reattempting this.



[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-03-10 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189372#comment-15189372
 ] 

Ariel Weisberg commented on CASSANDRA-11327:


Benedict, I don't follow how it adds latency? The threads are already blocked on 
the lack of memory. What it could be construed to do is reduce the total memory 
available for memtables, since it's faking it via memory accounting instead of 
actually reclaiming memory. During saturating load all available memtable 
memory will be filled pretty quickly and then it will stay that way forever.

From the perspective of the user, a sawtooth that doesn't go to zero is better 
than a sawtooth that goes to zero for extended periods. If you are saying we 
should actually reclaim the memory instead of doing it via accounting, well, 
yeah, I agree. I wasn't aware the partially off-heap and off-heap memtables 
were able to reclaim memory incrementally during flushing.



[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-03-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188351#comment-15188351
 ] 

Benedict commented on CASSANDRA-11327:
--

No; they're about actually freeing the memory.  

The point of memtables is that they completely mask latency until you exceed 
write throughput by total system buffer capacity.  The idea being that the 
cluster should always be provisioned above that level, since it's for real-time 
service provision.  Any rate limit of the kind you describe would artificially 
introduce latency at all other times, i.e. when a healthy cluster would have 
none.  

Certainly some schemes are better than others, such as calculating the 
difference between the allocation rate and the flush rate and applying a rate 
limit when the former exceeds the latter, by an amount inversely proportional 
to the amount of free space (i.e. so that the latency adulteration only occurs 
as you approach overload).

Actually reclaiming space as flush progresses has the advantage of introducing 
latency only when absolutely necessary, while also ensuring progress for 
queries at the disk throughput limit of the cluster.
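A rough sketch of such a rate-limiting scheme, using Guava's {{RateLimiter}} and an invented interpolation purely for illustration (none of the names or the formula are from this thread):

{code:java}
import com.google.common.util.concurrent.RateLimiter;

// Sketch only: while the cluster is healthy no throttle is applied; once
// allocation outpaces flushing, the allowed write rate is pulled toward the
// flush rate, more aggressively as free memtable space shrinks.
public class AdaptiveWriteThrottle
{
    private static final double UNTHROTTLED = 1e12; // effectively unlimited (bytes/sec)

    private final RateLimiter limiter = RateLimiter.create(UNTHROTTLED);

    // Re-evaluated periodically from observed rates (bytes/sec) and pool occupancy.
    public void update(double allocationRate, double flushRate, double freeFraction)
    {
        if (allocationRate <= flushRate)
        {
            limiter.setRate(UNTHROTTLED); // healthy: no artificial latency
            return;
        }
        // Interpolate between the flush rate (pool nearly full) and the allocation
        // rate (pool mostly free), so latency is only introduced approaching overload.
        double rate = flushRate + (allocationRate - flushRate) * freeFraction;
        limiter.setRate(Math.max(1.0, rate));
    }

    // Called on the write path before reserving 'bytes' of memtable space.
    public void acquire(int bytes)
    {
        limiter.acquire(Math.max(1, bytes)); // RateLimiter requires at least one permit
    }
}
{code}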



[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-03-09 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188316#comment-15188316
 ] 

Ariel Weisberg commented on CASSANDRA-11327:


I'm not proposing reclaiming the space. I am just proposing easing backpressure 
as flushing progresses. It's an accounting change. The memory will still be 
fully committed until the memtable is completely flushed. Or is that an idea 
discussed in those threads? They all seem a bit orthogonal and are also focused 
on changing the data structures.



[jira] [Commented] (CASSANDRA-11327) Maintain a histogram of times when writes are blocked due to no available memory

2016-03-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188309#comment-15188309
 ] 

Benedict commented on CASSANDRA-11327:
--

That line not only already supports it, it was intended to do this at the point 
of writing.  In fact I'm not sure why it isn't doing so already.

The call to {{parent.hasRoom().register()}} can simply be provided a 
{{TimerContext}}, i.e. {{parent.hasRoom().register(TimerContext)}}

As regards the incremental release of memory you're about two years late to 
that party - see the abandoned (but fully functioning at the time) branch 
[here|https://github.com/belliottsmith/cassandra/tree/6843-offheap.gc]. See the 
related discussion on CASSANDRA-6689, CASSANDRA-6694 and CASSANDRA-6843.  
Ultimately it was unpalatable to the project.  Possibly with thread-per-core a 
more palatable approach will be viable.
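A sketch of what feeding the blocked time into a Codahale {{Timer}} at that point might look like (the class, field, and method names are hypothetical; whether the {{WaitQueue}} registration accepts the {{Timer.Context}} directly, as suggested above, depends on the branch):

{code:java}
import com.codahale.metrics.Timer;

// Sketch only: times how long a writer spends blocked in the allocate() path.
public class AllocationWait
{
    // In practice this would be exposed through the existing table/keyspace metrics.
    private final Timer blockedOnAllocation = new Timer();

    // Stand-in for the blocking section of SubPool.allocate(long)
    void waitForMemory(Runnable awaitSignal)
    {
        Timer.Context ctx = blockedOnAllocation.time(); // starts timing as the thread is about to block
        try
        {
            awaitSignal.run(); // e.g. await the WaitQueue signal until memory is released
        }
        finally
        {
            ctx.stop(); // records the blocked duration into the timer's histogram
        }
    }
}
{code}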



