[ https://issues.apache.org/jira/browse/CASSANDRA-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328373#comment-15328373 ]

Ariel Weisberg edited comment on CASSANDRA-7937 at 6/13/16 9:51 PM:
--------------------------------------------------------------------

I think we can make this situation better, and I mentioned some ideas at NGCC 
and in CASSANDRA-11327.

There are two issues.

The first is that if flushing falls behind, throughput drops to zero instead of 
degrading to the rate at which flushing progresses, which is usually not zero. 
Right now it looks like zero because flushing doesn't release any memory as it 
progresses; it's all or nothing.

Aleksey mentioned we could do something like early opening for flushing so that 
memory is made available sooner. Alternatively we could overcommit and then 
gradually release memory as flushing progresses.
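
To make the overcommit / gradual release idea concrete, here's a rough sketch of 
the accounting I have in mind. The class names and the MemoryPool interface are 
made up for illustration and are not the actual memtable allocator API:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Stand-in for the memtable memory pool; illustrative only.
interface MemoryPool
{
    void release(long bytes);
}

// Release a memtable's memory incrementally as its flush progresses,
// instead of all at once when the flush completes.
class FlushProgressTracker
{
    private final AtomicLong unreleased;
    private final MemoryPool pool;

    FlushProgressTracker(long memtableSizeBytes, MemoryPool pool)
    {
        this.unreleased = new AtomicLong(memtableSizeBytes);
        this.pool = pool;
    }

    // Called by the flush writer after each partition (or chunk) is written out.
    void onBytesFlushed(long bytes)
    {
        unreleased.addAndGet(-bytes);
        // Hand the freed portion back so new writes can be admitted now,
        // rather than waiting for the whole memtable to finish flushing.
        pool.release(bytes);
    }

    // Release whatever accounting slack remains once the flush finishes.
    void onFlushComplete()
    {
        pool.release(Math.max(unreleased.getAndSet(0), 0));
    }
}
{code}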

The second, and this isn't really related to backpressure, is that flushing 
falls behind in several reasonable configurations. Ingest has gotten faster and 
I don't think flushing has improved as much, so it's easier for flushing to fall 
behind when it's driven by a single thread against a busy device (even a fast 
SSD). I haven't tested this yet, but I suspect that if you use multiple JBOD 
paths on a fast device like an SSD and increase memtable_flush_writers, you will 
get enough additional flushing throughput to keep up with ingest. Right now 
flushing is single threaded for a single path and only one flush can occur at a 
time.
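
To illustrate what I mean, here's roughly what parallel flush submission could 
look like. This is only a sketch; the class is invented and the flush runnables 
are plain Runnable stand-ins:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: run the flush writers for a memtable in parallel on one pool
// sized by memtable_flush_writers, even when there is only a single data directory.
class ParallelFlusher
{
    private final ExecutorService flushWriters;

    ParallelFlusher(int memtableFlushWriters)
    {
        this.flushWriters = Executors.newFixedThreadPool(memtableFlushWriters);
    }

    void flush(List<Runnable> flushRunnables) throws InterruptedException, ExecutionException
    {
        List<Future<?>> futures = new ArrayList<>();
        for (Runnable r : flushRunnables)
            futures.add(flushWriters.submit(r));
        // Only consider the flush complete when every writer has hit disk.
        for (Future<?> f : futures)
            f.get();
    }
}
{code}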

Flushing falling behind is more noticeable when you let compaction have more 
threads and a bigger rate limit. Compaction can then dirty enough memory in the 
filesystem cache that writing it back causes a temporally localized slowdown in 
flushing, and that slowdown is enough to cause timeouts once there is no free 
memtable memory left because flushing didn't finish soon enough.

I think the long term solution is that the further flushing falls behind, the 
more concurrent flush threads we deploy, much like compaction, up to the 
configured limit. [Right now there is a single thread scheduling flushes and 
waiting on the 
result.|https://github.com/apache/cassandra/blob/cassandra-3.7/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1130] 
memtable_flush_writers doesn't help because the code here only [generates more 
flush runnables for a memtable if there are multiple data 
directories|https://github.com/apache/cassandra/blob/cassandra-3.7/src/java/org/apache/cassandra/db/Memtable.java#L278]. 
C* already divvies up the heap using memtable_cleanup_threshold, which would 
allow for concurrent flushing; it's just not actually flushing concurrently.
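
A sketch of the kind of heuristic I mean, with invented names and thresholds 
(the real policy would presumably key off the same accounting that 
memtable_cleanup_threshold uses):

{code:java}
// Invented heuristic: the further flushing falls behind, the more of the configured
// flush writers we actually use, similar in spirit to how compaction ramps up.
class AdaptiveFlushConcurrency
{
    private final int maxFlushWriters; // the configured limit, e.g. memtable_flush_writers

    AdaptiveFlushConcurrency(int maxFlushWriters)
    {
        this.maxFlushWriters = maxFlushWriters;
    }

    int writersToUse(int pendingMemtables, double memtableMemoryUsedRatio)
    {
        // Start with one writer and add one for every couple of queued memtables.
        int byQueue = 1 + pendingMemtables / 2;
        // If memtable memory is nearly exhausted, go straight to the limit.
        int byMemory = memtableMemoryUsedRatio > 0.8 ? maxFlushWriters : 1;
        return Math.min(maxFlushWriters, Math.max(byQueue, byMemory));
    }
}
{code}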


> Apply backpressure gently when overloaded with writes
> -----------------------------------------------------
>
>                 Key: CASSANDRA-7937
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7937
>             Project: Cassandra
>          Issue Type: Improvement
>         Environment: Cassandra 2.0
>            Reporter: Piotr Kołaczkowski
>              Labels: performance
>
> When writing huge amounts of data into a C* cluster from analytic tools like 
> Hadoop or Apache Spark, we can see that C* often can't keep up with the load. 
> This is because analytic tools typically write data "as fast as they can" in 
> parallel, from many nodes, and they are not artificially rate-limited, so C* 
> is the bottleneck here. Also, increasing the number of nodes doesn't really 
> help, because in a co-located setup this also increases the number of 
> Hadoop/Spark nodes (writers), and although the possible write performance is 
> higher, the problem still remains.
> We observe the following behavior:
> 1. data is ingested at an extremely fast pace into memtables and the flush 
> queue fills up
> 2. the available memory limit for memtables is reached and writes are no 
> longer accepted
> 3. the application gets hit by "write timeout" errors and retries repeatedly, 
> in vain
> 4. after several failed attempts to write, the job gets aborted
> Desired behavior:
> 1. data is ingested at an extremely fast pace into memtables and the flush 
> queue fills up
> 2. after exceeding some memtable "fill threshold", C* applies adaptive rate 
> limiting to writes - the more the buffers fill up, the fewer writes/s are 
> accepted; however, writes still complete within the write timeout
> 3. thanks to the slowed-down data ingestion, flushing can now finish before 
> all the memory gets used
> Of course, the details of how rate limiting could be done are up for 
> discussion.
> It may also be worth considering putting such logic into the driver, not the 
> C* core, but then C* needs to expose at least the following information to 
> the driver, so we could calculate the desired maximum data rate:
> 1. the current amount of memory available for writes before they would 
> completely block
> 2. the total amount of data queued to be flushed and the flush progress (the 
> amount of data left to flush for the memtable currently being flushed)
> 3. the average flush write speed
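
As a purely hypothetical sketch of the driver-side arithmetic points 1-3 above 
describe (none of these metrics are exposed today, and the names are invented):

{code:java}
// Hypothetical driver-side calculation: turn the three metrics the description asks
// C* to expose into a suggested maximum ingest rate. Illustrative only.
class AdaptiveWriteRate
{
    /**
     * @param memoryAvailableBytes memory left for writes before they block (point 1)
     * @param bytesQueuedToFlush   data queued and remaining to flush (point 2)
     * @param flushBytesPerSecond  average flush write speed (point 3)
     * @return a suggested client-side write rate in bytes per second
     */
    static double targetBytesPerSecond(long memoryAvailableBytes,
                                       long bytesQueuedToFlush,
                                       double flushBytesPerSecond)
    {
        if (flushBytesPerSecond <= 0)
            return 0; // flushing is stalled; back off entirely
        // How long the currently queued data will take to drain at the observed speed.
        double drainSeconds = bytesQueuedToFlush / flushBytesPerSecond;
        // Spread the remaining headroom over that window (at least one second),
        // on top of writing at roughly the speed flushing can sustain.
        double headroomRate = memoryAvailableBytes / Math.max(drainSeconds, 1.0);
        return flushBytesPerSecond + headroomRate;
    }
}
{code}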


