[ https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750908#comment-15750908 ]
Benjamin Roth edited comment on CASSANDRA-12991 at 12/15/16 9:34 AM:
---------------------------------------------------------------------

I created a little script to calculate some possible scenarios: https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
 * Calculates the likeliness of a race condition leading to unnecessary repairs
 * @see https://issues.apache.org/jira/browse/CASSANDRA-12991
 *
 * This assumes that all writes are equally spread over all token ranges
 * and there is one subrange repair executed for each existing token range

3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 req/s
Total Ranges: 2048
Likeliness for RC per range: 2.93%
Unnecessary range repairs per repair: 60.00

8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 req/s
Total Ranges: 2048
Likeliness for RC per range: 5.37%
Unnecessary range repairs per repair: 110.00
{quote}

You may ask why I entered latencies like 10ms or 20ms - that seems quite high. It is indeed high for regular tables on a cluster that is not overloaded; under those conditions the latency is dominated by network latency, so 1ms seems fair to me. As soon as you use MVs and your cluster tends to overload, higher latencies are not unrealistic. Keep in mind that an MV operation does a read before write, so its latency can vary a lot. For MVs the latency is no longer dominated (only) by network latency but by MV lock acquisition and the read before write. Both factors can introduce MUCH higher latencies, depending on concurrent operations on the MV, the number of SSTables, the compaction strategy - everything that affects read performance. If your cluster is overloaded, these effects have an even bigger impact.

I observed MANY situations on our production system where writes timed out during streaming because of lock contention and/or read-before-write impacts. These situations mainly pop up during repair sessions, when streams cause bulk mutation applies (see the StreamReceiverTask path for MVs). The impact is even higher due to CASSANDRA-12888. Parallel repairs, as e.g. reaper runs them, make the situation even more unpredictable and increase the "drift" between nodes: Node A is overloaded but Node B is not, because Node A receives a stream from a different repair while Node B does not.

This is a vicious circle driven by several factors:
- Streams put pressure on nodes - especially larg(er) partitions
- Hints tend to queue up
- Hint delivery puts more pressure
- Retransmission of failed hint deliveries puts even more pressure
- Latencies go up
- Stream validations drift
- More (unnecessary) streams
- goto 0

This calculation example is purely hypothetical. It *may* play out as calculated, but that depends entirely on the data model, cluster dimensions, cluster load, write activity, distribution of writes and repair execution. I don't claim that fixing this issue will remove all MV performance problems, but it may help to remove one impediment in the vicious circle above. (A small sketch re-deriving the numbers above follows.)
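To make the arithmetic explicit, here is a minimal Java sketch that reproduces the figures in the output above. It is an illustration only, not the gist itself: the assumption that the race window per write equals the mutation latency plus the validation latency on each of the two replicas is mine, chosen because it matches the published numbers.

{code:java}
// Illustration only - re-derives the numbers quoted above under an assumed
// race window of (mutation latency + 2 x validation latency).
public class RepairRaceEstimate
{
    static void estimate(int nodes, int tokensPerNode,
                         double mutationLatencyMs, double validationLatencyMs,
                         double requestsPerSecond)
    {
        int totalRanges = nodes * tokensPerNode;
        // Writes are assumed to be spread evenly over all token ranges.
        double writesPerRangePerMs = requestsPerSecond / totalRanges / 1000.0;
        // Window in which a mutation can reach one replica before its
        // validation starts but the other replica only afterwards (assumption).
        double raceWindowMs = mutationLatencyMs + 2 * validationLatencyMs;
        double raceProbabilityPerRange = raceWindowMs * writesPerRangePerMs;
        double unnecessaryRangeRepairs = raceProbabilityPerRange * totalRanges;
        System.out.printf("Total Ranges: %d%n", totalRanges);
        System.out.printf("Likeliness for RC per range: %.2f%%%n", raceProbabilityPerRange * 100);
        System.out.printf("Unnecessary range repairs per repair: %.2f%n", unnecessaryRangeRepairs);
    }

    public static void main(String[] args)
    {
        estimate(3, 256, 1, 1, 1000);   // -> 0.39%, 3.00
        estimate(8, 256, 20, 1, 5000);  // -> 5.37%, 110.00
    }
}
{code}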
My proposal is NOT to control flushes. That is far too complicated and won't help. A flush, whenever it happens and whatever range it flushes, may or may not contain a mutation that _should_ be there. The only thing that helps is to cut off all data retrospectively at a synchronized, fixed timestamp when executing the validation. You can only define a grace period (GP): when the repair coordinator starts validation at time VS, you expect every mutation created before VS - GP to have arrived by VS. IMHO that can only be done at the SSTable scanner level, by filtering out all events (cells, tombstones) newer than VS - GP during validation compaction - something like the opposite of purging tombstones after GCGS.
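To illustrate the idea (the types below are simplified stand-ins, not the real Cassandra scanner/compaction API), the cut-off during validation compaction could look roughly like this:

{code:java}
// Simplified illustration of the proposed cut-off - NOT actual Cassandra code.
// During validation compaction, every event newer than VS - GP is ignored.
import java.util.List;
import java.util.stream.Collectors;

public class ValidationCutoffSketch
{
    // Stand-in for a cell or tombstone read by the SSTable scanner.
    record Event(String key, long writeTimestampMicros) {}

    /**
     * Keeps only events written before the synchronized snapshot,
     * i.e. before validationStart - gracePeriod, so that all replicas
     * hash the same logical view regardless of in-flight mutations.
     * Conceptually the opposite of purging tombstones after gc_grace_seconds.
     */
    static List<Event> filterForValidation(List<Event> scanned,
                                           long validationStartMicros,
                                           long gracePeriodMicros)
    {
        long cutoff = validationStartMicros - gracePeriodMicros;
        return scanned.stream()
                      .filter(e -> e.writeTimestampMicros() <= cutoff)
                      .collect(Collectors.toList());
    }
}
{code}

Range tombstone markers would need extra handling on top of this: a bound whose deletion time is newer than the cut-off would have to be dropped, and a boundary marker turned into a plain bound or dropped, as described in the issue below.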
> Inter-node race condition in validation compaction
> --------------------------------------------------
>
>                 Key: CASSANDRA-12991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benjamin Roth
>            Priority: Minor
>
> Problem:
> When a validation compaction is triggered by a repair, it may happen that due
> to in-flight mutations the Merkle trees differ even though the data is consistent.
> Example:
> t = 10000:
> Repair starts, triggers validations
> Node A starts validation
> t = 10001:
> Mutation arrives at Node A
> t = 10002:
> Mutation arrives at Node B
> t = 10003:
> Node B starts validation
> Hashes of nodes A and B will differ, but the data is consistent as of t = 10000
> (think of it like a snapshot).
> Impact:
> Unnecessary streaming happens. This may not have a big impact on low-traffic CFs
> or partitions, but on high-traffic CFs and maybe very big partitions it has a
> bigger impact and is a waste of resources.
> Possible solution:
> Build hashes based upon a snapshot timestamp.
> This requires SSTables created after that timestamp to be filtered when doing
> a validation compaction:
> - Cells with timestamp > snapshot time have to be removed
> - Tombstone range markers have to be handled
> - Bounds have to be removed if delete timestamp > snapshot time
> - Boundary markers have to be either changed to a bound or completely
> removed, depending on whether the start and/or the end are affected
> Probably this is a known behaviour. Have there been any discussions about
> this in the past? I did not find a matching issue, so I created this one.
> I am happy about any feedback whatsoever.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)