[ https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750908#comment-15750908 ]
Benjamin Roth edited comment on CASSANDRA-12991 at 12/15/16 9:34 AM:
---------------------------------------------------------------------

I created a little script to calculate some possible scenarios: https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99

Output:
{quote}
 * Calculates the likeliness of a race condition leading to unnecessary repairs
 * @see https://issues.apache.org/jira/browse/CASSANDRA-12991
 *
 * This assumes that all writes are equally spread over all token ranges
 * and there is one subrange repair executed for each existing token range

3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 req/s
Total Ranges: 2048
Likeliness for RC per range: 2.93%
Unnecessary range repairs per repair: 60.00

8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 req/s
Total Ranges: 2048
Likeliness for RC per range: 5.37%
Unnecessary range repairs per repair: 110.00
{quote}

You may ask why I entered latencies like 10ms or 20ms - that seems quite high. It is indeed high for regular tables on a cluster that is not overloaded; under those conditions the latency is dominated by network latency, so 1ms seems fair to me. As soon as you use MVs and your cluster tends to overload, higher latencies are not unrealistic. Keep in mind that an MV operation does a read before write, so its latency can vary a lot. For MVs the latency is no longer dominated (only) by network latency but by MV lock acquisition and the read before write. Both factors can introduce MUCH higher latencies, depending on concurrent operations on the MV, the number of SSTables, the compaction strategy - everything that affects read performance. If your cluster is overloaded, these effects have an even bigger impact.

I observed MANY situations on our production system where writes timed out during streaming because of lock contention and/or read-before-write impacts. These situations mainly pop up during repair sessions, when streams cause bulk mutation applies (see the StreamReceiverTask path for MVs). The impact is even higher due to CASSANDRA-12888. Parallel repairs, as e.g. reaper runs them, make the situation even more unpredictable and increase the "drift" between nodes: Node A is overloaded but Node B is not, because Node A receives a stream from a different repair while Node B does not.

This is a vicious circle driven by several factors:
- Streams put pressure on nodes - especially larg(er) partitions
- Hints tend to queue up
- Hint delivery puts more pressure
- Retransmission of failed hint deliveries puts even more pressure
- Latencies go up
- Stream validations drift
- More (unnecessary) streams
- goto 0

This calculation example is purely hypothetical. It *may* play out as calculated, but that depends entirely on the data model, cluster dimensions, cluster load, write activity, distribution of writes and repair execution. I don't claim that fixing this issue will remove all MV performance problems, but it may help to remove one impediment in the vicious circle above. (A small sketch re-deriving the numbers above follows.)
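To make the arithmetic explicit, here is a minimal Java sketch that reproduces the figures in the output above. It is an illustration only, not the gist itself: the assumption that the race window per write equals the mutation latency plus the validation latency on each of the two replicas is mine, chosen because it matches the published numbers.

{code:java}
// Illustration only - re-derives the numbers quoted above under an assumed
// race window of (mutation latency + 2 x validation latency).
public class RepairRaceEstimate
{
    static void estimate(int nodes, int tokensPerNode,
                         double mutationLatencyMs, double validationLatencyMs,
                         double requestsPerSecond)
    {
        int totalRanges = nodes * tokensPerNode;
        // Writes are assumed to be spread evenly over all token ranges.
        double writesPerRangePerMs = requestsPerSecond / totalRanges / 1000.0;
        // Window in which a mutation can reach one replica before its
        // validation starts but the other replica only afterwards (assumption).
        double raceWindowMs = mutationLatencyMs + 2 * validationLatencyMs;
        double raceProbabilityPerRange = raceWindowMs * writesPerRangePerMs;
        double unnecessaryRangeRepairs = raceProbabilityPerRange * totalRanges;
        System.out.printf("Total Ranges: %d%n", totalRanges);
        System.out.printf("Likeliness for RC per range: %.2f%%%n", raceProbabilityPerRange * 100);
        System.out.printf("Unnecessary range repairs per repair: %.2f%n", unnecessaryRangeRepairs);
    }

    public static void main(String[] args)
    {
        estimate(3, 256, 1, 1, 1000);   // -> 0.39%, 3.00
        estimate(8, 256, 20, 1, 5000);  // -> 5.37%, 110.00
    }
}
{code}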
My proposal is NOT to control flushes. That is far too complicated and won't help. A flush, whenever it happens and whatever range it flushes, may or may not contain a mutation that _should_ be there. The only thing that helps is to cut off all data retrospectively at a synchronized, fixed timestamp when executing the validation. You can only define a grace period (GP): when the repair coordinator starts validation at time VS, you expect every mutation created before VS - GP to have arrived by VS. IMHO that can only be done at the SSTable scanner level, by filtering out all events (cells, tombstones) newer than VS - GP during validation compaction - something like the opposite of purging tombstones after GCGS.
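To illustrate the idea (the types below are simplified stand-ins, not the real Cassandra scanner/compaction API), the cut-off during validation compaction could look roughly like this:

{code:java}
// Simplified illustration of the proposed cut-off - NOT actual Cassandra code.
// During validation compaction, every event newer than VS - GP is ignored.
import java.util.List;
import java.util.stream.Collectors;

public class ValidationCutoffSketch
{
    // Stand-in for a cell or tombstone read by the SSTable scanner.
    record Event(String key, long writeTimestampMicros) {}

    /**
     * Keeps only events written before the synchronized snapshot,
     * i.e. before validationStart - gracePeriod, so that all replicas
     * hash the same logical view regardless of in-flight mutations.
     * Conceptually the opposite of purging tombstones after gc_grace_seconds.
     */
    static List<Event> filterForValidation(List<Event> scanned,
                                           long validationStartMicros,
                                           long gracePeriodMicros)
    {
        long cutoff = validationStartMicros - gracePeriodMicros;
        return scanned.stream()
                      .filter(e -> e.writeTimestampMicros() <= cutoff)
                      .collect(Collectors.toList());
    }
}
{code}

Range tombstone markers would need extra handling on top of this: a bound whose deletion time is newer than the cut-off would have to be dropped, and a boundary marker turned into a plain bound or dropped, as described in the issue below.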
> Inter-node race condition in validation compaction
> --------------------------------------------------
>
>                 Key: CASSANDRA-12991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benjamin Roth
>            Priority: Minor
>
> Problem:
> When a validation compaction is triggered by a repair, it may happen that due
> to in-flight mutations the Merkle trees differ even though the data is consistent.
> Example:
> t = 10000:
> Repair starts, triggers validations
> Node A starts validation
> t = 10001:
> Mutation arrives at Node A
> t = 10002:
> Mutation arrives at Node B
> t = 10003:
> Node B starts validation
> Hashes of nodes A and B will differ, but the data is consistent as of t = 10000
> (think of it like a snapshot).
> Impact:
> Unnecessary streaming happens. This may not have a big impact on low-traffic CFs
> or partitions, but on high-traffic CFs and maybe very big partitions it has a
> bigger impact and is a waste of resources.
> Possible solution:
> Build hashes based upon a snapshot timestamp.
> This requires SSTables created after that timestamp to be filtered when doing
> a validation compaction:
> - Cells with timestamp > snapshot time have to be removed
> - Tombstone range markers have to be handled
> - Bounds have to be removed if delete timestamp > snapshot time
> - Boundary markers have to be either changed to a bound or completely
> removed, depending on whether the start and/or the end are affected
> Probably this is a known behaviour. Have there been any discussions about
> this in the past? I did not find a matching issue, so I created this one.
> I am happy about any feedback whatsoever.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)