[ https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730053#comment-15730053 ]

Stefan Podkowinski commented on CASSANDRA-12991:
------------------------------------------------


Please keep in mind that there can be concurrent validation compactions (for 
different token ranges and tables). See CASSANDRA-9491 for an overview of how 
repair units are created and how badly this behaviour will affect sequential 
repairs. Coordinating flushes based on a timestamp would also require flushing 
only certain token ranges and would in turn leave you with lots of tiny 
sstables in the beginning.

I'd also expect ValidationRequest messages to be handled within at most a few 
milliseconds in normal cases. A replica set would have to ingest a substantial 
number of writes for the scenario described in the ticket to happen within 
this small timeframe. Even so, how bad would it be to repair a couple of keys 
that are affected by this? We really should get some data here and do the math 
before making the repair process even more complex, just to apply minor 
optimizations that will not make a difference for 90% of clusters anyway.
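
As a rough illustration with entirely assumed numbers: at 10,000 writes/s on a 
replica and a 5 ms gap between the two validation starts, at most 
10,000 * 0.005 = 50 writes could land in that window, so only on the order of 
a few dozen keys could even hash differently.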

Maybe as a start we should simply add a timestamp to the ValidationRequest as 
suggested, but only log the time difference upon validation compaction on the 
remote node. We could also log a warning if the interval grows larger than a 
certain threshold value.
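
For illustration, a minimal sketch of that measurement, assuming a hypothetical 
creation timestamp carried in the ValidationRequest and an arbitrary warn 
threshold (class, method and field names are placeholders, not the actual 
Cassandra API):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only; all names and the threshold value are assumptions.
public class ValidationTiming
{
    private static final Logger logger = LoggerFactory.getLogger(ValidationTiming.class);
    private static final long WARN_THRESHOLD_MS = 1000; // assumed threshold

    // Called on the replica right before the validation compaction starts;
    // requestCreationTimeMillis would be the timestamp carried in the ValidationRequest.
    public static void logValidationDelay(long requestCreationTimeMillis)
    {
        long delayMs = System.currentTimeMillis() - requestCreationTimeMillis;
        logger.debug("Validation started {} ms after the ValidationRequest was created", delayMs);
        if (delayMs > WARN_THRESHOLD_MS)
            logger.warn("ValidationRequest is {} ms old (threshold {} ms); writes arriving in this window can make merkle trees diverge",
                        delayMs, WARN_THRESHOLD_MS);
    }
}

Note that the delta also includes clock skew between coordinator and replica, 
so the numbers would only be indicative.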

> Inter-node race condition in validation compaction
> --------------------------------------------------
>
>                 Key: CASSANDRA-12991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benjamin Roth
>            Priority: Minor
>
> Problem:
> When a validation compaction is triggered by a repair, it may happen that due 
> to in-flight mutations the merkle trees differ even though the data is 
> actually consistent.
> Example:
> t = 10000: 
> Repair starts, triggers validations
> Node A starts validation
> t = 10001:
> Mutation arrives at Node A
> t = 10002:
> Mutation arrives at Node B
> t = 10003:
> Node B starts validation
> Hashes of nodes A+B will differ, but the data is consistent as of (think of 
> it like a snapshot at) t = 10000.
> Impact:
> Unnecessary streaming happens. This may not have a big impact on low-traffic 
> CFs or partitions, but on high-traffic CFs and maybe very big partitions it 
> may have a bigger impact and is a waste of resources.
> Possible solution:
> Build hashes based upon a snapshot timestamp.
> This requires SSTables created after that timestamp to be filtered when doing 
> a validation compaction:
> - Cells with timestamp > snapshot time have to be removed
> - Tombstone range markers have to be handled
>  - Bounds have to be removed if delete timestamp > snapshot time
>  - Boundary markers have to be either changed to a bound or completely 
> removed, depending on whether the start, the end, or both are affected
> Probably this is a known behaviour. Have there been any discussions about 
> this in the past? I did not find a matching issue, so I created this one.
> I am happy about any feedback whatsoever.
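
For what it's worth, the cell-level filtering rule proposed in the ticket could 
look roughly like the following sketch. The Cell record is a self-contained 
stand-in, not Cassandra's internal Cell/Unfiltered API, and the sketch ignores 
the range tombstone handling listed above:

import java.util.List;
import java.util.stream.Collectors;

public class SnapshotFilter
{
    // Stand-in for a cell: a name, its write timestamp, and a value.
    record Cell(String name, long writeTimestamp, byte[] value) {}

    // Keep only cells written at or before the snapshot time; anything newer is
    // treated as not-yet-written for the purpose of building the merkle tree.
    static List<Cell> visibleAt(List<Cell> cells, long snapshotTime)
    {
        return cells.stream()
                    .filter(c -> c.writeTimestamp() <= snapshotTime)
                    .collect(Collectors.toList());
    }
}

Range tombstone bounds and boundary markers would need the analogous treatment 
described in the ticket; this only covers the plain cell case.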


