[ 
https://issues.apache.org/jira/browse/CASSANDRA-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593851#comment-16593851
 ] 

Marcus Eriksson commented on CASSANDRA-14145:
---------------------------------------------

pushed a branch [here|https://github.com/krummas/cassandra/commits/sam/14145] 
with initial comments:
* Only allocate repairedIterators if needed
* Since sstables can jump from pending to repaired at any time, we need to 
check {{isRepaired}} last (and grab the pending repair id first to avoid an 
NPE) - see the sketch after this list
* Skip the unrepaired sstables when we are adding all the repaired sstables for 
digest calculation
* In {{considerRepairedForTracking}} we should only return true if we are 
tracking repaired status
* A few inline comments/questions:
** Can we really {{break}} if {{reduceFilter}} returns null in 
{{SinglePartitionReadCommand}}?
** I don't think {{WithDigestAndLowerBound}} actually works: the iterator 
will not be {{instanceof IteratorWithLowerBound}} after merging, right? It 
should probably get dedicated tests as well once fixed
** Could we keep the old behaviour in 
{{queryMemtableAndSSTablesInTimestampOrder}} (i.e. don't collect the 
iterators, just build the result) when not tracking repaired status? The 
code might get a lot uglier though
** In {{DefaultRepairedDataInfo}}, is it worth keeping around a set of 
pending repairs when the only use (outside of tests) is to check whether it 
is empty?
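
To make the {{isRepaired}}-ordering comment above concrete, here is a minimal 
sketch of the idea. All type and method names below are hypothetical 
stand-ins, not the patch's (or Cassandra's) actual code:

{code:java}
import java.util.UUID;

/**
 * Illustrative sketch only: hypothetical types standing in for the sstable
 * metadata accessors, to show why the pending repair id should be read before
 * the final isRepaired() check.
 */
public class RepairedStatusCheckSketch
{
    /** Hypothetical view of the bits of sstable metadata that matter here. */
    interface SSTableStatus
    {
        boolean isRepaired();
        UUID pendingRepair(); // null when the sstable is not pending repair
    }

    /**
     * An sstable can be promoted from pending to repaired between two reads of
     * its metadata, so grab the pending id first and check isRepaired() last;
     * a concurrent promotion then cannot leave us holding a null pending id.
     */
    static void classify(SSTableStatus sstable)
    {
        UUID pendingId = sstable.pendingRepair(); // read first, may be null
        if (sstable.isRepaired())
            System.out.println("repaired: include in the repaired digest");
        else if (pendingId != null)
            System.out.println("pending repair: " + pendingId);
        else
            System.out.println("unrepaired: not part of the repaired digest");
    }
}
{code}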

Other thoughts:
* We should probably add metrics for how many extra sstables are touched due 
to tracking repaired digests (see the sketch after this list)
* It would be nice if we had Tracing info for these digest comparisons
* Could we enable this globally for dtest runs? If there is a dtest that 
deliberately messes up the repaired data, we can disable this feature for 
those tests
* A couple of end-to-end dtests should probably be added
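
For the metrics suggestion above, a rough sketch of what such a counter could 
look like. The registry and metric names here are invented for illustration; 
in the actual patch this would presumably live with the existing table-level 
metrics:

{code:java}
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;

/**
 * Rough illustration only: a counter for sstables that were read purely to
 * build the repaired-data digest (i.e. that would have been skipped if
 * repaired-status tracking were disabled). Names are invented for the example.
 */
public class RepairedTrackingMetricsSketch
{
    private static final MetricRegistry registry = new MetricRegistry();

    /** Extra sstables touched only because repaired tracking is enabled. */
    public static final Counter extraSSTablesForRepairedDigest =
        registry.counter("RepairedDataTracking.ExtraSSTablesRead");

    public static void onExtraSSTablesRead(int count)
    {
        extraSSTablesForRepairedDigest.inc(count);
    }
}
{code}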

>  Detecting data resurrection during read
> ----------------------------------------
>
>                 Key: CASSANDRA-14145
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14145
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: Sam Tunnicliffe
>            Priority: Minor
>             Fix For: 4.x
>
>
> We have seen several bugs in which deleted data gets resurrected. We should 
> try to see if we can detect this on the read path and possibly fix it. Here 
> are a few examples which brought back data: 
> A replica lost an sstable on startup, which caused that replica to lose the 
> tombstone but not the data. The tombstone was past gc grace, which means it 
> could resurrect data. We can detect such invalid states by looking at other 
> replicas. 
> If we are running incremental repair, Cassandra will keep repaired and 
> non-repaired data separate. Every time incremental repair runs, it will 
> move data from non-repaired to repaired. Repaired data across all 
> replicas should be 100% consistent. 
> Here is an example of how we can detect and mitigate the issue in most cases. 
> Say we have 3 machines, A, B and C. All these machines will have data split 
> between repaired and non-repaired. 
> 1. Due to some bug, machine A brings back data D. This data D is in the 
> repaired dataset. All other replicas will have data D and tombstone T. 
> 2. A read for data D comes from the application and involves replicas A and 
> B. The data being read is in the repaired state. A will respond to the 
> co-ordinator with data D and B will send nothing, as the tombstone is past 
> gc grace. This will cause a digest mismatch. 
> 3. This patch will only kick in when there is a digest mismatch. The 
> co-ordinator will ask both replicas to send back all data like we do today, 
> but with this patch each replica will also indicate which of the data it 
> returns comes from repaired vs non-repaired sstables. If the data coming 
> from the repaired set does not match, we know something is wrong. At this 
> point the co-ordinator cannot determine whether replica A has resurrected 
> some data or replica B has lost some data, but we can still log an error 
> saying we hit an invalid state. 
> 4. Besides logging, we can take this further and even correct the response 
> to the query. After logging the invalid state, we can ask replicas A and B 
> (and also C, if alive) to send back all data for this read, including gcable 
> tombstones. If any machine returns a tombstone which is newer than this 
> data, we know we cannot return the data. This way we can avoid returning 
> data which has been deleted. 
> Some challenges with this: 
> 1. When data is moved from non-repaired to repaired, there could be a race. 
> We can look at which incremental repairs have promoted data on which replica 
> to avoid false positives. 
> 2. If the third replica is down and the live replica does not have any 
> tombstone, we won't be able to break the tie and decide whether data was 
> actually deleted or resurrected. 
> 3. If the read is for the latest data only, we won't be able to detect it, 
> as the read will be served from non-repaired data. 
> 4. If the replica where we lose a tombstone is the last replica to compact 
> the tombstone, we won't be able to decide whether data is coming back or the 
> rest of the replicas have lost that data, but we will still detect that 
> something is wrong. 
> 5. We won't affect 99.9% of read queries, as we only do extra work on a 
> digest mismatch. 
> 6. CL.ONE reads will not be able to detect this. 
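
A minimal sketch of the coordinator-side comparison described in steps 2 and 
3 above. The types and names are invented for illustration; it only shows the 
idea that repaired data must be identical across replicas, not the actual 
implementation:

{code:java}
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch of the check in steps 2-3 above: on a digest mismatch,
 * each replica's full-data response also carries a digest covering only the
 * data it read from repaired sstables, and the coordinator flags any
 * disagreement. All names here are invented for the example.
 */
public class RepairedDigestCheckSketch
{
    /** Hypothetical full-data response carrying a repaired-data digest. */
    static final class ReplicaResponse
    {
        final String replica;
        final byte[] repairedDataDigest;

        ReplicaResponse(String replica, byte[] repairedDataDigest)
        {
            this.replica = replica;
            this.repairedDataDigest = repairedDataDigest;
        }
    }

    /**
     * Repaired data should be identical on every replica, so any disagreement
     * between the repaired-data digests means resurrection or data loss
     * somewhere. The coordinator cannot tell which replica is wrong, only that
     * the state is invalid.
     */
    static boolean repairedDataMatches(List<ReplicaResponse> responses)
    {
        byte[] reference = null;
        for (ReplicaResponse response : responses)
        {
            if (reference == null)
                reference = response.repairedDataDigest;
            else if (!Arrays.equals(reference, response.repairedDataDigest))
            {
                System.err.println("Mismatching repaired data digest from " + response.replica
                                   + ": possible resurrection or lost tombstone");
                return false;
            }
        }
        return true;
    }
}
{code}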


