[ https://issues.apache.org/jira/browse/CASSANDRA-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599444#comment-16599444 ]
Jordan West edited comment on CASSANDRA-14145 at 9/1/18 1:16 AM: ----------------------------------------------------------------- Thanks [~beobal]. The recent updates are a big improvement. The {{InputCollector}} changes make things much more readable. I feel like we've reduced the footprint and risk considerably which is making me more comfortable about merging this late in the game. The feature is well hidden behind a flag: if tracking is not enabled we add a couple bytes to the internode messaging protocol that are not used but otherwise the changes to the existing path are negligible; {{InputCollector}} probably being the biggest. I do have one comment below I think should be addressed before merge. The other one we can open an open an improvement JIRA for. So I will give my +1 assuming the comment below is addressed so that we don't miss the deadline due to tz differences. We'll definitely need to test this for correctness and performance further but that can happen after 9/1. * This change I think should be made before we merge: In {{InputCollector}} is {{repairedSSTables}} necessary? It seems like an extra upfront allocation and iteration that we could skip. It also widens the unconfirmed window slightly since the status could change between the call to the constructor and the call to {{addSSTableIterator}}. What about going back to the on-demand allocation of {{repairedIters}} and checking if its {{null}} in {{finalizeIterators}}? * This can be addressed in a subsequent JIRA if you agree: In {{RepairedDataVerifier}}, we could more quickly report confirmed failures if we separated tracking of confirmed and unconfirmed digests. If the number of confirmed digests is > 1 we still have a confirmed issue regardless of the number of unconfirmed digests. We could separately increment unconfirmed in this case if the unconfirmed digests don't match any of the confirmed ones (if it does we would assume its consistent). Minor Nits (up to you if you want to fix before merge): * Re: comments in cassandra.yaml, until we’ve benchmarked the changes maybe we shouldn’t try to characterize (“slight”) the performance impact besides to say it exists. Same goes for the identical comment in {{Config.java}} was (Author: jrwest): Thanks [~beobal]. The recent updates are a big improvement. The {{InputCollector}} changes make things much more readable. I feel like we've reduced the footprint and risk considerably which is making me more comfortable about merging this late in the game. The feature is well hidden behind a flag (if tracking is not enabled we add a couple bytes to the internode messaging protocol that are not used but otherwise the changes to the existing path are negligible; {{InputCollector}} probably being the biggest. I do have one comment below I think should be addressed before merge. The other one we can open an open an improvement JIRA for. So I will give my +1 assuming the comment below is addressed so that we don't miss the deadline due to tz differences. We'll definitely need to test this for correctness and performance further but that can happen after 9/1. * This change I think should be made before we merge: In {{InputCollector}} is {{repairedSSTables}} necessary? It seems like an extra upfront allocation and iteration that we could skip. It also widens the unconfirmed window slightly since the status could change between the call to the constructor and the call to {{addSSTableIterator}}. What about going back to the on-demand allocation of {{repairedIters}} and checking if its {{null}} in {{finalizeIterators}}? * This can be addressed in a subsequent JIRA if you agree: In {{RepairedDataVerifier}}, we could more quickly report confirmed failures if we separated tracking of confirmed and unconfirmed digests. If the number of confirmed digests is > 1 we still have a confirmed issue regardless of the number of unconfirmed digests. We could separately increment unconfirmed in this case if the unconfirmed digests don't match any of the confirmed ones (if it does we would assume its consistent). Minor Nits (up to you if you want to fix before merge): * Re: comments in cassandra.yaml, until we’ve benchmarked the changes maybe we shouldn’t try to characterize (“slight”) the performance impact besides to say it exists. Same goes for the identical comment in {{Config.java}} > Detecting data resurrection during read > ---------------------------------------- > > Key: CASSANDRA-14145 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14145 > Project: Cassandra > Issue Type: Improvement > Reporter: sankalp kohli > Assignee: Sam Tunnicliffe > Priority: Minor > Fix For: 4.x > > > We have seen several bugs in which deleted data gets resurrected. We should > try to see if we can detect this on the read path and possibly fix it. Here > are a few examples which brought back data > A replica lost an sstable on startup which caused one replica to lose the > tombstone and not the data. This tombstone was past gc grace which means this > could resurrect data. We can detect such invalid states by looking at other > replicas. > If we are running incremental repair, Cassandra will keep repaired and > non-repaired data separate. Every-time incremental repair will run, it will > move the data from non-repaired to repaired. Repaired data across all > replicas should be 100% consistent. > Here is an example of how we can detect and mitigate the issue in most cases. > Say we have 3 machines, A,B and C. All these machines will have data split > b/w repaired and non-repaired. > 1. Machine A due to some bug bring backs data D. This data D is in repaired > dataset. All other replicas will have data D and tombstone T > 2. Read for data D comes from application which involve replicas A and B. The > data being read involves data which is in repaired state. A will respond > back to co-ordinator with data D and B will send nothing as tombstone is past > gc grace. This will cause digest mismatch. > 3. This patch will only kick in when there is a digest mismatch. Co-ordinator > will ask both replicas to send back all data like we do today but with this > patch, replicas will respond back what data it is returning is coming from > repaired vs non-repaired. If data coming from repaired does not match, we > know there is a something wrong!! At this time, co-ordinator cannot determine > if replica A has resurrected some data or replica B has lost some data. We > can still log error in the logs saying we hit an invalid state. > 4. Besides the log, we can take this further and even correct the response to > the query. After logging an invalid state, we can ask replica A and B (and > also C if alive) to send back all data for this including gcable tombstones. > If any machine returns a tombstone which is after this data, we know we > cannot return this data. This way we can avoid returning data which has been > deleted. > Some Challenges with this > 1. When data will be moved from non-repaired to repaired, there could be a > race here. We can look at which incremental repairs have promoted things on > which replica to avoid false positives. > 2. If the third replica is down and live replica does not have any tombstone, > we wont be able to break the tie in deciding whether data was actually > deleted or resurrected. > 3. If the read is for latest data only, we wont be able to detect it as the > read will be served from non-repaired data. > 4. If the replica where we lose a tombstone is the last replica to compact > the tombstone, we wont be able to decide if data is coming back or rest of > the replicas has lost that data. But we will still detect something is wrong. > 5. We wont affect 99.9% of the read queries as we only do extra work during > digest mismatch. > 6. CL.ONE reads will not be able to detect this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org