[ https://issues.apache.org/jira/browse/CASSANDRA-14592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587593#comment-16587593 ]
Benedict commented on CASSANDRA-14592:
--------------------------------------

Pushed an update that addresses (I think, it's been a while) Aleksey's offline review comments.

We collaborated to modify the reconcile semantics a little further, so that reconciliation is as consistent as possible. Now the only situations that might arise with inconsistent reconciliation occur when one cell is expiring, another is a tombstone, and only at the point where both are logically a tombstone.

Specifically, we now prefer:
# The most recent timestamp
# If either are a tombstone or expiring
## If one is regular, select the tombstone or expiring
## If one is expiring, select the tombstone
## The most recent deletion time
# The highest value (by raw ByteBuffer comparison)

> Reconcile should not be dependent on nowInSec
> ---------------------------------------------
>
>                 Key: CASSANDRA-14592
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14592
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Benedict
>            Assignee: Benedict
>            Priority: Major
>             Fix For: 4.0
>
>
> To have the arrival time of a mutation on a replica determine the
> reconciliation priority seems to provide for unintuitive database behaviour.
> It seems we should formalise our reconciliation logic in a manner that does
> not depend on this, and modify our internal APIs to prevent this dependency.
>
> Take the following example, where both writes have the same timestamp:
>
> Write X with a value A, TTL of 1s
> Write Y with a value B, no TTL
>
> If X and Y arrive on replicas in < 1s, X and Y are both live, so record Y
> wins the reconciliation. The value B appears in the database.
> However, if X and Y arrive on replicas in > 1s, X is now (effectively) a
> tombstone. This wins the reconciliation race, and NO value is the result.
>
> Note that the weirdness of this is more pronounced than it might first
> appear.
> If write X gets stuck in hints for a period on the coordinator to
> one replica, the value B appears in the database until the hint is replayed.
> So now we’re in a very uncertain state - will hints get replayed or not? If
> they do, the value B will disappear; if they don’t, it won’t. This is despite
> a QUORUM of replicas ACKing both writes, and a QUORUM of readers being
> engaged on read; the database still changes state to the user suddenly at
> some arbitrary future point in time.
>
> It seems to me that a simple solution to this is to permit TTL’d data to
> always win a reconciliation against non-TTL’d data (of same timestamp), so
> that we are consistent across TTLs being transformed into tombstones.
>
> 4.0 seems like a good opportunity to fix this behaviour, and mention it in
> CHANGES.txt.
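
For illustration only, the preference order listed in the comment can be sketched as a small Python model. This is not the Cassandra patch: `Kind`, `Cell`, `local_deletion_time` and `reconcile` are hypothetical names chosen for the sketch, Python `bytes` ordering stands in for the raw ByteBuffer comparison, and the reading that equal deletion times fall through to the value comparison is an assumption. The point it tries to show is that the function takes no `nowInSec` argument, so the X/Y example from the issue resolves the same way whenever the writes arrive.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Kind(IntEnum):
    # Ordered so that a higher kind wins a same-timestamp tie:
    # regular < expiring < tombstone.
    REGULAR = 0
    EXPIRING = 1
    TOMBSTONE = 2


@dataclass(frozen=True)
class Cell:
    timestamp: int
    value: bytes = b""
    kind: Kind = Kind.REGULAR
    # Must be set (seconds) for expiring cells and tombstones;
    # left as None for regular cells.
    local_deletion_time: Optional[int] = None


def reconcile(a: Cell, b: Cell) -> Cell:
    """Pick the winning cell without consulting the current time."""
    # 1. The most recent timestamp wins.
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # 2. Either cell is a tombstone or expiring.
    if a.kind != Kind.REGULAR or b.kind != Kind.REGULAR:
        # 2a/2b. Tombstone beats expiring, and both beat regular.
        if a.kind != b.kind:
            return a if a.kind > b.kind else b
        # 2c. Same kind: the most recent deletion time wins.
        if a.local_deletion_time != b.local_deletion_time:
            return a if a.local_deletion_time > b.local_deletion_time else b
    # 3. The highest value wins (bytes comparison as a stand-in).
    return a if a.value >= b.value else b


# The example from the issue: same timestamp, X has a TTL, Y does not.
x = Cell(timestamp=1, value=b"A", kind=Kind.EXPIRING, local_deletion_time=101)
y = Cell(timestamp=1, value=b"B")
winner = reconcile(x, y)
```

Under these assumptions `winner` is X both before and after its TTL elapses, which matches the behaviour the issue asks for: the result does not change depending on when a replica happens to see the two writes.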