[ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051829#comment-13051829 ]
Sylvain Lebresne commented on CASSANDRA-2774:
---------------------------------------------

{quote}
let's say we have 4 nodes: A, B, C, D. all the traffic we observe is, with increasing timestamp():
A leader add 1 ts=100
B leader delete ts=200
C leader add 2 ts=300
now the updates so far start to replicate to D. assume that D sees the following order: A.(add 1), C.(add 2), B.(delete). after these, D's state is: [A:1 C:2, last_delete=200, timestamp=300]
now let all the traffic between A, B, C go through, and they fully resolve (exchanging pair-wise messages etc.), so A, B, C all come to state: [A:nil C:2, last_delete=200 timestamp=300]
now A's state and D's state are different. if we let A repair D, A's A-shard has a lower clock, so D will win; if we let D repair A, A's A-shard is isDelta(), so it trumps D. as a result it seems we never reach agreement between A and D, even though traffic is allowed to flow freely.
{quote}

This is *not* how the counter implementation works. In the implementation, only A is ever able to increment its own clock. As a consequence, it is impossible for another node to hold a version of A's shard with a clock greater than A's own. That "scenario" is not a valid scenario.

Now, looking at your patch a bit more closely, I actually fail to see how it changes anything. You do impose a read during the write, but it only changes the CounterColumn.timestampOfLastDelete field. Hence the code is still dependent on the merge order of sstables. To be more concrete, even on a single-node cluster, suppose you do 3 writes (received in that exact order, since we're considering only one node): +1, then delete, then +1. The first +1 will have an initial "epoch", let's say 0; the delete will have a bigger epoch, let's say 1; and the second +1 will inherit that epoch 1.
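The epoch assignment in that single-node example can be sketched as a toy loop (hypothetical names, a sketch of the patch's rule as described, not Cassandra's actual CounterColumn code):

```python
# Toy sketch of epoch assignment for the three writes above:
# each delete bumps the epoch (starting a new "lifetime"), and a
# later increment inherits the current epoch.

epoch = 0
writes = []
for op, value in [("add", 1), ("delete", None), ("add", 1)]:
    if op == "delete":
        epoch += 1  # the delete starts a new epoch
        writes.append(("tombstone", epoch, None))
    else:
        writes.append(("increment", epoch, value))

# writes is now:
# [("increment", 0, 1), ("tombstone", 1, None), ("increment", 1, 1)]
```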
But there is nothing that forces all those updates to be in the same sstable, and if they are in different sstables and the two increments are merged first, the result will be a +2 with epoch 1 that, when merged with the tombstone, will simply discard it (per the rules of your patch), and we will finally return +2 to the client. But if the merge order is different and we first merge the first increment with the tombstone and then with the second increment, the final result will be +1. Exactly as the code already does.

> one way to make counter delete work better
> ------------------------------------------
>
> Key: CASSANDRA-2774
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
> Project: Cassandra
> Issue Type: New Feature
> Affects Versions: 0.8.0
> Reporter: Yang Yang
> Attachments: counter_delete.diff
>
> the current Counter does not work with delete, because different merging orders of
> sstables produce different results, for example:
> add 1
> delete
> add 2
> if the merging happens in the order 1--2, (1,2)--3, the result we see will be 2;
> if the merging is 1--3, (1,3)--2, the result will be 3.
> the issue is that a delete currently cannot separate adds before it from adds
> after it. supposedly a delete is to create a completely new
> incarnation of the counter, a new "lifetime", or "epoch". the new approach
> utilizes the concept of an "epoch number", so that each delete bumps up the
> epoch number. since each write is replicated (replicate on write is almost
> always enabled in practice; if this is a concern, we could further force ROW
> in the case of delete), the epoch number is global to a replica set.
> changes are attached; existing tests pass fine, and some tests are modified since
> the semantics have changed a bit. some cql tests do not pass in the original
> 0.8.0 source; that's not the fault of this change.
> see details at
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3cbanlktikqcglsnwtt-9hvqpseoo7sf58...@mail.gmail.com%3E
> the goal of this is to make delete work (at least with consistent behavior;
> yes, in the case of a long network partition the behavior is not ideal, but it's
> consistent with the definition of a logical clock), so that we could have
> expiring Counters

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
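Sylvain's merge-order argument can be sketched as a toy model of the epoch rules as he reads the patch (all names are hypothetical illustration, not the actual patch or Cassandra code): two increments merge by summing and keeping the newest epoch, and an increment survives a tombstone only if its epoch is at least as new as the tombstone's.

```python
# Toy model: a cell is ("inc", epoch, value) or ("tomb", epoch, None).

def merge(a, b):
    kinds = {a[0], b[0]}
    if kinds == {"inc"}:
        # two increments sum their values and keep the newest epoch
        return ("inc", max(a[1], b[1]), a[2] + b[2])
    if kinds == {"tomb"}:
        # newer tombstone wins
        return a if a[1] >= b[1] else b
    inc, tomb = (a, b) if a[0] == "inc" else (b, a)
    # an increment survives a tombstone only if its epoch is at
    # least as new as the tombstone's
    return inc if inc[1] >= tomb[1] else tomb

inc1 = ("inc", 0, 1)      # first +1, epoch 0
tomb = ("tomb", 1, None)  # delete, epoch 1
inc2 = ("inc", 1, 1)      # second +1, inherits epoch 1

# Increments merged first, then the tombstone: the +2 carries epoch 1,
# so the tombstone is discarded and the client sees 2.
order_a = merge(merge(inc1, inc2), tomb)

# First increment merged with the tombstone first: the epoch-0 increment
# is discarded, and only the second +1 survives, so the client sees 1.
order_b = merge(merge(inc1, tomb), inc2)
```

The two orders return different values for the same three writes, which is the merge-order dependence Sylvain says the patch does not remove.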