[ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049834#comment-13049834 ]
Sylvain Lebresne commented on CASSANDRA-2774:
---------------------------------------------

bq. I think with quorum delete you will guarantee timing to be consistent with the client, and then achieve the client's expected result. In your case, I'd like to hear your counter example.

Consider a cluster with RF=3 and a counter c replicated on nodes A, B and C. Assume all operations are issued by the same client, connected to some other node (it can be the same node each time, but does not have to be). All operations are performed at QUORUM consistency level.

The client performs the following operations:
# increment c by 1
# delete c
# increment c by 1
# read c

Because QUORUM is 2, depending on internal timings (latency on the wire and such), either only 2 or all 3 nodes will have seen each write by the time it is acked to the client. So, for the same inputs and depending on timing, the read could return a variety of results:
* 1, if each node has received each operation in the order issued.
* 0 or 2 if, for instance, by the time the read is issued:
** the first increment has only reached B and C
** the deletion has only reached A and C
** the second increment has only reached A and B

and the first two nodes to answer the read happen to be B and C. The exact value depends on the exact rules for dealing with the epoch number, but in any case B would only have the two increments, and C would have the first increment and the deletion (issued after the increment, so the deletion wins). So B will answer 2 and C will answer a tombstone. Whatever resolution the coordinator does, it simply cannot return 1 in that case.
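The interleaving above can be replayed with a small toy model. This is only a sketch under stated assumptions, not Cassandra code: each replica applies the operations that reached it in issue order, a delete leaves a tombstone, and the coordinator is assumed to return one of the two replica answers (a tombstone read back as 0). The names `apply_op`, `delivered` and `possible_results` are made up for the illustration.

```python
TOMBSTONE = "tombstone"

def apply_op(state, op):
    """Apply one operation to a replica's local view of counter c."""
    if op == "delete":
        return TOMBSTONE
    # op is an increment amount; an unseen or deleted counter restarts from 0
    base = 0 if state in (None, TOMBSTONE) else state
    return base + op

# Operations in issue order, each with the two replicas it had reached
# when the QUORUM ack went back to the client (the case from the comment).
delivered = [
    (1, {"B", "C"}),         # first increment
    ("delete", {"A", "C"}),  # deletion
    (1, {"A", "B"}),         # second increment
]

replicas = {"A": None, "B": None, "C": None}
for op, reached in delivered:
    for node in reached:
        replicas[node] = apply_op(replicas[node], op)

def possible_results(first, second):
    """Values the coordinator could return if it picks either answer
    (assumption: it never invents a value no replica reported)."""
    answers = (replicas[first], replicas[second])
    return {0 if a == TOMBSTONE else a for a in answers}

print(replicas["B"], replicas["C"])  # B holds both increments, C holds a tombstone
print(possible_results("B", "C"))    # the expected answer 1 is not among these
```

Under these assumptions B reports 2 and C reports a tombstone, so the read resolves to 0 or 2 depending on which answer wins, and never to 1.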
> one way to make counter delete work better
> ------------------------------------------
>
>                 Key: CASSANDRA-2774
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
>             Project: Cassandra
>          Issue Type: New Feature
>    Affects Versions: 0.8.0
>            Reporter: Yang Yang
>         Attachments: counter_delete.diff
>
>
> current Counter does not work with delete, because different merging orders of
> sstables would produce different results, for example:
> add 1
> delete
> add 2
> if the merging happens in 1--2, (1,2)--3 order, the result we see will be 2
> if merging is: 1--3, (1,3)--2, the result will be 3.
> the issue is that delete currently cannot separate out previous adds from adds
> later than the delete. supposedly a delete is to create a completely new
> incarnation of the counter, or a new "lifetime", or "epoch". the new approach
> utilizes the concept of "epoch number", so that each delete bumps up the
> epoch number. since each write is replicated (replicate on write is almost
> always enabled in practice, if this is a concern, we could further force ROW
> in case of delete ), the epoch number is global to a replica set
> changes are attached, existing tests pass fine, some tests are modified since
> the semantics are changed a bit. some cql tests do not pass in the original
> 0.8.0 source, that's not the fault of this change.
> see details at
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3cbanlktikqcglsnwtt-9hvqpseoo7sf58...@mail.gmail.com%3E
> the goal of this is to make delete work ( at least with consistent behavior,
> yes in case of long network partition, the behavior is not ideal, but it's
> consistent with the definition of logical clock), so that we could have
> expiring Counters
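The merge-order problem in the description (add 1, delete, add 2) can be reproduced with a deliberately simplified model. It is a sketch, not Cassandra's actual merge code: it assumes that merging two adds sums their values and keeps the larger timestamp, and that a tombstone shadows a counter cell only when the tombstone is at least as recent. The cell layout and the `merge` helper are hypothetical.

```python
def merge(a, b):
    """Merge two cells of the form (kind, value, timestamp),
    where kind is "add" or "del" (value is None for a delete)."""
    kind_a, val_a, ts_a = a
    kind_b, val_b, ts_b = b
    if kind_a == "add" and kind_b == "add":
        # Summing the adds loses the fact that one of them predates a delete
        return ("add", val_a + val_b, max(ts_a, ts_b))
    if kind_a == "del" and kind_b == "del":
        return ("del", None, max(ts_a, ts_b))
    add, tombstone = (a, b) if kind_a == "add" else (b, a)
    # The tombstone wins only if it is at least as recent as the counter cell
    return tombstone if tombstone[2] >= add[2] else add

s1 = ("add", 1, 1)     # sstable 1: add 1
s2 = ("del", None, 2)  # sstable 2: delete
s3 = ("add", 2, 3)     # sstable 3: add 2

order_a = merge(merge(s1, s2), s3)  # 1--2, then (1,2)--3
order_b = merge(merge(s1, s3), s2)  # 1--3, then (1,3)--2

print(order_a[1])  # 2
print(order_b[1])  # 3
```

In this model the merge is not associative: once sstables 1 and 3 are merged, the summed cell carries the newer timestamp and the delete can no longer remove the contribution of the first add.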