[ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049688#comment-13049688 ]

Sylvain Lebresne commented on CASSANDRA-2774:
---------------------------------------------

Consider 3 nodes A, B and C with RF=2 and a given counter c whose replica set 
is {B, C}.
Consider a single client issuing the following operations (in order) while 
connected to node A:
# client increments c by +2 at CL.ONE
# client deletes c at CL.ONE
# client increments c by +3 at CL.ONE
# client reads c at CL.ALL

The *only* valid answer the client should ever get on its last read is 3.  Any 
other value breaks the consistency level contract and is not something we can 
expect people to be happy with. Any other answer means that deletes are broken 
(and this *is* the problem with the current implementation).

However, because the writes are made at CL.ONE in the example above, at the 
time the read is issued the only thing we know for sure is that each write has 
been received by at least one node, but not necessarily the same node each 
time.  Depending on the actual timing and on which node happens to acknowledge 
each write, by the time the read reaches the replicas you can end up in many 
different situations, including:
* B and C have both received the 3 writes in the right order; they will both 
return 3, the 'right' answer.
* B received only the deletion (the two increments are still on the wire, yet 
to be received) and C received only the two increments (the delete is still on 
the wire, yet to be received). B will return the tombstone, C will return 5. 
You can assign whatever epoch numbers you want, there is no way to return 3 to 
the client: it will be either 5 or 0.
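
To make the second situation concrete, here is a minimal, self-contained 
sketch (plain Java, not Cassandra code; all names are made up for the 
illustration) of what each replica can answer from the writes it has actually 
received, and why reconciling those answers can never produce 3:

{code:java}
// Minimal sketch, NOT Cassandra code: models what each replica has received
// and what it would answer locally. Names and structure are illustrative only.
import java.util.*;

public class CounterDeleteScenario {
    // An operation is either an increment (delta != null) or a delete tombstone.
    record Op(long timestamp, Long delta) {}

    // What a single replica answers from the ops it has actually received:
    // increments newer than the newest tombstone are summed; if nothing
    // survives and a tombstone exists, the replica answers "tombstone".
    static String replicaAnswer(List<Op> received) {
        long tombstone = Long.MIN_VALUE;
        for (Op op : received)
            if (op.delta() == null) tombstone = Math.max(tombstone, op.timestamp());
        long sum = 0;
        boolean live = false;
        for (Op op : received)
            if (op.delta() != null && op.timestamp() > tombstone) { sum += op.delta(); live = true; }
        return live ? Long.toString(sum) : (tombstone == Long.MIN_VALUE ? "0" : "tombstone");
    }

    public static void main(String[] args) {
        Op inc2 = new Op(1, 2L);   // client increments by +2
        Op del  = new Op(2, null); // client deletes
        Op inc3 = new Op(3, 3L);   // client increments by +3

        // Situation 1: both replicas received the 3 writes in order -> both say 3.
        System.out.println(replicaAnswer(List.of(inc2, del, inc3))); // 3

        // Situation 2: B only received the delete, C only the two increments.
        System.out.println(replicaAnswer(List.of(del)));        // tombstone
        System.out.println(replicaAnswer(List.of(inc2, inc3))); // 5
        // A CL.ALL read now has to reconcile "tombstone" with "5": whichever
        // wins, the client sees 0 or 5, never the expected 3.
    }
}
{code}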

So the same query will give different answers depending on the internal timing 
of events, and will sometimes return an answer that breaks the contract. 
Removes of counters are broken, and the only safe way to use them is for 
permanent removal with no subsequent increments. This patch doesn't fix that.

Btw, it's not too hard to come up with the same kind of example using only 
QUORUM reads and writes (but you'll need one more replica and a few more steps).


> one way to make counter delete work better
> ------------------------------------------
>
>                 Key: CASSANDRA-2774
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
>             Project: Cassandra
>          Issue Type: New Feature
>    Affects Versions: 0.8.0
>            Reporter: Yang Yang
>         Attachments: counter_delete.diff
>
>
> Current counters do not work with delete, because different merge orders of 
> sstables produce different results. For example, given the operations:
> 1. add 1
> 2. delete
> 3. add 2
> if the merging happens in the order (1,2), then ((1,2),3), the result we see 
> will be 2; if the merging is (1,3), then ((1,3),2), the result will be 3.
> The issue is that a delete currently cannot separate the adds made before it 
> from the adds made after it. Conceptually, a delete should create a 
> completely new incarnation of the counter, a new "lifetime" or "epoch". The 
> new approach uses the concept of an "epoch number": each delete bumps up the 
> epoch number. Since each write is replicated (replicate on write is almost 
> always enabled in practice; if this is a concern, we could further force ROW 
> for deletes), the epoch number is global to the replica set (see the sketch 
> after this quoted description).
> The changes are attached. Existing tests pass fine; some tests are modified 
> since the semantics change a bit. Some CQL tests do not pass in the original 
> 0.8.0 source, but that is not the fault of this change.
> See details at 
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3cbanlktikqcglsnwtt-9hvqpseoo7sf58...@mail.gmail.com%3E
> The goal of this is to make delete work (at least with consistent behavior; 
> yes, in case of a long network partition the behavior is not ideal, but it is 
> consistent with the definition of a logical clock), so that we could have 
> expiring counters.
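
For reference, here is a minimal sketch of the "epoch number" idea from the 
quoted report (plain Java, illustrative only; this is not the attached patch, 
and it assumes every replica has already tagged the later add with the bumped 
epoch, which is exactly the assumption questioned in the comment above):

{code:java}
// Minimal sketch, NOT the attached patch: a counter whose delete bumps an
// epoch, so increments from before the delete never mix with later ones.
import java.util.*;

public class EpochCounterSketch {
    // An increment tagged with the epoch that was current when it was applied.
    record Increment(int epoch, long delta) {}

    static long valueOf(Collection<Increment> increments) {
        // Only increments from the highest epoch seen count; older epochs
        // belong to a previous "lifetime" of the counter and are discarded.
        int latest = increments.stream().mapToInt(Increment::epoch).max().orElse(0);
        return increments.stream()
                         .filter(i -> i.epoch() == latest)
                         .mapToLong(Increment::delta)
                         .sum();
    }

    public static void main(String[] args) {
        // The report's example: add 1, then a delete bumps the epoch 0 -> 1,
        // then add 2 is tagged with the new epoch.
        Increment add1 = new Increment(0, 1);
        Increment add2 = new Increment(1, 2);

        // Merge order no longer matters: any order of the same tagged
        // increments yields 2, because the epoch-0 shard is dropped.
        System.out.println(valueOf(List.of(add1, add2))); // 2
        System.out.println(valueOf(List.of(add2, add1))); // 2
    }
}
{code}

The catch, as argued in the comment above, is that with CL.ONE and out-of-order 
delivery a replica may tag an add issued after the delete with the old epoch 
because it has not yet seen the delete, which is how the counterexample above 
still ends up returning 5 or 0.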
