[ 
https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051829#comment-13051829
 ] 

Sylvain Lebresne commented on CASSANDRA-2774:
---------------------------------------------

{quote}
let's say we have 4 nodes, A B C D. all the traffic we observe is, with 
increasing timestamp():

A leader add 1 ts=100
B leader delete ts=200
C leader add 2 ts=300

now the updates so far start to replicate to D: assume that D sees the 
following order: A.(add 1), C.(add 2), B.(delete), after these, D's state is:
[A:1 C:2, last_delete=200, timestamp=300]

now let's all the traffic between A,B,C go through, and they fully resolve 
(receiving pair-wise messages and etc), so A B C all come to state: [A:nil C:2, 
last_delete=200 timestamp=300]

now A's state and D's state are different, let's say we let A repair D, A's 
A-shard has a lower clock, so D will win; if we let D repair A, A's A-shard is 
isDelta(), so it trumps D. as a result it seems we never reach agreement 
between A and D, even though traffic is allowed to flow freely.
{quote}

This is *not* how the counter implementation works. In the implementation, only 
A is ever able to increment its own clock. As a consequence, it is impossible 
for other nodes to have a version of A's shard that is greater than what A 
has. That "scenario" is not a valid scenario.
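
A toy model of that invariant, in Python. The Replica class and its methods 
are hypothetical names for illustration only, not the Cassandra code, and the 
model assumes the simplified rule that a shard is a (clock, count) pair where 
the copy with the higher clock wins on merge.

```python
class Replica:
    """Hypothetical, simplified replica holding one (clock, count) pair per shard."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.shards = {}  # node id -> (clock, count)

    def local_add(self, delta):
        # Only the owner of a shard ever bumps that shard's clock.
        clock, count = self.shards.get(self.node_id, (0, 0))
        self.shards[self.node_id] = (clock + 1, count + delta)

    def receive(self, other):
        # Assumed merge rule: per shard, keep the copy with the higher clock.
        for node_id, (clock, count) in other.shards.items():
            known_clock, _ = self.shards.get(node_id, (0, 0))
            if clock > known_clock:
                self.shards[node_id] = (clock, count)


a = Replica("A")
d = Replica("D")

a.local_add(1)
d.receive(a)      # D learns A's shard at clock 1
a.local_add(2)    # A moves ahead; D can only lag behind

# D's view of A's shard can never be newer than A's own view.
assert d.shards["A"][0] <= a.shards["A"][0]
```

Under these assumptions D can only copy what A has already produced, so D's 
clock for A's shard is at most A's own.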

Now, looking at your patch a bit more closely, I actually fail to see how it 
changes anything. You do impose a read during the write, but it only changes the 
CounterColumn.timestampOfLastDelete field. Hence the code is still dependent on 
the merge order of sstables. To be more concrete, even on a single-node 
cluster, suppose you do 3 writes (received in that exact order since we're 
considering only one node): +1, then delete, then +1. The first +1 will have an 
initial "epoch", let's say 0, the delete will have a bigger epoch, let's say 1, 
and the second +1 will inherit that epoch 1. But nothing forces all those 
updates to be in the same sstable, and if they are in different sstables and 
the two increments are merged first, it will result in a +2 with epoch 1 that, 
when merged with the tombstone, will just discard it (per the rules of your 
patch), and we will finally return +2 to the client. But if the merge order is 
different and we first merge the first increment with the tombstone and then 
with the second increment, the final result will be +1. That is exactly the 
merge-order dependence the code already has.
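
The single-node example can be replayed with a toy model in Python. The Cell 
type and merge function are hypothetical and only encode the rules as 
described in this comment (two increments sum and keep the larger epoch; an 
increment whose epoch is at least the tombstone's discards the tombstone). 
They are not the actual Cassandra code or the patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Hypothetical counter cell; value=None marks a tombstone."""
    epoch: int
    value: Optional[int]


def merge(a: Cell, b: Cell) -> Cell:
    a_tomb, b_tomb = a.value is None, b.value is None
    if a_tomb and b_tomb:
        # Two tombstones: keep the one with the larger epoch.
        return a if a.epoch >= b.epoch else b
    if not a_tomb and not b_tomb:
        # Two increments: sum the values, keep the larger epoch.
        return Cell(max(a.epoch, b.epoch), a.value + b.value)
    tomb, inc = (a, b) if a_tomb else (b, a)
    # Increment vs tombstone: the increment survives only if its epoch
    # is at least the tombstone's epoch.
    return inc if inc.epoch >= tomb.epoch else tomb


first_add = Cell(epoch=0, value=1)
delete = Cell(epoch=1, value=None)
second_add = Cell(epoch=1, value=1)  # inherits the delete's epoch

# Order 1: the two increments are merged first, then the tombstone.
increments_first = merge(merge(first_add, second_add), delete)

# Order 2: the first increment meets the tombstone before the second one.
tombstone_first = merge(merge(first_add, delete), second_add)

print(increments_first.value, tombstone_first.value)
```

In this model the same three writes yield two different values depending only 
on which pair is merged first.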


> one way to make counter delete work better
> ------------------------------------------
>
>                 Key: CASSANDRA-2774
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
>             Project: Cassandra
>          Issue Type: New Feature
>    Affects Versions: 0.8.0
>            Reporter: Yang Yang
>         Attachments: counter_delete.diff
>
>
> current Counter does not work with delete, because different merging order of 
> sstables would produces different result, for example:
> add 1
> delete 
> add 2
> if the merging happens by 1-2, (1,2)--3  order, the result we see will be 2
> if merging is: 1--3, (1,3)--2, the result will be 3.
> the issue is that delete now can not separate out previous adds and adds 
> later than the delete. supposedly a delete is to create a completely new 
> incarnation of the counter, or a new "lifetime", or "epoch". the new approach 
> utilizes the concept of "epoch number", so that each delete bumps up the 
> epoch number. since each write is replicated (replicate on write is almost 
> always enabled in practice, if this is a concern, we could further force ROW 
> in case of delete ), so the epoch number is global to a replica set
> changes are attached, existing tests pass fine, some tests are modified since 
> the semantic is changed a bit. some cql tests do not pass in the original 
> 0.8.0 source, that's not the fault of this change.
> see details at 
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3cbanlktikqcglsnwtt-9hvqpseoo7sf58...@mail.gmail.com%3E
> the goal of this is to make delete work ( at least with consistent behavior, 
> yes in case of long network partition, the behavior is not ideal, but it's 
> consistent with the definition of logical clock), so that we could have 
> expiring Counters

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

