[ 
https://issues.apache.org/jira/browse/CASSANDRA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915683#action_12915683
 ] 

Sylvain Lebresne commented on CASSANDRA-1546:
---------------------------------------------

(I updated the patch because I've found a way to simplify the code a bit: I've
removed the special deserialization function in ColumnSerializer, in case
anyone has already read the code.)

Allow me to explain a little further how this works before answering the
preceding questions (sorry if that's a tad long).

Let's consider a counter c whose replicas are nodes A, B and C. Let's say that
we have updated the counter 3 times, with values 1, 2 and 3 respectively, and
with nodes A, B and C as the respective 'update leaders'.
The row for c (I don't consider marker columns here) will be the following
*on node A*:
{noformat}
  c : {
    <A ip address> : 1, (LocalCounterColumn)
    <B ip address> : 2, (CounterColumn)
    <C ip address> : 3, (CounterColumn)
  }
{noformat}
and on *node B*, the row for c will be:
{noformat}
  c : {
    <A ip address> : 1, (CounterColumn)
    <B ip address> : 2, (LocalCounterColumn)
    <C ip address> : 3, (CounterColumn)
  }
{noformat}
In parentheses is the actual class implementing each column. Note that on each
node, the column named after the node's own id is special. The difference is
that when a LocalCounterColumn c1 conflicts with another LocalCounterColumn c2,
we resolve this by returning a new LocalCounterColumn c3 whose value is
c1.value() + c2.value() (and whose timestamp is the max of the c1 and c2
timestamps). CounterColumn, in contrast, has the exact same resolution as a
standard column (that is, if two CounterColumns conflict, the result is the one
with the higher timestamp).
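
As a minimal sketch of the two resolution rules above (this is not the code
from the patch: the Column class, its 'local' flag and the reconcile method are
hypothetical simplifications, and the mixed local/non-local case is
deliberately left out because it is not described here):

```java
final class CounterReconcileSketch {

    /** Simplified stand-in for a column; 'local' marks a LocalCounterColumn. */
    static final class Column {
        final String name;
        final long value;
        final long timestamp;
        final boolean local;

        Column(String name, long value, long timestamp, boolean local) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
            this.local = local;
        }
    }

    /** Resolves a conflict between two columns of the same name and kind. */
    static Column reconcile(Column c1, Column c2) {
        if (c1.local != c2.local) {
            // Not covered by this sketch (assumption: handled elsewhere).
            throw new IllegalArgumentException("mixed local/non-local conflict");
        }
        if (c1.local) {
            // LocalCounterColumn: sum the values, keep the max timestamp.
            return new Column(c1.name, c1.value + c2.value,
                              Math.max(c1.timestamp, c2.timestamp), true);
        }
        // CounterColumn: standard resolution, the higher timestamp wins.
        // Keeping c1 on a tie is an arbitrary choice made for this sketch.
        return c2.timestamp > c1.timestamp ? c2 : c1;
    }

    public static void main(String[] args) {
        Column merged = reconcile(new Column("A", 1, 10, true),
                                  new Column("A", 4, 12, true));
        System.out.println(merged.value + " @ " + merged.timestamp);
    }
}
```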

So, to answer the question about serializing the writes: there is no need (and
I believe that's a good thing performance-wise). When a leader receives an
update, it doesn't read-then-write. It writes-then-reads. As part of the
read, the newly inserted LocalCounterColumn will be 'merged' with the other,
already present LocalCounterColumn and yield the actual value of the column,
without the risk of losing an increment.
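
To illustrate why write-then-read needs no serialization of writes, here is a
rough sketch. It assumes a hypothetical in-memory store in which every write
appends one delta, standing in for an inserted LocalCounterColumn; the class
and method names are made up for the example and are not taken from the patch:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class WriteThenReadSketch {

    // Each entry stands in for one inserted LocalCounterColumn delta.
    private final Queue<Long> localDeltas = new ConcurrentLinkedQueue<>();

    /** Write path: append the delta blindly, with no prior read and no lock. */
    void write(long delta) {
        localDeltas.add(delta);
    }

    /** Read path: merge all local deltas by summing them. */
    long read() {
        long sum = 0;
        for (long delta : localDeltas) {
            sum += delta;
        }
        return sum;
    }

    /** A leader handling an update: write first, then read the merged value. */
    long applyUpdate(long delta) {
        write(delta);
        return read();
    }

    public static void main(String[] args) {
        WriteThenReadSketch leader = new WriteThenReadSketch();
        leader.applyUpdate(1);
        System.out.println(leader.applyUpdate(4));
    }
}
```

Because a write never depends on a previously read value, two concurrent
updates cannot overwrite each other; the merge on read accounts for both.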

But now, we see that the data is not exactly mirrored across the nodes. In
particular, there is one thing that we must absolutely avoid: we should never
have a repair operation (read repair or AE repair) that inserts on node A a
LocalCounterColumn whose name is <A ip address> (otherwise, this would get
added to the actual value and screw up the total counter value). Another way
to say this is that the value of the column <A ip address> on node A is always
equal to the sum of all the updates A has led, and we are sure of that. So we
need not repair the value of this column on node A and we *must never do it*.
Moreover, when A sends its part of the value to B, it sends a
LocalCounterColumn, but when received by B (or any other host for that
matter), it should become a CounterColumn.

The implementation enforces this in the ColumnSerializer, during
deserialization. When a node deserializes a (serialized) LocalCounterColumn, it
will always deserialize it as a CounterColumn unless it is the node's own
LocalCounterColumn. So when A sends its LocalCounterColumn to B (for a read
repair, say), B will deserialize it as a CounterColumn. If B now sends this
back to A, A will receive a CounterColumn for its local counter column and
will discard it. So, because we ensure that a host other than A will
never 'see' a LocalCounterColumn whose name is <A ip address> (though it will
see a CounterColumn of that name), we know that we will never wrongfully repair
the local counter of A.
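
The rule can be sketched as follows. This is an illustration only, not the
actual ColumnSerializer code: the Kind enum and both method names are
hypothetical, and node ids are plain strings standing in for ip addresses:

```java
final class DeserializationSketch {

    enum Kind { LOCAL_COUNTER, COUNTER }

    /**
     * Kind a node assigns to a column on deserialization: a serialized
     * LocalCounterColumn stays local only on the node it is named after.
     */
    static Kind kindOnDeserialize(Kind serializedKind, String columnName,
                                  String localNodeId) {
        if (serializedKind == Kind.LOCAL_COUNTER
                && columnName.equals(localNodeId)) {
            return Kind.LOCAL_COUNTER;
        }
        return Kind.COUNTER;
    }

    /** A node discards an incoming CounterColumn named after itself. */
    static boolean discardOnReceive(Kind kind, String columnName,
                                    String localNodeId) {
        return kind == Kind.COUNTER && columnName.equals(localNodeId);
    }

    public static void main(String[] args) {
        // A sends its LocalCounterColumn to B: B sees a CounterColumn.
        Kind onB = kindOnDeserialize(Kind.LOCAL_COUNTER, "A", "B");
        // B sends it back to A: still a CounterColumn, so A discards it.
        Kind backOnA = kindOnDeserialize(onB, "A", "A");
        System.out.println(onB + " " + backOnA + " "
                + discardOnReceive(backOnA, "A", "A"));
    }
}
```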

During AE repair, because we use streaming, we could end up with an SSTable on
B having a LocalCounterColumn of name <A ip address>. However, as soon as this
column is deserialized, it is deserialized as a CounterColumn. So here again,
we will not wrongfully repair A.
Unless ... we stream back the exact same sstable to A. But I think this can
never happen (could anybody more familiar with AE repair and streaming
confirm?).

> (Yet another) approach to counting
> ----------------------------------
>
>                 Key: CASSANDRA-1546
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1546
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 0.7.0
>
>         Attachments: 0001-Remove-IClock-from-internals.patch, 
> 0002-Counters.patch, 0003-Generated-thrift-files-changes.patch
>
>
> This could be described as a mix between CASSANDRA-1072 without clocks and 
> CASSANDRA-1421.
> More details in the comment below.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
