[ https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613693#comment-13613693 ]
Sylvain Lebresne commented on CASSANDRA-4775:
---------------------------------------------

For the record, I'd like to "quickly" sum up the problems of counters 1.0, and more precisely what I think are problems inherent to the design versus what I believe might be fixable with some effort.

The current counter implementation is based on the idea of internally keeping one separate sub-counter (or "shard") for each replica of the counter, and making sure that for each increment, one and only one shard is ever incremented. The latter is ensured by the special write path of counters, which:
* picks a live replica and forwards it the increment
* has that replica increment its own "shard" locally
* then has the replica send the *result* of this local shard increment to the other replicas

This mechanism has (at least) the following problems:
# counters cannot be retried safely on timeout.
# removing counters works only halfway: if you re-increment a deleted counter too soon, the result is somewhat random.

Those problems are largely due to the general mechanism used, not to implementation details. That being said, on the retry problem, I'll note that while I don't think we can fix it within the current mechanism, tickets like CASSANDRA-3199 could mitigate it somewhat by making TimeoutException less likely.

Other problems are more due to how the implementation works. More precisely, they are due to how a replica proceeds to increment its own shard. To do that, the implementation uses separate merging rules for "local" shards and "remote" ones. Namely, local shards are summed during merge (so the sub-count they contain is used as a delta), while for remote ones, the "biggest" value is kept (where "biggest" means "the one with the biggest clock"). So for remote shards, conflicts are handled as "latest wins" as usual.
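The two merge rules above can be sketched as follows. This is a minimal illustration, not Cassandra's actual code: the `Shard` and `ShardMerge` names are hypothetical, and a real shard also carries a replica id, which is omitted here.

```java
// Hypothetical sketch of the two merge rules described above:
// local shards are summed as deltas, remote shards are resolved
// by keeping the value with the biggest clock ("latest wins").
public final class ShardMerge {
    // A counter shard: a logical clock, a count, and whether it is local.
    public record Shard(long clock, long count, boolean local) {}

    // Merge two shards belonging to the same replica.
    public static Shard merge(Shard a, Shard b) {
        if (a.local() && b.local()) {
            // Local shards act as deltas: sum clocks and counts.
            return new Shard(a.clock() + b.clock(), a.count() + b.count(), true);
        }
        // Remote shards: keep the one with the biggest clock.
        return a.clock() >= b.clock() ? a : b;
    }
}
```

Note how the local rule makes duplicating a local shard dangerous (it would be summed twice), while the remote rule is idempotent, which is exactly the over-count hazard discussed below.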
The reason for that difference between local and remote shards is performance: when a replica needs to increment its shard, it needs to do so "atomically". If local shards were handled like remote ones, then to increment the local shard we would need to 1) grab a lock, 2) read the current value, 3) increment it, 4) write it, and 5) release the lock. Instead, in the current implementation, the replica just inserts an increment to its own shard, and the total value of the local shard is obtained by merging those increments on reads. In practice, what we win is that we don't have to grab a lock.

However, I believe that "implementation detail" is responsible for a fair amount of the pain counters cause. In particular, it complicates the implementation substantially because:
* a local shard on one replica is a remote shard on another replica. We handle this by transforming shards during deserialization, which is complex and fragile. It's also the source of CASSANDRA-4071 (and at least one contributor to CASSANDRA-4417).
* we have to be extremely careful not to duplicate a local shard internally or we'll over-count. The storage engine having been initially designed with the idea that using the same column twice was harmless, this has led to a number of bugs.

We could change that "implementation detail": stop distinguishing the merge rules for local shards, and when a replica needs to increment its shard, have it read/increment/write while holding a lock to ensure atomicity. This would likely simplify the implementation and fix CASSANDRA-4071 and CASSANDRA-4417. Of course, this would still not fix the other top-level problems (not being able to retry, broken removes, ...).
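The proposed alternative, reading, incrementing, and writing the shard under a lock so local and remote shards can share one merge rule, could look roughly like this. Again a hypothetical sketch (the `LockedShard` name and fields are illustrative, not Cassandra's), using a plain `ReentrantLock` to stand in for whatever synchronization the real storage engine would use:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the proposed change: the replica performs
// steps 1-5 from the text (lock, read, increment, write, unlock),
// so the shard always stores a total rather than a delta.
public final class LockedShard {
    private final ReentrantLock lock = new ReentrantLock();
    private long clock = 0;
    private long count = 0;

    // Atomically apply a delta and return the new total.
    public long increment(long delta) {
        lock.lock();                 // 1) grab the lock
        try {
            long current = count;    // 2) read the current value
            current += delta;        // 3) increment it
            count = current;         // 4) write it back
            clock += 1;              //    bump the clock so "biggest clock wins" works
            return count;
        } finally {
            lock.unlock();           // 5) release the lock
        }
    }
}
```

Because the stored value is now a total with a monotonically increasing clock, the remote "biggest clock wins" rule can be applied to local shards too, at the cost of taking the lock on every increment.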
> Counters 2.0
> ------------
>
>                 Key: CASSANDRA-4775
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Arya Goudarzi
>              Labels: counters
>             Fix For: 2.0
>
>
> The existing partitioned counters remain a source of frustration for most
> users almost two years after being introduced. The remaining problems are
> inherent in the design, not something that can be fixed given enough
> time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - less special cases in the code
> - potential for a retry mechanism