[ https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613693#comment-13613693 ]
Sylvain Lebresne commented on CASSANDRA-4775:
---------------------------------------------

For the record, I'd like to "quickly" sum up the problems of counters 1.0, and more precisely what I think are problems inherent to the design versus what I believe might be fixable with some effort.

The current counter implementation is based on the idea of internally keeping one separate sub-counter (or "shard") for each replica of the counter, and making sure that for each increment, one and only one shard is ever incremented. The latter is ensured by the special write path of counters, which:
* picks a live replica and forwards it the increment
* has that replica increment its own "shard" locally
* then has the replica send the *result* of this local shard increment to the other replicas

This mechanism has (at least) the following problems:
# counters cannot be retried safely on timeout.
# removing counters works only halfway: if you re-increment a deleted counter too soon, the result is somewhat random.

Those problems are largely due to the general mechanism used, not to implementation details. That being said, on the retry problem, I'll note that while I don't think we can fix it within the current mechanism, tickets like CASSANDRA-3199 could mitigate it somewhat by making TimeoutException less likely.

Other problems are more due to how the implementation works. More precisely, they are due to how a replica proceeds to increment its own shard. To do that, the implementation uses separate merging rules for "local" shards and "remote" ones. Namely, local shards are summed during merge (so the sub-count they contain is used as a delta), while for remote ones, the "biggest" value is kept (where "biggest" means "the one with the biggest clock"). So for remote shards, conflicts are handled as "latest wins" as usual.
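The two merge rules above can be sketched as follows. This is a minimal illustration, not Cassandra's actual code: the `Shard` and `ShardMerge` names are hypothetical, and a real shard also carries a replica id, which is omitted here.

```java
// Hypothetical sketch of the two merge rules described above:
// local shards are summed as deltas, remote shards are resolved
// by keeping the value with the biggest clock ("latest wins").
public final class ShardMerge {
    // A counter shard: a logical clock, a count, and whether it is local.
    public record Shard(long clock, long count, boolean local) {}

    // Merge two shards belonging to the same replica.
    public static Shard merge(Shard a, Shard b) {
        if (a.local() && b.local()) {
            // Local shards act as deltas: sum clocks and counts.
            return new Shard(a.clock() + b.clock(), a.count() + b.count(), true);
        }
        // Remote shards: keep the one with the biggest clock.
        return a.clock() >= b.clock() ? a : b;
    }
}
```

Note how the local rule makes duplicating a local shard dangerous (it would be summed twice), while the remote rule is idempotent, which is exactly the over-count hazard discussed below.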
The reason for that difference between local and remote shards is performance: when a replica needs to increment its shard, it needs to do so "atomically". If local shards were handled like remote ones, then to increment the local shard we would need to 1) grab a lock, 2) read the current value, 3) increment it, 4) write it, and 5) release the lock. Instead, in the current implementation, the replica just inserts an increment to its own shard, and the total value of the local shard is obtained by merging those increments on reads. In practice, what we win is that we don't have to grab a lock.

However, I believe that "implementation detail" is responsible for a fair amount of the pain counters cause. In particular, it complicates the implementation substantially because:
* a local shard on one replica is a remote shard on another replica. We handle this by transforming shards during deserialization, which is complex and fragile. It's also the source of CASSANDRA-4071 (and at least one contributor to CASSANDRA-4417).
* we have to be extremely careful not to duplicate a local shard internally or we'll over-count. The storage engine having been initially designed with the idea that using the same column twice was harmless, this has led to a number of bugs.

We could change that "implementation detail": stop distinguishing the merge rules for local shards, and when a replica needs to increment its shard, have it read/increment/write while holding a lock to ensure atomicity. This would likely simplify the implementation and fix CASSANDRA-4071 and CASSANDRA-4417. Of course, this would still not fix the other top-level problems (not being able to retry, broken removes, ...).
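The proposed alternative, reading, incrementing, and writing the shard under a lock so local and remote shards can share one merge rule, could look roughly like this. Again a hypothetical sketch (the `LockedShard` name and fields are illustrative, not Cassandra's), using a plain `ReentrantLock` to stand in for whatever synchronization the real storage engine would use:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the proposed change: the replica performs
// steps 1-5 from the text (lock, read, increment, write, unlock),
// so the shard always stores a total rather than a delta.
public final class LockedShard {
    private final ReentrantLock lock = new ReentrantLock();
    private long clock = 0;
    private long count = 0;

    // Atomically apply a delta and return the new total.
    public long increment(long delta) {
        lock.lock();                 // 1) grab the lock
        try {
            long current = count;    // 2) read the current value
            current += delta;        // 3) increment it
            count = current;         // 4) write it back
            clock += 1;              //    bump the clock so "biggest clock wins" works
            return count;
        } finally {
            lock.unlock();           // 5) release the lock
        }
    }
}
```

Because the stored value is now a total with a monotonically increasing clock, the remote "biggest clock wins" rule can be applied to local shards too, at the cost of taking the lock on every increment.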
> Counters 2.0
> ------------
>
>                 Key: CASSANDRA-4775
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Arya Goudarzi
>              Labels: counters
>             Fix For: 2.0
>
>
> The existing partitioned counters remain a source of frustration for most
> users almost two years after being introduced. The remaining problems are
> inherent in the design, not something that can be fixed given enough
> time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - less special cases in the code
> - potential for a retry mechanism