[ https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700953#comment-13700953 ]
Nicolas Favre-Felix commented on CASSANDRA-4775:
------------------------------------------------

A few comments on the design posted above in a GitHub gist:

* The "time" part of the client-provided TimeUUID is now compared to the server's timestamp in the test "if (time(update-timeuuid) < now() - counter_write_window)". This is not ideal in my opinion, but I guess Cassandra now uses "real" timestamps a lot more than it used to. In any case, an "old" delta could also fall behind a "merge" cell and be ignored on read.
* Having "merge cells" means that we could support both TTLs and "put" operations on counters, as long as the semantics are well defined.
* Could counter merges happen in the background at ALL, since most reads will receive responses from all replicas anyway, or would we always require QUORUM writes? This could be too restrictive in multi-DC deployments, where most people probably prefer to read and write at LOCAL_QUORUM.

As described above, finding a "merge point" at which we could roll up deltas involves either QUORUM reads + QUORUM writes, or a read at ALL. This is necessary because a majority of replicas needs to persist the merge cell.

We could, however, consider merging the "set of deltas" that makes up a counter at different levels. When this set is common to all replicas (as proposed above), we can only merge during QUORUM reads if we can _guarantee_ QUORUM writes, and must otherwise merge during reads at ALL. If, instead, we shard this "set of deltas" among replicas with a distribution scheme resembling the existing implementation, each replica becomes the "coordinator" for a fraction of the deltas and can, on its own, merge the increments for which it is responsible and issue "merge cells" to replace them. Reading and writing at LOCAL_QUORUM becomes possible under this scheme.
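For illustration, the write-window test quoted in the first bullet could look like the following sketch. This is not the gist's actual code; the class, method names, and the window parameter are hypothetical, and only the standard version-1 UUID timestamp encoding is assumed.

```java
import java.util.UUID;

public class CounterWriteWindow {
    // Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100-ns units.
    static final long UUID_EPOCH_OFFSET = 0x01b21dd213814000L;

    // Extract the Unix-epoch milliseconds encoded in a version-1 (time-based) UUID.
    static long unixMillisOf(UUID timeUuid) {
        if (timeUuid.version() != 1)
            throw new IllegalArgumentException("not a TimeUUID: " + timeUuid);
        return (timeUuid.timestamp() - UUID_EPOCH_OFFSET) / 10_000;
    }

    // The test from the gist: a delta whose TimeUUID is older than the write
    // window may already have fallen behind a merge cell and be ignored on read.
    static boolean outsideWriteWindow(UUID updateTimeUuid, long nowMillis, long windowMillis) {
        return unixMillisOf(updateTimeUuid) < nowMillis - windowMillis;
    }

    // Build a version-1 UUID for a given Unix-epoch millisecond timestamp
    // (clock sequence and node are dummies; only the time part matters here).
    static UUID timeUuidFor(long unixMillis) {
        long ts = unixMillis * 10_000 + UUID_EPOCH_OFFSET;
        long msb = ((ts & 0xFFFFFFFFL) << 32)          // time_low
                 | (((ts >>> 32) & 0xFFFFL) << 16)     // time_mid
                 | 0x1000L                             // version 1
                 | ((ts >>> 48) & 0x0FFFL);            // time_hi
        long lsb = 0x8000000000000000L;                // IETF variant, zero clock-seq/node
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(outsideWriteWindow(timeUuidFor(now), now, 30_000));           // fresh delta
        System.out.println(outsideWriteWindow(timeUuidFor(now - 60_000), now, 30_000));  // stale delta
    }
}
```

The point of the test is visible here: the client, not the server, chose the TimeUUID, so a delayed or retried write can legitimately arrive with a timestamp older than the window and land behind an already-issued merge cell.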
As any particular replica is the only source of truth for the subset of deltas it was assigned, it does _by definition_ read ALL of its deltas and can sum them up with no risk of inconsistency. When these cells are node-local with a single source of truth, they can be merged by their owner and the resulting merge cell replicated easily.

The main issue with this implementation is the choice of the coordinator node for an increment operation: if we assign a replica at random, retrying would lead to duplicates; if we assign a replica deterministically (based on the operation's UUID, for example), we risk not being able to write to the counter if that particular replica goes down.

I'd like to propose a solution that lies between merging counters across the whole cluster and merging counters in each individual replica: we can shard counters based on the datacenter, and roll up these UUIDs per DC. In that case, the set of replicas involved in merging deltas together is therefore limited to the replicas of the local DC, which (once again) can merge deltas by getting involved either at W.QUORUM+R.QUORUM or at W.anything+R.ALL. A per-counter-CF configuration flag would control whether we require W.QUORUM+R.QUORUM (the default) or let clients write at any CL, with the downside that deltas can then only be merged at CL.ALL.

The same issue with retries applies here, albeit at a different level: a particular operation can only be retried safely if it is sent to the same datacenter, which seems reasonable.

I believe that the space and time overheads are about the same as in Aleksey's design. Suggestions and ideas are most welcome.
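The deterministic-coordinator option discussed above can be sketched as follows. This is a hypothetical illustration, not proposed Cassandra code; the class, method, and replica names are invented, and a simple hash-modulo assignment stands in for whatever distribution scheme would actually be used.

```java
import java.util.List;
import java.util.UUID;

public class DeltaShardRouting {
    // Hypothetical sketch: deterministically map each increment's operation
    // UUID to one replica, which becomes the sole owner of that delta and the
    // only node allowed to merge it into a merge cell.
    static String ownerOf(UUID operationId, List<String> replicas) {
        int idx = Math.floorMod(operationId.hashCode(), replicas.size());
        return replicas.get(idx);
    }

    public static void main(String[] args) {
        // With per-DC sharding, this list would hold only the local DC's replicas.
        List<String> dcReplicas = List.of("replica-1", "replica-2", "replica-3");
        UUID op = UUID.randomUUID();
        // A retried operation carries the same UUID, so it routes to the same
        // owner and cannot create a duplicate delta...
        System.out.println(ownerOf(op, dcReplicas).equals(ownerOf(op, dcReplicas)));
        // ...but if that owner is down, the write cannot be accepted: the
        // availability trade-off described above.
    }
}
```

Restricting the replica list to the local DC is what makes the per-DC variant retry-safe only within a datacenter: the same operation UUID resent to a different DC would map into a different replica set and create a second delta.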
> Counters 2.0
> ------------
>
>                 Key: CASSANDRA-4775
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Arya Goudarzi
>            Assignee: Aleksey Yeschenko
>              Labels: counters
>             Fix For: 2.1
>
>
> The existing partitioned counters remain a source of frustration for most
> users almost two years after being introduced. The remaining problems are
> inherent in the design, not something that can be fixed given enough
> time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - less special cases in the code
> - potential for a retry mechanism

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira