[ 
https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700953#comment-13700953
 ] 

Nicolas Favre-Felix commented on CASSANDRA-4775:
------------------------------------------------

A few comments on the design posted above in a GitHub gist:

* The "time" part of the client-provided TimeUUID is now compared to the 
server's timestamp in the test "if(time(update-timeuuid) < now() - 
counter_write_window)". This is not ideal in my opinion, but I guess Cassandra 
is now using "real" timestamps a lot more than it used to. In any case, an 
"old" delta could also fall behind a "merge" cell and be ignored on read.
* Having "merge cells" means that we could support both TTLs and "put" 
operations on counters, as long as the semantics are well defined.
* Could counter merges happen in the background at ALL since most reads will 
receive responses from all replicas anyway, or would we always require QUORUM 
writes? This could be too restrictive in multi-DC deployments where most people 
probably prefer to read and write at LOCAL_QUORUM.


As described above, finding a "merge point" at which we could roll-up deltas 
involves either QUORUM reads + QUORUM writes or a read at ALL. This is 
necessary since we need a majority of replicas to persist the merge cell. We 
could consider this "set of deltas" that make up a counter to be merged at 
different levels, though. When this set is common to all replicas (as is 
proposed above), we can only merge in QUORUM reads if we can _guarantee_ quorum 
writes or must merge in reads at ALL otherwise.

If, instead, we shard this "set of deltas" among replicas with a distribution 
scheme resembling the existing implementation, each replica becomes the 
"coordinator" for a fraction of deltas and can (on its own) merge the 
increments for which it is responsible and issue "merge cells" to replace them. 
It becomes possible to read and write at LOCAL_QUORUM using this scheme. As any 
particular replica is the only source of truth for the subset of deltas that it 
was assigned, it does _by definition_ read ALL of its deltas and can sum them 
up with no risk of inconsistency. When these cells are node-local with a single 
source of truth, they can be merged by their owner and a merge cell replicated 
easily.
The main issue with this implementation is the choice of the coordinator node 
for an increment operation: if we assign a replica at random, retrying would 
lead to duplicates; if we assign a replica deterministically (based on the 
operation's UUID for example) we risk not being able to write to the counter if 
that particular replica goes down.

I'd like to propose a solution that lies between merging counters across the 
whole cluster and merging counters in each individual replica:
We can shard counters based on the datacenter, and roll-up these UUIDs per DC. 
In that case, the scope of the set of replicas involved in merging deltas 
together is therfore limited to the replicas of the local DC, which (once 
again) can merge deltas by either getting involved at W.QUORUM+R.QUORUM or 
W.anything+R.ALL.
A configuration flag per counter CF would configure whether we require 
W.QUORUM+R.QUORUM (default) or let clients write with any CL with the downside 
that deltas can only be merged at CL.ALL.
The same issue with retries applies here, albeit at a different level: a 
particular operation can only be retried safely if it sent to the same 
datacenter, which seems reasonable.

I believe that the space and time overheads are about the same as in Aleksey's 
design.

Suggestions and ideas much welcome.

                
> Counters 2.0
> ------------
>
>                 Key: CASSANDRA-4775
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Arya Goudarzi
>            Assignee: Aleksey Yeschenko
>              Labels: counters
>             Fix For: 2.1
>
>
> The existing partitioned counters remain a source of frustration for most 
> users almost two years after being introduced.  The remaining problems are 
> inherent in the design, not something that can be fixed given enough 
> time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - less special cases in the code
> - potential for a retry mechanism

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to