[ 
https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782998#comment-13782998
 ] 

Nicolas Favre-Felix commented on CASSANDRA-4775:
------------------------------------------------

I realize that there hasn't been much progress on this ticket during the 
summer. Following [~jbellis]'s call for contributors on cassandra-dev, we 
(Acunu) are willing to dedicate more time and resources to this particular 
ticket to see it implemented in the near future.

[~iamaleksey], thanks for your comments and critique; I would like to summarize 
the above and hopefully move this discussion towards a design that we can agree 
on.

As you pointed out, the set-based approach has some significant drawbacks:
* The design is complex and the distributed problem of merging deltas pushes 
some of this complexity to the client side.
* Reads are always more expensive as all replicas need to return the set of 
deltas they know about.
* The unpredictability in the distribution of read latencies is a serious issue.

The set-based design also has two advantages, speed (skipping a random read) 
and idempotence; I'd like to address these two below.

Although the read on the write path is an anti-pattern for Cassandra, it serves 
a useful purpose and ensures that the data sent from the coordinator to the 
replicas is always already summed up which means that reads stay predictably 
fast. The read during replicate_on_write is also more expensive than the one in 
a RMW design since we have to look up all the counter shards in (potentially) 
many sstables, and we have to do this every single time we increment a counter. 
RMW counters in Riak and HBase only need to read the latest value for the 
counter, and can immediately write it back to an in-memory data structure. I 
would also like to point out that reads will only become cheaper in the future 
as solid state drives become more commonplace, and not by a small amount. 
Designing a complex and unreliable solution to address a disappearing problem 
would be a mistake.

Idempotence is also a very useful property and the lack of safety for Cassandra 
counters is probably the first drawback that people mention when they describe 
Cassandra counters. Their reliability is questioned pretty often, and this 
criticism is not without merit. [~slebresne]'s suggestion of a retryable 
increment with a lock around the commit log entry is a great improvement over 
the current design, with the only limitation that the operation has to be 
commutative. I have written above that I had my doubts about the throughput of 
counters with such a lock, but I also recognize that there could be more 
optimisations as some suggested.

Internally, I would suggest the sharded counter design remain similar with one 
shard per replica for all commutative replicated data types; implementing 
counters as PN-counters is also a great way to introduce more general data 
types, something I believe would be much welcomed by the Cassandra community.

After this long discussion I believe the following design would be a viable 
alternative:
* Remove the replicate_on_write option and switch to a locked/retryable RMW.
* Provide a pluggable way to implement commutative types, with a PN-counter 
implementation.

This solution could also be migrated to from an existing deployment; the 
proposed set-based design would be too complex for that.

We are ready to work on the implementation as well in time for the target 
release of 2.1.

> Counters 2.0
> ------------
>
>                 Key: CASSANDRA-4775
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Arya Goudarzi
>            Assignee: Aleksey Yeschenko
>              Labels: counters
>             Fix For: 2.1
>
>
> The existing partitioned counters remain a source of frustration for most 
> users almost two years after being introduced.  The remaining problems are 
> inherent in the design, not something that can be fixed given enough 
> time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - less special cases in the code
> - potential for a retry mechanism



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to