Thanks for the advice, I'll definitely take a look at how Spark works and how it can help with counting.
One last question: my current implementation of counting is 1) increment the counter, 2) read the counter immediately after the write, 3) write the counts to multiple tables for different query paths and to Solr. If I switch to Spark, do I still need to use a counter, or will the counting be done by Spark on a regular table?

On Tue, Dec 23, 2014 at 11:31 AM, Ryan Svihla <rsvi...@datastax.com> wrote:

> An increment wouldn't be idempotent from the client unless you knew the
> count at the time of the update (which you could do with LWT, but that has
> pretty harsh performance). That particular JIRA is about how counters are
> laid out and about avoiding race conditions between nodes, which was
> resolved in 2.1 beta 1 (now officially out in the 2.1.x branch).
>
> General improvements to counters in 2.1 are laid out here:
> http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters
>
> As for best practice, the answer is multiple tables for multiple query
> paths, or you can use something like Solr or Spark. Take a look at the
> Spark Cassandra connector for a good way to count on lots of data from
> lots of different query paths:
> https://github.com/datastax/spark-cassandra-connector
>
> On Mon, Dec 22, 2014 at 9:22 PM, ziju feng <pkdog...@gmail.com> wrote:
>
>> I just skimmed through the JIRA
>> <https://issues.apache.org/jira/browse/CASSANDRA-4775> and it seems
>> there has been some effort to make updates idempotent. Perhaps the
>> problem can be fixed in the near future?
>>
>> Anyway, what is the current best practice for such a use case (counting
>> and displaying counts in different queries)? I don't need a 100% accurate
>> count or strong consistency. Performance and application complexity are
>> my main concerns.
>>
>> Thanks
>>
>> On Mon, Dec 22, 2014 at 10:37 PM, Ryan Svihla <rsvi...@datastax.com>
>> wrote:
>>
>>> You can cheat it by using the non-counter column as part of your primary
>>> key (the clustering column specifically), but the cases where this could
>>> work are limited, and the places where it is a good idea are even rarer.
>>>
>>> As for using counters in batches: batches are already a not-well-regarded
>>> concept, and counter batches have a number of troubling behaviors. As
>>> already stated, increments aren't idempotent, and batch implies retry.
>>>
>>> As for DSE Search, it's doing something drastically different internally,
>>> and the type of counting it does is many orders of magnitude faster
>>> (think bitmask-style matching plus proper async 2i to minimize fan-out
>>> cost).
>>>
>>> Generally speaking, counting accurately while being highly available
>>> creates an interesting set of logical tradeoffs. For example, what do
>>> you do if you're not able to communicate between two data centers, but
>>> both are up and serving "likes" quite happily? Is your counting down? Do
>>> you keep counting but serve up different answers? More accurately, since
>>> problems are rarely data center to data center but more frequently
>>> between replicas, how much availability are you willing to give up in
>>> exchange for a globally accurate count?
>>>
>>> On Dec 22, 2014 6:00 AM, "DuyHai Doan" <doanduy...@gmail.com> wrote:
>>>
>>>> It's not possible to mix counter and non-counter columns because the
>>>> current semantic of counters is increment/decrement only (thus NOT
>>>> idempotent), which requires some special handling compared to other C*
>>>> columns.
>>>>
>>>> On Mon, Dec 22, 2014 at 11:33 AM, ziju feng <pkdog...@gmail.com> wrote:
>>>>
>>>>> I was wondering if there is a plan to allow creating counter columns
>>>>> and standard columns in the same table.
>>>>>
>>>>> Here is my use case:
>>>>> I want to use a counter to count how many users like a given item in
>>>>> my application.
>>>>> The like count needs to be returned along with the details of the
>>>>> item in queries. To support querying items in different ways, I use
>>>>> both application-maintained denormalized index tables and DSE Search
>>>>> for indexing (DSE Search is also used for text search).
>>>>>
>>>>> Since the current counter implementation doesn't allow having counter
>>>>> columns and non-counter columns in the same table, I have to propagate
>>>>> the current count from the counter table to the main item table and
>>>>> the index tables, so that like counts can be returned by those index
>>>>> tables without sending extra requests to the counter table, and so
>>>>> that DSE Search can build an index on the like-count column in the
>>>>> main item table to support like-count-related queries (such as sorting
>>>>> by like count).
>>>>>
>>>>> IMHO, the only way to sync data between a counter table and a normal
>>>>> table within a reasonable time (sub-second) currently is to read the
>>>>> current value from the counter table right after the update. However,
>>>>> this suffers from several issues:
>>>>> 1. Read-after-write may not return the correct count when the
>>>>> replication factor is > 1, unless consistency level ALL/LOCAL_ALL is
>>>>> used.
>>>>> 2. There are two extra non-parallelizable round trips between the
>>>>> application server and Cassandra, which can have a great impact on
>>>>> performance.
>>>>>
>>>>> If it were possible to store counters in a standard column family,
>>>>> only one write would be needed to update the like count in the main
>>>>> table. The counter value would also be eventually synced between
>>>>> replicas, so there would be no need for the application to use an
>>>>> extra mechanism, like a scheduled task, to get the correct counts.
>>>>>
>>>>> A related issue is lifting the limitation of not allowing counter
>>>>> columns and normal columns to be updated in one batch, since it is
>>>>> quite common to not only have a counter for statistics but also store
>>>>> the details, such as the relation of which user likes which items in
>>>>> my use case.
>>>>>
>>>>> Any ideas?

--

Ryan Svihla
Solution Architect, DataStax
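[Editor's note: the non-idempotence at the heart of this thread can be sketched in a few lines. This is a toy model, not Cassandra internals: `CounterTable`, `increment`, and `compare_and_set` are illustrative names, and the "timeout then retry" scenario is simulated in-process. It shows why a blind counter increment double-counts when a timed-out write actually succeeded and is retried, while an LWT-style conditional write can be retried safely (at the cost, in Cassandra, of a Paxos round).]

```python
# Toy model of why counter increments are not idempotent under retry.
# All names here are illustrative; this does not talk to Cassandra.

class CounterTable:
    def __init__(self):
        self.count = 0

    def increment(self):
        # Blind increment: applying it a second time changes the result,
        # so a retry after a lost acknowledgment double-counts.
        self.count += 1

    def compare_and_set(self, expected, new):
        # LWT-style conditional write: once applied, retrying with the
        # same arguments fails the condition and is a no-op (idempotent).
        if self.count == expected:
            self.count = new
            return True
        return False

# Scenario 1: the first increment succeeds server-side, but the client
# sees a timeout and retries -- the count is now off by one.
t = CounterTable()
t.increment()          # succeeded, but the ack was lost
t.increment()          # client retry double-counts
print(t.count)         # 2, though the user only "liked" once

# Scenario 2: with compare-and-set, the retry detects the value has
# already moved on and does nothing.
t2 = CounterTable()
t2.compare_and_set(0, 1)   # applied
t2.compare_and_set(0, 1)   # retry: condition fails, no double count
print(t2.count)            # 1
```

This is the tradeoff Ryan alludes to: the conditional form is safe to retry but requires knowing the expected value and paying for a consensus round per update, which is why batch-plus-retry semantics and counters sit so uneasily together.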