Thanks for the advice, I'll definitely take a look at how Spark works and
how it can help with counting.

One last question: my current counting implementation is 1) increment the
counter, 2) read the counter immediately after the write, 3) write the count
to multiple tables for different query paths and to Solr. If I switch to
Spark, do I still need to use counters, or will the counting be done by Spark
on a regular table?
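
For reference, this is roughly what that flow looks like today with the Java
driver from Scala (just a rough sketch; keyspace, table, and column names are
placeholders, not my real schema):

    import java.util.UUID
    import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

    object CounterFlowSketch extends App {
      val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
      val session = cluster.connect("my_ks")
      val itemId  = UUID.randomUUID() // placeholder item

      // 1) increment the counter in its dedicated counter table
      session.execute(
        "UPDATE item_like_counts SET like_count = like_count + 1 WHERE item_id = ?", itemId)

      // 2) read it back immediately; a strong read CL is needed to reliably
      //    see our own write when RF > 1
      val read = new SimpleStatement(
        "SELECT like_count FROM item_like_counts WHERE item_id = ?", itemId)
      read.setConsistencyLevel(ConsistencyLevel.QUORUM)
      val count: java.lang.Long = session.execute(read).one().getLong("like_count")

      // 3) denormalize the count into the tables behind the different query
      //    paths (Solr indexes the like_count column on the main item table)
      session.execute("UPDATE items SET like_count = ? WHERE item_id = ?", count, itemId)
      session.execute(
        "UPDATE items_by_category SET like_count = ? WHERE category = ? AND item_id = ?",
        count, "books", itemId)

      cluster.close()
    }

The two extra round trips in steps 2 and 3 are exactly what I'd like to get
rid of.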

On Tue, Dec 23, 2014 at 11:31 AM, Ryan Svihla <rsvi...@datastax.com> wrote:

> increment wouldn't be idempotent from the client unless you knew the count
> at the time of the update (which you could do with LWT, but that has pretty
> harsh performance). That particular JIRA is about how counters are laid out
> and about avoiding race conditions between nodes, which was resolved in 2.1
> beta 1 (and is now officially out in the 2.1.x branch).
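>
> To be concrete about the LWT option: it only works against a plain bigint
> column, since IF conditions aren't allowed on counters. A rough sketch, with
> made-up table and column names:
>
>     import com.datastax.driver.core.Cluster
>
>     object LwtCountSketch extends App {
>       val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("my_ks")
>
>       // The client reads the current value, then writes expected + 1 guarded
>       // by IF like_count = expected. Retrying the exact same statement cannot
>       // double-apply, which is what makes it idempotent (at Paxos prices).
>       val expected: java.lang.Long = session
>         .execute("SELECT like_count FROM item_counts WHERE item_id = 42")
>         .one().getLong("like_count")
>       val rs = session.execute(
>         "UPDATE item_counts SET like_count = ? WHERE item_id = 42 IF like_count = ?",
>         java.lang.Long.valueOf(expected + 1), expected)
>       val applied = rs.one().getBool("[applied]") // false means someone else won the race
>     }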
>
> General improvements to counters in 2.1 are laid out here:
> http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters
>
> As for best practice, the answer is multiple tables for multiple query
> paths, or you can use something like Solr or Spark. Take a look at the
> Spark Cassandra connector for a good way to count over lots of data from
> lots of different query paths:
> https://github.com/datastax/spark-cassandra-connector.
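>
> A minimal sketch of that kind of count with the connector (1.1-era API;
> keyspace, table, and column names are invented):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.SparkContext._
>     import com.datastax.spark.connector._
>
>     object LikeCountJob extends App {
>       val conf = new SparkConf()
>         .setAppName("like-counts")
>         .set("spark.cassandra.connection.host", "127.0.0.1")
>       val sc = new SparkContext(conf)
>
>       // Count likes straight off a plain table with one row per (item_id, user_id)...
>       val totals = sc.cassandraTable("my_ks", "likes_by_item")
>         .map(row => (row.getUUID("item_id"), 1L))
>         .reduceByKey(_ + _)
>
>       // ...and write the totals back into an ordinary bigint column on the item table.
>       totals.saveToCassandra("my_ks", "items", SomeColumns("item_id", "like_count"))
>
>       sc.stop()
>     }
>
> The point being that the source table is a regular table with no counters at
> all; Spark just re-aggregates it whenever you need fresh totals.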
>
>
>
> On Mon, Dec 22, 2014 at 9:22 PM, ziju feng <pkdog...@gmail.com> wrote:
>
>> I just skimmed through the JIRA
>> <https://issues.apache.org/jira/browse/CASSANDRA-4775> and it seems
>> there has been some effort to make counter updates idempotent. Perhaps the
>> problem will be fixed in the near future?
>>
>> Anyway, what is the current best practice for such a use case (counting
>> and displaying counts in different queries)? I don't need a 100% accurate
>> count or strong consistency; performance and application complexity are my
>> main concerns.
>>
>> Thanks
>>
>> On Mon, Dec 22, 2014 at 10:37 PM, Ryan Svihla <rsvi...@datastax.com>
>> wrote:
>>
>>> You can cheat it by using the non-counter column as part of your primary
>>> key (a clustering column specifically), but the cases where this could work
>>> are limited and the places where this is a good idea are even rarer.
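>>>
>>> Roughly what that cheat looks like (a sketch with made-up names): the
>>> non-counter value is pushed into the clustering key, so the only regular
>>> column left is the counter.
>>>
>>>     import com.datastax.driver.core.Cluster
>>>
>>>     object ClusteringCheatSketch extends App {
>>>       val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("my_ks")
>>>
>>>       // item_name rides along in the primary key, so the table is still a
>>>       // legal counter table (every regular column is a counter).
>>>       session.execute("""
>>>         CREATE TABLE IF NOT EXISTS item_likes (
>>>           item_id    uuid,
>>>           item_name  text,
>>>           like_count counter,
>>>           PRIMARY KEY (item_id, item_name))""")
>>>
>>>       // The obvious downside: change item_name and you've silently started
>>>       // a brand new counter row.
>>>       session.execute(
>>>         "UPDATE item_likes SET like_count = like_count + 1 WHERE item_id = ? AND item_name = ?",
>>>         java.util.UUID.randomUUID(), "some item")
>>>     }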
>>>
>>> As for batches, counters in batches are already not a well-regarded
>>> concept, and counter batches have a number of troubling behaviors; as
>>> already stated, increments aren't idempotent and batch implies retry.
>>>
>>> As for DSE Search, it's doing something drastically different internally,
>>> and the type of counting it's doing is many orders of magnitude faster
>>> (think bitmask-style matching plus proper async 2i to minimize fanout cost).
>>>
>>> Generally speaking, counting accurately while being highly available
>>> creates an interesting set of logical tradeoffs. For example, what do you
>>> do if you're not able to communicate between two data centers, but both are
>>> up and serving "likes" quite happily? Is your counting down? Do you keep
>>> counting but serve up different answers? More realistically, since problems
>>> are rarely data center to data center but more frequently between replicas,
>>> how much availability are you willing to give up in exchange for a globally
>>> accurate count?
>>> On Dec 22, 2014 6:00 AM, "DuyHai Doan" <doanduy...@gmail.com> wrote:
>>>
>>>> It's not possible to mix counter and non-counter columns because
>>>> currently the semantics of counters are only increment/decrement (thus NOT
>>>> idempotent), and they require some special handling compared to other C*
>>>> columns.
>>>>
>>>> On Mon, Dec 22, 2014 at 11:33 AM, ziju feng <pkdog...@gmail.com> wrote:
>>>>
>>>>> I was wondering if there is a plan to allow creating counter columns
>>>>> and standard columns in the same table.
>>>>>
>>>>> Here is my use case:
>>>>> I want to use a counter to count how many users like a given item in my
>>>>> application. The like count needs to be returned along with the details of
>>>>> the item in queries. To support querying items in different ways, I use
>>>>> both application-maintained denormalized index tables and DSE Search for
>>>>> indexing. (DSE Search is also used for text search.)
>>>>>
>>>>> Since the current counter implementation doesn't allow counter columns
>>>>> and non-counter columns in the same table, I have to propagate the current
>>>>> count from the counter table to the main item table and the index tables,
>>>>> so that like counts can be returned by those index tables without sending
>>>>> extra requests to the counter table, and so that DSE Search can build an
>>>>> index on the like count column in the main item table to support
>>>>> like-count-related queries (such as sorting by like count).
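>>>>>
>>>>> To make the layout concrete, this is roughly the shape of the tables
>>>>> involved (simplified sketch, made-up names):
>>>>>
>>>>>     import com.datastax.driver.core.Cluster
>>>>>
>>>>>     object LikeCountSchemaSketch extends App {
>>>>>       val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("my_ks")
>>>>>
>>>>>       // The counter has to live in a table of its own...
>>>>>       session.execute("""
>>>>>         CREATE TABLE IF NOT EXISTS item_like_counts (
>>>>>           item_id    uuid PRIMARY KEY,
>>>>>           like_count counter)""")
>>>>>
>>>>>       // ...while the main item table (which DSE Search indexes) and the
>>>>>       // denormalized index tables hold a plain bigint copy of the count.
>>>>>       session.execute("""
>>>>>         CREATE TABLE IF NOT EXISTS items (
>>>>>           item_id    uuid PRIMARY KEY,
>>>>>           name       text,
>>>>>           like_count bigint)""")
>>>>>     }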
>>>>>
>>>>> IMHO, the only way to sync data between the counter table and a normal
>>>>> table within a reasonable time (sub-second) currently is to read the
>>>>> current value from the counter table right after the update. However, it
>>>>> suffers from several issues:
>>>>> 1. Read-after-write may not return the correct count when the replication
>>>>> factor is > 1, unless a strong consistency level is used (e.g. ALL, or
>>>>> QUORUM for both the write and the read).
>>>>> 2. There are two extra non-parallelizable round trips between the
>>>>> application server and Cassandra, which can have a great impact on
>>>>> performance.
>>>>>
>>>>> If it were possible to store a counter in a standard column family, only
>>>>> one write would be needed to update the like count in the main table. The
>>>>> counter value would also be eventually synced between replicas, so the
>>>>> application would not need an extra mechanism, such as a scheduled task,
>>>>> to get the correct counts.
>>>>>
>>>>> A related issue is lifting the limitation that counter columns and
>>>>> normal columns cannot be updated in one batch, since it is quite common
>>>>> not only to keep a counter for statistics but also to store the details,
>>>>> such as the relation of which user likes which item, as in my use case.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>>
>>>>
>>
>
>
> --
>
> DataStax <http://www.datastax.com/>
>
> Ryan Svihla
>
> Solution Architect
>
> Twitter: <https://twitter.com/foundev> | LinkedIn:
> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>
>
>
