Spark can count rows in a regular table. Spark SQL is most likely the
easiest way to get started.

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md

Scroll down to the Spark SQL section to get an idea of how easy it is to use.
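
To make the difference concrete, here is a rough sketch in plain Python (not the connector API). The set and the Counter are hypothetical stand-ins for a regular table keyed by (item_id, user_id) and for a counter table. It only illustrates why counting rows survives client retries while increments do not.

```python
from collections import Counter

# Hypothetical stand-in for a regular (non-counter) table with
# PRIMARY KEY (item_id, user_id): writing the same like twice is an upsert.
likes = set()

def record_like(item_id, user_id):
    likes.add((item_id, user_id))

# Hypothetical stand-in for a counter table: every call adds one.
counter_table = Counter()

def increment_like(item_id):
    counter_table[item_id] += 1

def count_likes():
    # The kind of aggregation a Spark job would run over the regular table,
    # roughly: SELECT item_id, COUNT(*) ... GROUP BY item_id.
    return Counter(item_id for item_id, _ in likes)

# Simulate a client that retries the same write after a timeout.
for _ in range(2):
    record_like("item1", "alice")
    increment_like("item1")
record_like("item1", "bob")
increment_like("item1")

print(count_likes()["item1"])   # 2: the retried like is counted once
print(counter_table["item1"])   # 3: the retried increment is counted twice
```

With the regular table, the retry is harmless and the count can be recomputed whenever you need it.
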
On Dec 22, 2014 10:00 PM, "ziju feng" <pkdog...@gmail.com> wrote:

> Thanks for the advice, I'll definitely take a look at how Spark works and
> how it can help with counting.
>
> One last question: My current implementation of counting is 1) increment
> the counter, 2) read the counter immediately after the write, 3) write the
> counts to multiple tables for different query paths and to Solr. If I switch
> to Spark, do I still need to use counters, or will the counting be done by
> Spark on a regular table?
>
> On Tue, Dec 23, 2014 at 11:31 AM, Ryan Svihla <rsvi...@datastax.com>
> wrote:
>
>> Increment wouldn't be idempotent from the client unless you knew the
>> count at the time of the update (which you could do with LWT, but that has
>> pretty harsh performance). That particular JIRA is about how counters are
>> laid out and avoiding race conditions between nodes, which was resolved in
>> 2.1 beta 1 (which is now officially out in the 2.1.x branch).
>>
>> General improvements to counters in 2.1 are laid out here:
>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-1-a-better-implementation-of-counters
>>
>> As for best practice, the answer is multiple tables for multiple query
>> paths, or you can use something like Solr or Spark. Take a look at the
>> Spark Cassandra connector for a good way to count on lots of data from lots
>> of different query paths:
>> https://github.com/datastax/spark-cassandra-connector
>>
>>
>>
>> On Mon, Dec 22, 2014 at 9:22 PM, ziju feng <pkdog...@gmail.com> wrote:
>>
>>> I just skimmed through JIRA
>>> <https://issues.apache.org/jira/browse/CASSANDRA-4775> and it seems
>>> there has been some effort to make updates idempotent. Perhaps the problem
>>> can be fixed in the near future?
>>>
>>> Anyway, what is the current best practice for such a use case (counting
>>> and displaying counts in different queries)? I don't need a 100% accurate
>>> count or strong consistency. Performance and application complexity are my
>>> main concerns.
>>>
>>> Thanks
>>>
>>> On Mon, Dec 22, 2014 at 10:37 PM, Ryan Svihla <rsvi...@datastax.com>
>>> wrote:
>>>
>>>> You can cheat by using the non-counter column as part of your
>>>> primary key (as a clustering column specifically), but the cases where this
>>>> could work are limited and the places where it is a good idea are even rarer.
>>>>
>>>> As for using counters in batches, batches are already a not well regarded
>>>> concept, and counter batches have a number of troubling behaviors: as
>>>> already stated, increments aren't idempotent and a batch implies retry.
>>>>
>>>> As for DSE Search, it's doing something drastically different internally,
>>>> and the type of counting it's doing is many orders of magnitude faster
>>>> (think bitmask-style matching + proper async 2i to minimize fanout cost).
>>>>
>>>> Generally speaking, counting accurately while being highly available
>>>> creates an interesting set of logical tradeoffs. For example, what do you do
>>>> if you're not able to communicate between two data centers, but both are up
>>>> and serving "likes" quite happily? Is your counting down? Do you keep
>>>> counting but serve up different answers? More accurately, since problems are
>>>> rarely data center to data center but more frequently between replicas, how
>>>> much availability are you willing to give up in exchange for a globally
>>>> accurate count?
>>>> On Dec 22, 2014 6:00 AM, "DuyHai Doan" <doanduy...@gmail.com> wrote:
>>>>
>>>>> It's not possible to mix counter and non-counter columns because
>>>>> currently the semantics of a counter are only increment/decrement (thus NOT
>>>>> idempotent) and counters require some special handling compared to other
>>>>> C* columns.
>>>>>
>>>>> On Mon, Dec 22, 2014 at 11:33 AM, ziju feng <pkdog...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I was wondering if there is a plan to allow creating counter columns
>>>>>> and standard columns in the same table.
>>>>>>
>>>>>> Here is my use case:
>>>>>> I want to use a counter to count how many users like a given item in my
>>>>>> application. The like count needs to be returned along with the details of
>>>>>> the item in a query. To support querying items in different ways, I use both
>>>>>> application-maintained denormalized index tables and DSE Search for
>>>>>> indexing. (DSE Search is also used for text searching.)
>>>>>>
>>>>>> Since the current counter implementation doesn't allow having counter
>>>>>> columns and non-counter columns in the same table, I have to propagate the
>>>>>> current count from the counter table to the main item table and the index
>>>>>> tables. That way, like counts can be returned by those index tables without
>>>>>> sending extra requests to the counter table, and DSE Search is able to build
>>>>>> an index on the like count column in the main item table to support like
>>>>>> count related queries (such as sorting by like count).
>>>>>>
>>>>>> IMHO, the only way to sync data between the counter table and a normal
>>>>>> table within a reasonable time (sub-second) currently is to read the
>>>>>> current value from the counter table right after the update. However, it
>>>>>> suffers from several issues:
>>>>>> 1. Read-after-write may not return the correct count when the replication
>>>>>> factor is > 1 unless consistency level ALL/LOCAL_ALL is used.
>>>>>> 2. There are two extra non-parallelizable round trips between the
>>>>>> application server and Cassandra, which can have a great impact on
>>>>>> performance.
>>>>>>
>>>>>> If it were possible to store a counter in a standard column family, only
>>>>>> one write would be needed to update the like count in the main table. The
>>>>>> counter value would also be eventually synced between replicas, so there
>>>>>> would be no need for the application to use an extra mechanism like a
>>>>>> scheduled task to get the correct counts.
>>>>>>
>>>>>> A related issue is lifting the limitation of not allowing updates to
>>>>>> counter columns and normal columns in one batch, since it is quite common
>>>>>> to not only have a counter for statistics but also store the details, such
>>>>>> as storing the relation of which user likes which items in my use case.
>>>>>>
>>>>>> Any idea?
>>>>>>
>>>>>>
>>>>>
>>>
>>
>>
>> --
>>
>> Ryan Svihla
>>
>> Solution Architect
>>
>
