We do get a large number of documents getting counts each day, which is why I'm 
thinking the running totals table should be ((doc_id), day) rather than ((day), 
doc_id). We have too many documents per day to materialize in memory, so 
querying per day and aggregating the results isn't really possible.
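For reference, here's roughly the shape I have in mind (a sketch only; the 
names are placeholders and encoding the day as an integer day number is just 
one option):

    CREATE TABLE running_totals (
        doc_id bigint,
        day    int,        -- e.g. days since epoch
        count  counter,
        PRIMARY KEY ((doc_id), day)
    );

    -- One increment per view:
    UPDATE running_totals SET count = count + 1 WHERE doc_id = ? AND day = ?;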

I’m planning on bucketing the materialized ordering because we get enough 
unique document views per day that the rows will be quite large. Not so large 
as to be unmanageable, but pushing the limits. If we were so lucky as to get a 
significant increase in traffic, I might start having issues. I didn’t include 
bucketing in my post because I didn’t want to complicate my question. I hadn’t 
considered that I could bucket by hour and then use a local midnight instead of 
a global midnight. Interesting idea.
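Concretely, the bucketed ordering table I'm picturing is something like this 
(placeholder names; the bucket function and bucket count are arbitrary 
choices):

    CREATE TABLE doc_counts (
        day    int,
        bucket int,       -- e.g. doc_id % 16, to split a day across partitions
        count  bigint,
        doc_id bigint,
        PRIMARY KEY ((day, bucket), count, doc_id)
    ) WITH CLUSTERING ORDER BY (count DESC, doc_id ASC);

Reading the top N would then mean querying each bucket for the day and merging 
the results client-side.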

Thanks for your response.

Robert

On Nov 24, 2014, at 9:40 AM, Eric Stevens <migh...@gmail.com> wrote:

You're right that there's no way to use the counter data type to materialize a 
view ordered by the counter.  Computing this post hoc is the way to go if your 
needs allow for it (if not, something like Summingbird or vanilla Storm may be 
necessary).

I might suggest making the primary key for your running-totals-by-day table 
((day), doc_id), because that will make it easy to compute the materialized 
ordered view (SELECT doc_id, count FROM running_totals WHERE day=?), unless 
you expect to have a really large number of documents getting counts each day.
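A sketch of that layout, with placeholder types:

    CREATE TABLE running_totals (
        day    int,
        doc_id bigint,
        count  counter,
        PRIMARY KEY ((day), doc_id)
    );

    -- Everything needed for the materialized ordering, in one partition:
    SELECT doc_id, count FROM running_totals WHERE day = ?;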

For your materialized ordering, I'd suggest a primary key of ((day), count) as 
then for a given day you'll be able to select top by count (SELECT count, 
doc_id FROM doc_counts WHERE day=? ORDER BY count DESC).
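Something like the following, though doc_id probably needs to be part of the 
clustering key as well, so that two documents with the same count on the same 
day don't overwrite each other (placeholder types; a sketch, not a drop-in 
schema):

    CREATE TABLE doc_counts (
        day    int,
        count  bigint,
        doc_id bigint,
        PRIMARY KEY ((day), count, doc_id)
    ) WITH CLUSTERING ORDER BY (count DESC, doc_id ASC);

    -- Highest counts first, thanks to the clustering order:
    SELECT count, doc_id FROM doc_counts WHERE day = ? LIMIT 100;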

One more thing to consider if your users are not all in a single timezone: 
make your time bucket an hour instead of a day, so that you can answer the 
by-day question with the day boundary at each user's local midnight (except 
for the handful of locations that use half-hour timezone offsets) instead of 
a single global midnight.  You can then either keep just each hour in each 
row (and aggregate at read time), or make each row a rolling 24 hours 
(aggregating at write time), depending on which fits your use case better.
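A sketch of the hour-bucketed, aggregate-at-read-time variant (placeholder 
names; storing the hour as a timestamp truncated to the hour in UTC is an 
assumption):

    CREATE TABLE running_totals_hourly (
        doc_id bigint,
        hour   timestamp,   -- truncated to the hour, UTC
        count  counter,
        PRIMARY KEY ((doc_id), hour)
    );

    -- Fetch the 24 hourly buckets spanning the user's local day,
    -- then sum the counts client-side:
    SELECT hour, count FROM running_totals_hourly
    WHERE doc_id = ? AND hour >= ? AND hour < ?;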

On Sun Nov 23 2014 at 8:42:11 AM Robert Wille <rwi...@fold3.com> wrote:
I’m working on moving a bunch of counters out of our relational database to 
Cassandra. For the most part, Cassandra is a very nice fit, except for one 
feature on our website. We manage a time series of view counts for each 
document, and display a list of the most popular documents in the last seven 
days. This seems like a pretty strong anti-pattern for Cassandra, but also 
seems like something a lot of people would want to do. If you’re keeping 
counters, it's pretty likely that you'd want to know which ones have the highest 
counts.

Here’s what I came up with to implement this feature. Create a counter table 
with primary key (doc_id, day) and a single counter. Whenever a document is 
viewed, increment the counter for the document for today and the previous six 
days. Sometime after midnight each day, compile the counters into a table with 
primary key (day, count, doc_id) and no additional columns. For each partition 
in the counter table, I would sum up the counters, delete any counters that are 
over a week old, and put the sum into the second table with day = today. When I 
query the table, I would ask for data where day = yesterday. During the 
compilation process, I would delete old partitions. In theory I'd only need two 
partitions: one that is being built, and one for querying.
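To make the write path concrete, here's a sketch of what I'm describing 
(placeholder names; grouping the seven increments into a single counter batch 
is optional, since seven separate updates would work too):

    CREATE TABLE view_counts (
        doc_id bigint,
        day    int,        -- e.g. days since epoch
        count  counter,
        PRIMARY KEY ((doc_id), day)
    );

    -- On each view, credit today and the previous six days, so each
    -- day's row eventually accumulates a full seven-day window of views:
    BEGIN COUNTER BATCH
        UPDATE view_counts SET count = count + 1 WHERE doc_id = ? AND day = ?;  -- today
        UPDATE view_counts SET count = count + 1 WHERE doc_id = ? AND day = ?;  -- today - 1
        -- ... and likewise for today - 2 through today - 6
    APPLY BATCH;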

I’d be interested to hear critiques on this strategy, as well as hearing how 
other people have implemented a "most-popular" feature using Cassandra counters.

Robert

