I’m working on moving a bunch of counters out of our relational database to 
Cassandra. For the most part, Cassandra is a very nice fit, except for one 
feature on our website. We manage a time series of view counts for each 
document, and display a list of the most popular documents in the last seven 
days. This seems like a pretty strong anti-pattern for Cassandra, but also 
seems like something a lot of people would want to do. If you’re keeping 
counters, its pretty likely that you’d want to know which ones have the highest 
counts. 

Here’s what I came up with to implement this feature. Create a counter table 
with primary key (doc_id, day) and a single counter. Whenever a document is 
viewed, increment the counter for the document for today and the previous six 
days. Sometime after midnight each day, compile the counters into a table with 
primary key (day, count, doc_id) and no additional columns. For each partition 
in the counter table, I would sum up the counters, delete any counters that are 
over a week old, and put the sum into the second table with day = today. When I 
query the table, i would ask for data where day = yesterday. During the 
compilation process, I would delete old partitions. In theory I’d only need two 
partitions. One that is being built, and one for querying.

I’d be interested to hear critiques on this strategy, as well as hearing how 
other people have implemented a "most-popular" feature using Cassandra counters.

Robert

Reply via email to