Re: best practices for time-series data with massive amounts of records

Clint Kelly Fri, 06 Mar 2015 14:44:42 -0800

Hi all,

Thanks for the responses, this was very helpful.

I don't know yet what the distribution of clicks and users will be, but I
expect to see a few users with an enormous number of interactions and most
users having very few.  The idea of doing some additional manual
partitioning, and then maintaining another table that contains the "head"
partition for each user, makes sense, although it would add latency
when we want to get, say, the most recent 1000 interactions for a
given user (which is something that we have to do sometimes for
applications with tight SLAs).
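
To make the latency concern concrete, here is a minimal sketch of the
head-partition idea. It is an in-memory stand-in, not Cassandra code: the two
dicts play the role of the two tables, and BUCKET_SIZE, record() and
most_recent() are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical cap on interactions per bucket (tiny here, for illustration).
BUCKET_SIZE = 3

# Stand-ins for two tables: interactions keyed by (user, bucket), and a
# lookup table holding each user's current "head" (newest) bucket.
interactions = defaultdict(list)  # (user_id, bucket) -> events, oldest first
head_bucket = defaultdict(int)    # user_id -> newest bucket number


def record(user_id, event):
    """Append an event, rolling over to a new bucket when the head is full."""
    bucket = head_bucket[user_id]
    if len(interactions[(user_id, bucket)]) >= BUCKET_SIZE:
        bucket += 1
        head_bucket[user_id] = bucket
    interactions[(user_id, bucket)].append(event)


def most_recent(user_id, limit):
    """Return (events newest first, number of lookups performed)."""
    lookups = 1  # the extra read against the head-bucket table
    bucket = head_bucket.get(user_id, 0)
    result = []
    # Walk backwards from the head bucket until enough events are collected.
    while bucket >= 0 and len(result) < limit:
        lookups += 1
        rows = interactions.get((user_id, bucket), [])
        result.extend(reversed(rows))
        bucket -= 1
    return result[:limit], lookups
```

Every read pays one lookup for the head bucket before it touches any
interaction data, and a request that spans several buckets pays one more
lookup per bucket. That extra round trip is the cost to weigh against the
SLA.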

FWIW I doubt that any users will have so many interactions that they exceed
what we could reasonably put in a row, but I wanted to have a strategy in
case they do.

Having a nice design pattern in Cassandra for maintaining a row with the
N-most-recent interactions would also solve this reasonably well, but I
don't know of any way to implement that without running batch jobs that
periodically clean out data (which might be okay).
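
A rough sketch of the batch-cleanup variant, again as an in-memory stand-in
rather than Cassandra code. N_MOST_RECENT, record(), read_recent() and
trim_all() are hypothetical names; the assumption is that writes simply
append, reads take the newest N, and a periodic job deletes everything older.

```python
from collections import defaultdict

# Hypothetical cap on how many interactions to retain per user.
N_MOST_RECENT = 3

# Stand-in for one wide row per user: user_id -> [(timestamp, event), ...]
rows = defaultdict(list)


def record(user_id, timestamp, event):
    """Writes only append; nothing is cleaned up on the write path."""
    rows[user_id].append((timestamp, event))


def read_recent(user_id, n=N_MOST_RECENT):
    """Return the n newest interactions, newest first."""
    return sorted(rows.get(user_id, []), reverse=True)[:n]


def trim_all(n=N_MOST_RECENT):
    """Periodic batch job: keep the n newest entries per user.

    Returns the number of entries deleted.
    """
    deleted = 0
    for entries in rows.values():
        entries.sort(reverse=True)           # newest first
        deleted += max(0, len(entries) - n)  # count what falls off the end
        del entries[n:]
    return deleted
```

Reads stay correct between cleanup runs because they always take the newest
N; the batch job only bounds how large each row can grow.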

Best regards,
Clint

On Tue, Mar 3, 2015 at 8:10 AM, mck <m...@apache.org> wrote:

>
> > Here "partition" is a random digit from 0 to (N*M)
> > where N=nodes in cluster, and M=arbitrary number.
>
>
> Hopefully it was obvious, but here (unless you've got hot partitions),
> you don't need N.
> ~mck
>
