Hello
You can use a timeuuid as the raw row key and create a separate CF to be used for indexing.
The indexing CF can either use user_id as its key, or, a better approach, partition its rows by timestamp.
In the partitioned case you create a compound key that stores the user_id together with a timestamp-based bucket. For example, if you drop the last 8 of the 13 digits of a millisecond timestamp, a new row is created every 100,000 seconds (approximately a day, a bit more), and the maximum number of rows per user would be 100K. Of course you can tune how much time each row covers depending on the number of records you are receiving; I am creating a new row every ~11 days, which works out to about 35 rows per user per year.
In each column you store a timeuuid as the column name with an empty value.
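As a rough CQL sketch of this layout (table and column names here are only illustrative, and the bucket is assumed to be the millisecond timestamp with its last 8 digits dropped):

// Raw events, keyed by timeuuid
CREATE TABLE events_raw (
  event_id timeuuid PRIMARY KEY,
  event blob
);

// Index: compound partition key of user_id + time bucket,
// timeuuid clustering column, no separate value needed
CREATE TABLE events_by_user (
  user_id text,
  bucket bigint,   // e.g. event timestamp in ms divided by 100000000
  event_id timeuuid,
  PRIMARY KEY ((user_id, bucket), event_id)
) WITH CLUSTERING ORDER BY (event_id DESC);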

This way you keep your data ordered by time. The only disadvantage of this
approach is that you have to "glue" results together when you finish reading one
index row and start the next one (both asc and desc).

When reading data you should first get a slice from the index, depending on your
needs, and then do a multiget against the original CF for the keys returned by that slice.
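In CQL terms, reusing the hypothetical table names from the sketch above (the bucket value and time bounds are just examples), one time window would be read roughly like this:

// 1) Slice the index for one bucket of one user;
//    repeat for each bucket that overlaps the window, then glue the results
SELECT event_id FROM events_by_user
 WHERE user_id = 'user42'
   AND bucket = 14251   // ms timestamp / 100000000
   AND event_id >= minTimeuuid('2015-03-01 00:00+0000')
   AND event_id < minTimeuuid('2015-03-02 00:00+0000');

// 2) Multiget the actual events for the ids returned by step 1
SELECT event FROM events_raw WHERE event_id IN (...);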
Hope it helps
Best regards
Yulian Oifa



On Mon, Mar 2, 2015 at 9:47 PM, Clint Kelly <clint.ke...@gmail.com> wrote:

> Hi all,
>
> I am designing an application that will capture time series data where we
> expect the number of records per user to potentially be extremely high.  I
> am not sure if we will eclipse the max row size of 2B elements, but I
> assume that we would not want our application to approach that size anyway.
>
> If we wanted to put all of the interactions in a single row, then I would
> make a data model that looks like:
>
> CREATE TABLE events (
>   id text,
>   event_time timestamp,
>   event blob,
>   PRIMARY KEY (id, event_time))
> WITH CLUSTERING ORDER BY (event_time DESC);
>
> The best practice for breaking up large rows of time series data is, as I
> understand it, to put part of the time into the partitioning key (
> http://planetcassandra.org/getting-started-with-time-series-data-modeling/
> ):
>
> CREATE TABLE events (
>   id text,
>   date text, // Could also use year+month, year+week, or something else
>   event_time timestamp,
>   event blob,
>   PRIMARY KEY ((id, date), event_time))
> WITH CLUSTERING ORDER BY (event_time DESC);
>
> The downside of this approach is that we can no longer do a simple
> continuous scan to get all of the events for a given user.  Some users may
> log lots and lots of interactions every day, while others may interact with
> our application infrequently, so I'd like a quick way to get the most
> recent interaction for a given user.
>
> Has anyone used different approaches for this problem?
>
> The only thing I can think of is to use the second table schema described
> above, but switch to an order-preserving hashing function, and then
> manually hash the "id" field.  This is essentially what we would do in
> HBase.
>
> Curious if anyone else has any thoughts.
>
> Best regards,
> Clint
>
>
>
