Re: best practices for time-series data with massive amounts of records
It's probably quite rare for queries against extremely large time-series data to touch the whole set. Instead there's almost always a "between X and Y dates" aspect to nearly every real-time query you might run against a table like this (with the exception of most-recent-N events). Because of this, time bucketing can be an effective strategy, though until you understand your data better, it's hard to know how large (or small) to make your buckets.

Because of *that*, I recommend using a timestamp data type for your bucketing strategy - this gives you the advantage of being able to reduce your bucket sizes while keeping your at-rest data mostly still quite accessible. What I mean is that if you change your bucketing strategy from day to hour, when you query across that changed time period, you can iterate at the finer granularity (hour), and you'll pick up the coarser granularity (day) automatically for all but the earliest bucket (which is easy to correct for when you're flooring your start bucket). In the coarser time period, most reads are partition-key misses, which are extremely inexpensive in Cassandra.

If you do need most-recent-N queries for broad ranges and you expect to have some users whose click rate is dramatically less frequent than your bucket interval (making iterating over buckets inefficient), you can keep a separate counter table with a primary key of ((user_id), bucket) in which you count new events. Now you can identify the exact set of buckets you need to read to satisfy the query no matter what the user's click volume is: very-low-volume users have at most N partition keys queried, and higher-volume users query fewer partition keys.

On Fri, Mar 6, 2015 at 4:06 PM, graham sanderson gra...@vast.com wrote: Note that using static column(s) for the “head” value, and trailing TTLed values behind is something we’re considering.
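A minimal sketch of the mixed-granularity iteration described above, assuming buckets are keyed by a floored timestamp (the helper names are hypothetical): iterating at hour granularity naturally reaches day-keyed partitions too, because every day-floored timestamp is also a valid hour-floored one; only the range start needs an extra day-floored key.

```python
from datetime import datetime, timedelta

def floor_to_hour(ts):
    return ts.replace(minute=0, second=0, microsecond=0)

def floor_to_day(ts):
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)

def bucket_keys(start, end):
    """Hour-granularity bucket keys covering [start, end].

    One extra day-floored key is prepended so data written under the older,
    coarser (day) bucketing scheme is still reached for the earliest bucket;
    the remaining day-keyed partitions coincide with midnight hour keys and
    are picked up automatically. Most coarse-era lookups will be
    partition-key misses, which are cheap in Cassandra.
    """
    keys = []
    day0 = floor_to_day(start)
    cur = floor_to_hour(start)
    if day0 != cur:
        keys.append(day0)  # correct for the earliest coarse bucket
    while cur <= end:
        keys.append(cur)
        cur += timedelta(hours=1)
    return keys
```

Each returned key would then be combined with the user id to form the partition keys to query.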
Re: best practices for time-series data with massive amounts of records
Hi all,

Thanks for the responses, this was very helpful. I don't know yet what the distribution of clicks and users will be, but I expect to see a few users with an enormous number of interactions and most users having very few.

The idea of doing some additional manual partitioning, and then maintaining another table that contains the head partition for each user, makes sense, although it would add additional latency when we want to get, say, the most recent 1000 interactions for a given user (which is something that we have to do sometimes for applications with tight SLAs).

FWIW I doubt that any users will have so many interactions that they exceed what we could reasonably put in a row, but I wanted to have a strategy to deal with this. Having a nice design pattern in Cassandra for maintaining a row with the N most recent interactions would also solve this reasonably well, but I don't know of any way to implement that without running batch jobs that periodically clean out data (which might be okay).

Best regards,
Clint

On Tue, Mar 3, 2015 at 8:10 AM, mck m...@apache.org wrote:
Here partition is a random digit from 0 to (N*M) where N=nodes in cluster, and M=arbitrary number.
Hopefully it was obvious, but here (unless you've got hot partitions), you don't need N.
~mck
Re: best practices for time-series data with massive amounts of records
Note that using static column(s) for the “head” value, and trailing TTLed values behind, is something we’re considering.

Note this is especially nice if your head state includes, say, a map which is updated by small deltas (individual keys). We have not yet studied the effect of static columns on, say, DTCS.

On Mar 6, 2015, at 4:42 PM, Clint Kelly clint.ke...@gmail.com wrote:
Hi all, Thanks for the responses, this was very helpful.
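As a rough illustration of the TTL half of this idea (a sketch of the semantics only, not graham's actual implementation), rows written with USING TTL simply stop being visible once they expire, so a partition of trailing events trims itself without any batch cleanup job:

```python
def live_events(rows, now):
    """Filter rows the way Cassandra would after TTL expiry.

    rows: list of (write_time, ttl_seconds, payload) tuples, standing in
    for cells written with INSERT ... USING TTL. A cell is readable only
    while write_time + ttl > now; after that Cassandra treats it as
    expired and compaction eventually drops it, so the partition never
    needs a periodic cleanup job.
    """
    return [payload for (write_time, ttl, payload) in rows if write_time + ttl > now]
```

The "head" value would live in a static column on the same partition, so the latest state stays readable indefinitely even after all trailing events have expired.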
Re: best practices for time-series data with massive amounts of records
Hi,

I have not done something similar, however I have some comments.

On Mon, Mar 2, 2015 at 8:47 PM, Clint Kelly clint.ke...@gmail.com wrote:

The downside of this approach is that we can no longer do a simple continuous scan to get all of the events for a given user.

Sure, but would you really do that in real time anyway? :) If you have billions of events, that's not going to scale anyway. Also, if you have 10 events per bucket, the latency introduced by batching should be manageable.

Some users may log lots and lots of interactions every day, while others may interact with our application infrequently,

This is another reason to split them up into buckets, to make the cluster partitions more manageable and homogeneous.

so I'd like a quick way to get the most recent interaction for a given user.

For this you could actually have a second table that stores the last_time_bucket for a user. Upon event write, you could simply do an update of the last_time_bucket. You could even have an index of all time buckets per user if you want.

Has anyone used different approaches for this problem? The only thing I can think of is to use the second table schema described above, but switch to an order-preserving hashing function, and then manually hash the id field. This is essentially what we would do in HBase.

Like you might already know, this order-preserving hashing is _not_ considered best practice in the Cassandra world.

Cheers,
Jens

--
Jens Rantil
Backend engineer
Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
Facebook https://www.facebook.com/#!/tink.se
Linkedin http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary
Twitter https://twitter.com/tink
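A toy sketch of the last_time_bucket idea, with plain dicts standing in for the two Cassandra tables (all names here are made up for illustration): every write also upserts the user's latest bucket, so a "most recent interactions" read knows exactly which partition to hit.

```python
events = {}            # (user_id, bucket) -> list of events, like the bucketed events table
last_time_bucket = {}  # user_id -> latest bucket, like Jens's second table

def write_event(user_id, bucket, event):
    """Append the event to its time bucket and upsert the user's latest bucket."""
    events.setdefault((user_id, bucket), []).append(event)
    last_time_bucket[user_id] = max(bucket, last_time_bucket.get(user_id, bucket))

def latest_events(user_id):
    """Read path: one lookup to find the latest bucket, one to fetch it."""
    bucket = last_time_bucket.get(user_id)
    return events.get((user_id, bucket), [])
```

In Cassandra the `last_time_bucket` update would just be an ordinary upsert riding along with each event write.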
Re: best practices for time-series data with massive amounts of records
I'd recommend using 100K and 10M as rough guidelines for the maximum number of rows and bytes in a single partition. Sure, Cassandra can technically handle a lot more than that, but very large partitions can make your life more difficult.

Of course you will have to do a POC to validate the sweet spot for your particular app, data model, actual data values, hardware, app access patterns, and app latency requirements. It may be that your actual numbers should be half or twice my guidance, but they are a starting point.

Back to your starting point: you really need to characterize the number of records per user. For example, will you have a large number of users with few records? IOW, what are the expected distributions for user count and records-per-user count? Give some specific numbers. Even if you don't know what the real numbers will be, you have to at least have a model for counts before modeling the partition keys.

-- Jack Krupansky

On Mon, Mar 2, 2015 at 2:47 PM, Clint Kelly clint.ke...@gmail.com wrote:
Hi all, I am designing an application that will capture time series data where we expect the number of records per user to potentially be extremely high. I am not sure if we will eclipse the max row size of 2B elements, but I assume that we would not want our application to approach that size anyway.
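Jack's guideline numbers lend themselves to a quick back-of-the-envelope check when sizing buckets; a sketch (the constants are his rough figures, not hard limits):

```python
MAX_ROWS_GUIDELINE = 100_000        # ~100K rows per partition
MAX_BYTES_GUIDELINE = 10_000_000    # ~10 MB per partition

def partition_within_guidelines(rows_per_partition, avg_row_bytes):
    """Rough check of the 100K-rows / 10MB-per-partition rule of thumb.

    If a candidate bucket size fails this check for your modeled
    records-per-user distribution, pick a finer bucket (or more
    sub-partitions) before committing to the schema.
    """
    return (rows_per_partition <= MAX_ROWS_GUIDELINE
            and rows_per_partition * avg_row_bytes <= MAX_BYTES_GUIDELINE)
```

For example, 50K events of ~100 bytes each fits comfortably, while 50K events of ~500 bytes (25 MB) would suggest smaller buckets.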
Re: best practices for time-series data with massive amounts of records
Hello,

You can use a timeuuid as the raw key and create a separate CF to be used for indexing. The indexing CF may either use user_id as its key, or - a better approach - partition rows by timestamp. In that case you can create a compound key in which you store the user_id and a timestamp base. For example, keeping 8 of the 13 digits of a millisecond timestamp starts a new row every 10^5 ms, i.e. roughly every 100 seconds; you can play with the number of digits kept (and so the time span per row) depending on the number of records you are receiving. I am creating a new row each ~11 days, so it's about 35 rows per year, per user.

In each column you can store the timeuuid as the name with an empty value. This way you keep your data ordered by time. The only disadvantage of this approach is that you have to glue your data together when you finish reading one index row and start another (both asc and desc).

When reading data you should first get a slice from the index, depending on your needs, and then do a multi-range read from the original CF based on the slice received.

Hope it helps,
Best regards,
Yulian Oifa

On Mon, Mar 2, 2015 at 9:47 PM, Clint Kelly clint.ke...@gmail.com wrote:
Hi all, I am designing an application that will capture time series data where we expect the number of records per user to potentially be extremely high. I am not sure if we will eclipse the max row size of 2B elements, but I assume that we would not want our application to approach that size anyway.
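The truncated-timestamp row key can be sketched as follows; `keep_digits` is a hypothetical parameter for how many leading digits of a 13-digit epoch-millisecond timestamp survive (keeping 4 digits gives one row per ~11.6 days, close to the ~11-day rows described above):

```python
def index_row_key(user_id, ts_millis, keep_digits):
    """Compound key for the index CF: user_id plus a truncated timestamp.

    Truncating a 13-digit millisecond timestamp to its first `keep_digits`
    digits starts a new index row every 10**(13 - keep_digits) ms, so
    fewer kept digits means coarser (longer-lived) rows.
    """
    return (user_id, ts_millis // 10 ** (13 - keep_digits))
```

All events whose timestamps share the kept prefix land in the same index row, preserving time order within it.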
Re: best practices for time-series data with massive amounts of records
Here partition is a random digit from 0 to (N*M) where N=nodes in cluster, and M=arbitrary number.

Hopefully it was obvious, but here (unless you've got hot partitions), you don't need N.

~mck
Re: best practices for time-series data with massive amounts of records
Clint,

CREATE TABLE events (
  id text,
  date text, // Could also use year+month here or year+week or something else
  event_time timestamp,
  event blob,
  PRIMARY KEY ((id, date), event_time))
WITH CLUSTERING ORDER BY (event_time DESC);

The downside of this approach is that we can no longer do a simple continuous scan to get all of the events for a given user. Some users may log lots and lots of interactions every day, while others may interact with our application infrequently, so I'd like a quick way to get the most recent interaction for a given user. Has anyone used different approaches for this problem?

One idea is to provide additional manual partitioning like…

CREATE TABLE events (
  user_id text,
  partition int,
  event_time timeuuid,
  event_json text,
  PRIMARY KEY ((user_id, partition), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
  AND compaction={'class': 'DateTieredCompactionStrategy'};

Here partition is a random digit from 0 to (N*M), where N=nodes in cluster and M=arbitrary number. Read performance is going to suffer a little because you need to query N*M times as many partition keys for each read, but it should be constant enough that it comes down to increasing the cluster's hardware and scaling out as need be. The multi-key reads you can do with a SELECT…IN query, or better yet with parallel reads (less pressure on the coordinator at the expense of extra network calls).

Starting with M=1, you have the option to increase it over time if the row counts in partitions for any users get too high. (We do¹ something similar for storing all raw events in our enterprise platform, but because the data is not user-centric the initial partition key is minute-by-minute time buckets, and M has remained at 1 the whole time.)

This approach is better than using an order-preserving partitioner (really, don't do that). I would also consider replacing event blob with event text, choosing JSON instead of any binary serialisation. We've learnt the hard way the value of data transparency, and I'm guessing the storage cost is small given C* compression.

Otherwise the advice here is largely repeating what Jens has already said.

~mck

¹ slides 19-20 from https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/
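A sketch of the read side of this scheme, assuming the client fetches each of a user's sub-partitions (which Cassandra already returns newest-first, per the clustering order) and merges them; `M` and the helper names are illustrative, and the real N*M spread works the same way:

```python
import heapq
import random

M = 4  # sub-partitions per user; start at 1 and grow as partitions fill up

def choose_partition(user_id):
    """Write path: spread each new event randomly across the user's sub-partitions."""
    return (user_id, random.randrange(M))

def most_recent(events_by_partition, user_id, n):
    """Fan-out read: pull all M sub-partitions (each a list of
    (event_time, payload) tuples sorted newest-first) and merge them,
    keeping only the newest n events overall."""
    streams = [events_by_partition.get((user_id, p), []) for p in range(M)]
    merged = heapq.merge(*streams, key=lambda ev: ev[0], reverse=True)
    return [ev for _, ev in zip(range(n), merged)]
```

In practice the M fetches would be issued as parallel async queries (or a SELECT…IN), exactly as described above; the merge itself is cheap because each stream is already ordered.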
best practices for time-series data with massive amounts of records
Hi all,

I am designing an application that will capture time series data where we expect the number of records per user to potentially be extremely high. I am not sure if we will eclipse the max row size of 2B elements, but I assume that we would not want our application to approach that size anyway.

If we wanted to put all of the interactions in a single row, then I would make a data model that looks like:

CREATE TABLE events (
  id text,
  event_time timestamp,
  event blob,
  PRIMARY KEY (id, event_time))
WITH CLUSTERING ORDER BY (event_time DESC);

The best practice for breaking up large rows of time series data is, as I understand it, to put part of the time into the partitioning key (http://planetcassandra.org/getting-started-with-time-series-data-modeling/):

CREATE TABLE events (
  id text,
  date text, // Could also use year+month here or year+week or something else
  event_time timestamp,
  event blob,
  PRIMARY KEY ((id, date), event_time))
WITH CLUSTERING ORDER BY (event_time DESC);

The downside of this approach is that we can no longer do a simple continuous scan to get all of the events for a given user. Some users may log lots and lots of interactions every day, while others may interact with our application infrequently, so I'd like a quick way to get the most recent interaction for a given user.

Has anyone used different approaches for this problem? The only thing I can think of is to use the second table schema described above, but switch to an order-preserving hashing function, and then manually hash the id field. This is essentially what we would do in HBase.

Curious if anyone else has any thoughts.

Best regards,
Clint
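For concreteness, the date column in the second schema would be derived from the event time at write time; a small sketch (the helper name is illustrative):

```python
from datetime import datetime

def partition_key(user_id, event_time):
    """Partition key for the (id, date) schema above: the event's date
    becomes the `date` text column, splitting each user's time series
    into one partition per day. A coarser string such as year+month
    ("%Y-%m") works the same way with bigger partitions."""
    return (user_id, event_time.strftime("%Y-%m-%d"))
```

Both the write and the read side must derive the same string, which is why the bucketing rule effectively becomes part of the schema contract.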