That's an interesting refinement! I'll keep it in mind the next time this sort of thing comes up.
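For my own notes, the flooring you describe comes out to something like this in application code (an untested sketch; `floorToBucket` and `bucketsInRange` are names I've made up, and I'm assuming epoch-millis timestamps after 1970 so that integer division floors correctly):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    public class Buckets {
        // Floor a timestamp to the start of its bucket.
        static Instant floorToBucket(Instant ts, Duration bucketSize) {
            long sizeMillis = bucketSize.toMillis();
            return Instant.ofEpochMilli((ts.toEpochMilli() / sizeMillis) * sizeMillis);
        }

        // Enumerate every bucket overlapping [start, end], one partition
        // key per bucket. Data written under a 24-hour bucket also sits
        // on a 6-hour (or 15-minute, etc.) boundary, so this iteration
        // stays correct after the bucket size shrinks.
        static List<Instant> bucketsInRange(Instant start, Instant end, Duration bucketSize) {
            List<Instant> buckets = new ArrayList<>();
            for (Instant b = floorToBucket(start, bucketSize); !b.isAfter(end); b = b.plus(bucketSize)) {
                buckets.add(b);
            }
            return buckets;
        }
    }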
Jim

On Wed, Apr 5, 2017 at 9:22 AM, Eric Stevens <migh...@gmail.com> wrote:

Jim's basic model is similar to how we've solved this exact kind of problem many times. From my own experience, I strongly recommend that you make a `bucket` field in the partition key and a `time` field in the clustering key, both of data type `timestamp`. Then use application logic to floor the bucket to an appropriate interval according to your chosen bucket size.

The reason is that as your data needs grow, the one thing you can be pretty confident in is that your assumptions about data density per partition will eventually turn out to be wrong. That's either because of expanding requirements (you're adding new fields to this table), because of increased application usage (you're being successful!), or because a use case shows up with a different data density per bucket than you anticipated (you're not prescient).

It's easy in application code to adjust your timestamp interval if your keying allows for arbitrary adjustments. Most often you'll end up making smaller buckets as your needs progress. The upshot is that with a little careful selection of bucketing strategy, partition key range iterations are still correct if you adjust from, say, a 24-hour bucket to a 12-hour, 6-hour, 3-hour, 1-hour, 30-minute, 15-minute, or 1-minute bucket. Data written under the larger bucket size still lands on a boundary of the smaller interval, so you don't even need complex logic in the application to adapt to the new bucket size. What you definitely don't want is to paint yourself into a corner where you need a smaller bucket size but your data model didn't leave room for it.

On Tue, Apr 4, 2017 at 2:59 PM Jim Ancona <j...@anconafamily.com> wrote:

The typical recommendation for maximum partition size is on the order of 100 MB and/or 100,000 rows. That's not a hard limit, but you may be setting yourself up for issues as you approach or exceed those numbers.

If you need to reduce partition size, the typical way to do it is "bucketing": adding a synthetic column to the partition key to split the data into separate buckets. In your example above, I assume that your current primary key is (user, timestamp), where user is the partition key and timestamp is the clustering key. Say you want to store at most a year's worth of data in a partition. You would make the primary key ((user, year), timestamp). The partition key is now (user, year), where year is the year part of the timestamp. Now if you want to query the data for the last month, you would do:

    select * from data where user_id = 'foo' and year = 2017
      and timestamp >= '<1 month ago>' and timestamp <= '<today>'

If you wanted to get the data for the last 6 months, you'd do something like:

    select * from data where user_id = 'foo' and year in (2016, 2017)
      and timestamp >= '<6 months ago>' and timestamp <= '<today>'

(Notice that because the query spans two years, you have to include both years in the select criteria so that C* knows which partitions to query.)

You can make the buckets smaller (e.g. weeks, days, or hours instead of years), but of course querying multiple buckets is less efficient, so it's worth making your buckets as large as you can without making them too big.
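In application code, finding the buckets for a given query range is then just a matter of enumerating the years the range touches. Something like this (a rough, untested sketch; `yearsIn` is a name I've made up, and I'm assuming your year buckets are based on UTC):

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.util.ArrayList;
    import java.util.List;

    public class YearBuckets {
        // Collect every year bucket touched by [from, to], so the
        // query can name all the partitions it needs in its IN clause.
        static List<Integer> yearsIn(Instant from, Instant to) {
            int first = from.atZone(ZoneOffset.UTC).getYear();
            int last = to.atZone(ZoneOffset.UTC).getYear();
            List<Integer> years = new ArrayList<>();
            for (int year = first; year <= last; year++) {
                years.add(year);
            }
            return years;
        }
    }

You'd then bind those years into the IN clause of the query above.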
Hope this helps!

Jim

On Mon, Mar 27, 2017 at 8:47 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

I have a use case where the data for individual users is being tracked, and every 15 minutes or so, the data for the past 15 minutes is inserted into the table.

The table schema looks like: user id, timestamp, foo, bar, etc.

Where foo, bar, etc. are the items being tracked, and their values over the past 15 minutes.

I initially planned to use the user id as the primary key of the table. But I realized that this may cause really wide rows (tracking for 24 hours means 96 records inserted, 1 for each 15-minute window; over 1 year this means 36k records per user; over 2 years, 72k; etc.).

I know the limit of wide rows is billions of records, but I've heard that the practical limit is much lower.

So I considered using a composite primary key: (user, timestamp).

If I'm correct, the above should create a new row for each user & timestamp logged.

However, will I still be able to do range queries on the timestamp, to e.g. return the data for the last week?

E.g.:

    select * from data where user_id = 'foo'
      and timestamp >= '<1 month ago>' and timestamp <= '<today>'
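In driver terms, I'm hoping that can look something like this (just a sketch; I'm assuming the DataStax Java driver, and `mykeyspace` and the column names are placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import java.util.Date;
    import java.util.concurrent.TimeUnit;

    public class LastMonth {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("mykeyspace");

            Date now = new Date();
            Date monthAgo = new Date(now.getTime() - TimeUnit.DAYS.toMillis(30));

            // Range scan on the clustering column within one partition.
            ResultSet rs = session.execute(
                "SELECT * FROM data WHERE user_id = ? AND timestamp >= ? AND timestamp <= ?",
                "foo", monthAgo, now);
            for (Row row : rs) {
                System.out.println(row);
            }
            cluster.close();
        }
    }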