That's an interesting refinement! I'll keep it in mind the next time this sort of thing comes up.
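For my own notes, the flooring you describe comes out to something like this in application code (an untested sketch; `floorToBucket` and `bucketsInRange` are names I've made up, and I'm assuming epoch-millis timestamps after 1970 so that integer division floors correctly):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    public class Buckets {
        // Floor a timestamp to the start of its bucket.
        static Instant floorToBucket(Instant ts, Duration bucketSize) {
            long sizeMillis = bucketSize.toMillis();
            return Instant.ofEpochMilli((ts.toEpochMilli() / sizeMillis) * sizeMillis);
        }

        // Enumerate every bucket overlapping [start, end], one partition
        // key per bucket. Data written under a 24-hour bucket also sits
        // on a 6-hour (or 15-minute, etc.) boundary, so this iteration
        // stays correct after the bucket size shrinks.
        static List<Instant> bucketsInRange(Instant start, Instant end, Duration bucketSize) {
            List<Instant> buckets = new ArrayList<>();
            for (Instant b = floorToBucket(start, bucketSize); !b.isAfter(end); b = b.plus(bucketSize)) {
                buckets.add(b);
            }
            return buckets;
        }
    }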
Jim

On Wed, Apr 5, 2017 at 9:22 AM, Eric Stevens <migh...@gmail.com> wrote:

Jim's basic model is similar to how we've solved this exact kind of problem many times. From my own experience, I strongly recommend that you make a `bucket` field in the partition key and a `time` field in the clustering key, both of data type `timestamp`. Then use application logic to floor the bucket to an appropriate interval according to your chosen bucket size.

The reason is that as your data needs grow, the one thing you can be pretty confident in is that your assumptions about data density per partition will eventually turn out to be wrong. That's either because of expanding requirements (you're adding new fields to this table), because of increased application usage (you're being successful!), or because a use case shows up with a different data density per bucket than you anticipated (you're not prescient).

It's easy in application code to adjust your timestamp interval if your keying allows for arbitrary adjustments. Most often you'll end up making smaller buckets as your needs progress. The upshot is that with a little careful selection of bucketing strategy, partition key range iterations are still correct if you adjust from, say, a 24-hour bucket to a 12-hour, 6-hour, 3-hour, 1-hour, 30-minute, 15-minute, or 1-minute bucket. Data written under the larger bucket size still lands on a boundary of the smaller interval, so you don't even need complex logic in the application to adapt to the new bucket size. What you definitely don't want is to paint yourself into a corner where you need a smaller bucket size but your data model didn't leave room for it.

On Tue, Apr 4, 2017 at 2:59 PM Jim Ancona <j...@anconafamily.com> wrote:

The typical recommendation for maximum partition size is on the order of 100 MB and/or 100,000 rows. That's not a hard limit, but you may be setting yourself up for issues as you approach or exceed those numbers.

If you need to reduce partition size, the typical way to do it is "bucketing": adding a synthetic column to the partition key to split the data into separate buckets. In your example above, I assume that your current primary key is (user, timestamp), where user is the partition key and timestamp is the clustering key. Say you want to store at most a year's worth of data in a partition. You would make the primary key ((user, year), timestamp). The partition key is now (user, year), where year is the year part of the timestamp. Now if you want to query the data for the last month, you would do:

    select * from data where user_id = 'foo' and year = 2017
      and timestamp >= '<1 month ago>' and timestamp <= '<today>'

If you wanted to get the data for the last 6 months, you'd do something like:

    select * from data where user_id = 'foo' and year in (2016, 2017)
      and timestamp >= '<6 months ago>' and timestamp <= '<today>'

(Notice that because the query spans two years, you have to include both years in the select criteria so that C* knows which partitions to query.)

You can make the buckets smaller (e.g. weeks, days, or hours instead of years), but of course querying multiple buckets is less efficient, so it's worth making your buckets as large as you can without making them too big.
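In application code, finding the buckets for a given query range is then just a matter of enumerating the years the range touches. Something like this (a rough, untested sketch; `yearsIn` is a name I've made up, and I'm assuming your year buckets are based on UTC):

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.util.ArrayList;
    import java.util.List;

    public class YearBuckets {
        // Collect every year bucket touched by [from, to], so the
        // query can name all the partitions it needs in its IN clause.
        static List<Integer> yearsIn(Instant from, Instant to) {
            int first = from.atZone(ZoneOffset.UTC).getYear();
            int last = to.atZone(ZoneOffset.UTC).getYear();
            List<Integer> years = new ArrayList<>();
            for (int year = first; year <= last; year++) {
                years.add(year);
            }
            return years;
        }
    }

You'd then bind those years into the IN clause of the query above.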
Hope this helps!

Jim

On Mon, Mar 27, 2017 at 8:47 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

I have a use case where the data for individual users is being tracked, and every 15 minutes or so, the data for the past 15 minutes is inserted into the table.

The table schema looks like: user id, timestamp, foo, bar, etc.

Where foo, bar, etc. are the items being tracked, and their values over the past 15 minutes.

I initially planned to use the user id as the primary key of the table. But I realized that this may cause really wide rows (tracking for 24 hours means 96 records inserted, 1 for each 15-minute window; over 1 year this means 36k records per user; over 2 years, 72k; etc.).

I know the limit of wide rows is billions of records, but I've heard that the practical limit is much lower.

So I considered using a composite primary key: (user, timestamp).

If I'm correct, the above should create a new row for each user & timestamp logged.

However, will I still be able to do range queries on the timestamp, to e.g. return the data for the last week?

E.g.:

    select * from data where user_id = 'foo'
      and timestamp >= '<1 month ago>' and timestamp <= '<today>'
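In driver terms, I'm hoping that can look something like this (just a sketch; I'm assuming the DataStax Java driver, and `mykeyspace` and the column names are placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import java.util.Date;
    import java.util.concurrent.TimeUnit;

    public class LastMonth {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("mykeyspace");

            Date now = new Date();
            Date monthAgo = new Date(now.getTime() - TimeUnit.DAYS.toMillis(30));

            // Range scan on the clustering column within one partition.
            ResultSet rs = session.execute(
                "SELECT * FROM data WHERE user_id = ? AND timestamp >= ? AND timestamp <= ?",
                "foo", monthAgo, now);
            for (Row row : rs) {
                System.out.println(row);
            }
            cluster.close();
        }
    }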