Re: Time-series data model

Dan Di Spaltro Thu, 15 Apr 2010 11:08:43 -0700

This is actually fairly similar to how we store metrics at Cloudkick.
The blog post below explains some of that in much more depth:


https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

So we store each natural point in the NumericArchive table; the rollups
go in the Rollup* tables.

<ColumnFamily CompareWith="LongType"
              Name="NumericArchive" />

<ColumnFamily CompareWith="LongType" Name="Rollup5m"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup20m"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup30m"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup60m"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup4h"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup12h"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup1d"
ColumnType="Super" CompareSubcolumnsWith="BytesType" />

Our keys look like:
<serviceuuid>.<metric-name>
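
As a rough illustration of the layout above (not our actual code), here is
a minimal Python sketch of building the row key and aligning a timestamp to
a rollup bucket. The function names are hypothetical, and it assumes Unix
timestamps in seconds and buckets aligned to multiples of the rollup width;
only the key format and the column family names come from the config above.

```python
# Rollup widths in seconds, derived from the column family names above.
ROLLUP_SECONDS = {
    "Rollup5m": 5 * 60,
    "Rollup20m": 20 * 60,
    "Rollup30m": 30 * 60,
    "Rollup60m": 60 * 60,
    "Rollup4h": 4 * 60 * 60,
    "Rollup12h": 12 * 60 * 60,
    "Rollup1d": 24 * 60 * 60,
}


def row_key(service_uuid, metric_name):
    """Build a row key of the form <serviceuuid>.<metric-name>."""
    return "%s.%s" % (service_uuid, metric_name)


def bucket_start(timestamp, column_family):
    """Round a Unix timestamp (seconds) down to the start of its bucket.

    Assumption: buckets are aligned to multiples of the rollup width.
    """
    width = ROLLUP_SECONDS[column_family]
    return timestamp - (timestamp % width)
```

With this alignment, every point that falls in the same 5-minute window maps
to the same column name in Rollup5m, so a rollup can be updated in place.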

Anyway, this has been working out very well for us.

2010/4/15 Ted Zlatanov <t...@lifelogs.com>:
> On Thu, 15 Apr 2010 11:27:47 +0200 Jean-Pierre Bergamin <ja...@ractive.ch> wrote:
>
> JB> On 14.04.2010 15:22, Ted Zlatanov wrote:
>>> On Wed, 14 Apr 2010 15:02:29 +0200 "Jean-Pierre Bergamin" <ja...@ractive.ch> wrote:
>>>
> JB> The metrics are stored together with a timestamp. The queries we want to
> JB> perform are:
> JB> * The last value of a specific metric of a device
> JB> * The values of a specific metric of a device between two timestamps
> JB>   t1 and t2
>>>
>>> Make your key "devicename-metricname-YYYYMMDD-HHMM" (with whatever time
>>> sharding makes sense to you; I use UTC by-hours and by-day in my
>>> environment).  Then your supercolumn is the collection time as a
>>> LongType and your columns inside the supercolumn can express the metric
>>> in detail (collector agent, detailed breakdown, etc.).
>>>
> JB> Just for my understanding: what is "time sharding"? I couldn't find an
> JB> explanation anywhere. Do you mean that the time-series data is rolled
> JB> up in 5 minutes, 1 hour, 1 day etc. slices?
>
> Yes.  The usual meaning of "shard" in RDBMS world is to segment your
> database by some criteria, e.g. US vs. Europe in Amazon AWS because
> their data centers are laid out so.  I was taking a linguistic shortcut
> to mean "break down your rows by some convenient criteria."  You can
> actually set up your Partitioner in Cassandra to literally shard your
> keyspace rows based on the key, but I just meant "slice" in my note.
>
> JB> So this would be defined as:
> JB> <ColumnFamily Name="measurements" ColumnType="Super"
> JB> CompareWith="UTF8Type"  CompareSubcolumnsWith="LongType" />
>
> JB> So when I want to read all values of one metric between two timestamps
> JB> t0 and t1, I'd have to read the supercolumns that match a key range
> JB> (device1:metric1:t0 - device1:metric1:t1) and then all the
> JB> supercolumns for this key?
>
> Yes.  This is a single multiget if you can construct the key range
> explicitly.  Cassandra loads a lot of this in memory already and filters
> it after the fact, that's why it pays to slice your keys and to stitch
> them together on the client side if you have to go across a time
> boundary.  You'll also get better key load balancing with deeper slicing
> if you use the randomizing partitioner.
>
> In the result set, you'll get each matching supercolumn with all the
> columns inside it.  You may have to page through supercolumns.
>
> Ted
>
>
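
For the time-sharded keys Ted describes above, stitching a query together
across a time boundary on the client side could look roughly like the
sketch below. It assumes by-hour UTC sharding and the
"devicename-metricname-YYYYMMDD-HHMM" key format from the quoted message;
the function names are hypothetical and the Cassandra call itself is left
out.

```python
from datetime import datetime, timedelta


def shard_key(device, metric, when):
    """Key of the form devicename-metricname-YYYYMMDD-HHMM, one row per UTC hour."""
    hour = when.replace(minute=0, second=0, microsecond=0)
    return "%s-%s-%s" % (device, metric, hour.strftime("%Y%m%d-%H%M"))


def shard_keys_between(device, metric, t0, t1):
    """List every hourly row key touched by the interval [t0, t1]."""
    current = t0.replace(minute=0, second=0, microsecond=0)
    keys = []
    while current <= t1:
        keys.append(shard_key(device, metric, current))
        current += timedelta(hours=1)
    return keys


if __name__ == "__main__":
    start = datetime(2010, 4, 15, 10, 30)
    end = datetime(2010, 4, 15, 12, 5)
    for key in shard_keys_between("device1", "metric1", start, end):
        print(key)
```

The resulting keys can go into a single multiget, and any supercolumns whose
collection time falls outside [t0, t1] in the first and last rows can be
dropped on the client.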



-- 
Dan Di Spaltro
