This is actually fairly similar to how we store metrics at Cloudkick. The post below has a much more in-depth explanation of some of that:
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

So we store each natural point in the NumericArchive table.

<ColumnFamily CompareWith="LongType" Name="NumericArchive" />
<ColumnFamily CompareWith="LongType" Name="Rollup5m" ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup20m" ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup30m" ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup60m" ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup4h" ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup12h" ColumnType="Super" CompareSubcolumnsWith="BytesType" />
<ColumnFamily CompareWith="LongType" Name="Rollup1d" ColumnType="Super" CompareSubcolumnsWith="BytesType" />

Our keys look like:

<serviceuuid>.<metric-name>

Anyways, this has been working out very well for us.

2010/4/15 Ted Zlatanov <t...@lifelogs.com>:
> On Thu, 15 Apr 2010 11:27:47 +0200 Jean-Pierre Bergamin <ja...@ractive.ch> wrote:
>
> JB> On 14.04.2010 15:22, Ted Zlatanov wrote:
>>> On Wed, 14 Apr 2010 15:02:29 +0200 "Jean-Pierre Bergamin" <ja...@ractive.ch> wrote:
>>>
> JB> The metrics are stored together with a timestamp. The queries we want to
> JB> perform are:
> JB> * The last value of a specific metric of a device
> JB> * The values of a specific metric of a device between two timestamps t1
> JB>   and t2
>>>
>>> Make your key "devicename-metricname-YYYYMMDD-HHMM" (with whatever time
>>> sharding makes sense to you; I use UTC by-hours and by-day in my
>>> environment). Then your supercolumn is the collection time as a
>>> LongType and your columns inside the supercolumn can express the metric
>>> in detail (collector agent, detailed breakdown, etc.).
>>>
> JB> Just for my understanding. What is "time sharding"?
> JB> I couldn't find an explanation anywhere. Do you mean that the
> JB> time-series data is rolled up in 5 minutes, 1 hour, 1 day etc. slices?
>
> Yes. The usual meaning of "shard" in the RDBMS world is to segment your
> database by some criteria, e.g. US vs. Europe in Amazon AWS because
> their data centers are laid out so. I was taking a linguistic shortcut
> to mean "break down your rows by some convenient criteria." You can
> actually set up your Partitioner in Cassandra to literally shard your
> keyspace rows based on the key, but I just meant "slice" in my note.
>
> JB> So this would be defined as:
> JB> <ColumnFamily Name="measurements" ColumnType="Super"
> JB> CompareWith="UTF8Type" CompareSubcolumnsWith="LongType" />
>
> JB> So when I want to read all values of one metric between two timestamps
> JB> t0 and t1, I'd have to read the supercolumns that match a key range
> JB> (device1:metric1:t0 - device1:metric1:t1) and then all the
> JB> supercolumns for this key?
>
> Yes. This is a single multiget if you can construct the key range
> explicitly. Cassandra loads a lot of this in memory already and filters
> it after the fact; that's why it pays to slice your keys and to stitch
> them together on the client side if you have to go across a time
> boundary. You'll also get better key load balancing with deeper slicing
> if you use the randomizing partitioner.
>
> In the result set, you'll get each matching supercolumn with all the
> columns inside it. You may have to page through supercolumns.
>
> Ted
>
>

--
Dan Di Spaltro
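
A rough sketch of the client-side key slicing discussed in the quoted thread: it builds the list of time-sliced row keys that cover a range t0..t1, which is the explicit key list a single multiget would need. The function name `hourly_row_keys`, the hour-sized slices, and the exact key layout are assumptions for illustration only; nothing here calls a real Cassandra client.

```python
from datetime import datetime, timedelta, timezone


def hourly_row_keys(device, metric, t0, t1):
    """Return the hour-sliced row keys covering [t0, t1], oldest first.

    Keys follow the "devicename-metricname-YYYYMMDD-HHMM" pattern from
    the thread, with one row per UTC hour (an assumed slice size).
    t0 and t1 should be timezone-aware datetimes.
    """
    if t1 < t0:
        raise ValueError("t1 must not be earlier than t0")
    # Truncate the start down to the beginning of its hour slice.
    current = t0.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    end = t1.astimezone(timezone.utc)
    keys = []
    while current <= end:
        keys.append("%s-%s-%s" % (device, metric, current.strftime("%Y%m%d-%H%M")))
        current += timedelta(hours=1)
    return keys


if __name__ == "__main__":
    # A range that crosses a day boundary, so the keys span two dates.
    t0 = datetime(2010, 4, 15, 22, 40, tzinfo=timezone.utc)
    t1 = datetime(2010, 4, 16, 0, 10, tzinfo=timezone.utc)
    for key in hourly_row_keys("device1", "metric1", t0, t1):
        print(key)
```

With slices like these, the first and last rows still hold a full hour each, so the supercolumns they return would need to be trimmed to t0..t1 on the client.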