I like to model this kind of data as columns, where the timestamps are
the column names (longs, TimeUUIDs, or strings depending on your
usage). If you have too much data for a single row, you'll need to
split it across multiple rows. For time-series data, it makes sense to
use one row per minute/hour/day/year depending on the volume of your
data.

Something like the following:

SomeTimeData: { // columnfamily
  "20100601": { // key, yyyymmdd
    1275350400000: "value1", // column name is milliseconds since epoch
    1275350500000: "value2"
  },
  "20100602": {
    1275436800000: "value3"
  }
}
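
As a rough sketch of writing into that layout, assuming a pycassa-style
Thrift client (the keyspace name, host, and a LongType column comparator
are placeholders, not anything from your schema):

import calendar
from datetime import datetime
import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])  # assumed keyspace/host
time_data = pycassa.ColumnFamily(pool, 'SomeTimeData')

def row_key(ts):
    # Bucket a datetime into a yyyymmdd row key.
    return ts.strftime('%Y%m%d')

def millis(ts):
    # Column name: milliseconds since epoch (assumes a LongType comparator).
    return calendar.timegm(ts.utctimetuple()) * 1000

def record(ts, value):
    # One column per data point, named by its timestamp, in that day's row.
    time_data.insert(row_key(ts), {millis(ts): value})

record(datetime(2010, 6, 1, 12, 30), 'value1')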

Now you can use column slices to retrieve all values between two
timestamps within a given day. If you need to support larger ranges
you'll either have to slice columns from multiple keys or change the
keys from yyyymmdd to yyyymm, yyyy, etc. There's a tradeoff here
between row width and read speed: reading 1000 columns as a contiguous
slice from a single row will be very fast, but reading 1000 columns as
slices from 10 keys won't be.
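
A sketch of those reads, again assuming a pycassa-style client and the
time_data column family from the snippet above; get() with
column_start/column_finish maps to a column slice, and multiget() pulls
the same slice from several day rows:

def read_range(day_key, start_ms, end_ms, limit=1000):
    # All columns between two timestamps within one day's row.
    return time_data.get(day_key, column_start=start_ms,
                         column_finish=end_ms, column_count=limit)

def read_days(day_keys, start_ms, end_ms):
    # Larger ranges: slice the same window out of several day rows
    # (one multiget instead of N gets, but still slower than one wide row).
    return time_data.multiget(day_keys, column_start=start_ms,
                              column_finish=end_ms)

# e.g. everything between 10:00 and 11:00 UTC on 2010-06-01
read_range('20100601', 1275386400000, 1275390000000)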

Ben

On Wed, Jun 2, 2010 at 11:32 AM, David Boxenhorn <da...@lookin2.com> wrote:
> How do I handle giant sets of ordered data, e.g. by timestamps, which I want
> to access by range?
>
> I can't put all the data into a supercolumn, because it's loaded into memory
> at once, and it's too much data.
>
> Am I forced to use an order-preserving partitioner? I don't want the
> headache. Is there any other way?
>
