Hi,

I have an application that consists of multiple (possibly thousands of)
measurement series, and each measurement series generates a small amount of
data output (only about 500 bytes) every 10 seconds. This time series of
data should be stored in Cassandra in such a way that read access is
possible for a given time range.

What I do today is:
   - assign a timeuuid to each data output
   - write to two CFs:
         - first CF: key = measurement series ID, column name =
           timeuuid_of_output
         - second CF: key = timeuuid_of_output, column value = data
           output (~500 bytes)

When someone requests a time range of data, I read the first CF to get a
series of timeuuids, and then do a row multiget on the second CF.
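
To make the current read path concrete, here is a minimal in-memory sketch
in Python. Plain dicts stand in for the two CFs, integer timestamps stand in
for the time component of the timeuuid, and all names (index_cf, data_cf,
write_output, read_range) are hypothetical; no Cassandra client is involved.

```python
import uuid

# Hypothetical stand-ins for the two CFs (plain dicts, not a Cassandra client).
index_cf = {}  # key = measurement series ID -> list of (timestamp, output id)
data_cf = {}   # key = output id -> data output (~500 bytes)


def write_output(series_id, timestamp, payload):
    """Store one data output under a fresh UUID and index it by series."""
    output_id = uuid.uuid1()  # stand-in for the timeuuid of the output
    index_cf.setdefault(series_id, []).append((timestamp, output_id))
    data_cf[output_id] = payload
    return output_id


def read_range(series_id, start, end):
    """Read the index row, then do one keyed lookup per matching output."""
    ids = [oid for ts, oid in index_cf.get(series_id, []) if start <= ts <= end]
    results = []
    lookups = 0
    for oid in ids:  # this loop models the row multiget on the second CF
        results.append(data_cf[oid])
        lookups += 1
    return results, lookups


# Six outputs, 10 seconds apart, at timestamps 0, 10, ..., 50.
for i in range(6):
    write_output("series-1", i * 10, b"x" * 500)

rows, lookups = read_range("series-1", 10, 40)
```

The point of the sketch is only that the number of keyed lookups grows
linearly with the number of outputs in the requested range.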

This works well, but it tends to be slow for large series of data (let's
say for 10 days, nearly 100,000 records will be requested from the second
CF). This load of 100,000 reads is distributed across the cluster (because
the second CF scales very nicely with a RandomPartitioner), but one still
ends up with roughly 100,000 individual read requests; at least that's what
I suspect.
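
For scale, a quick check of these figures for a single series, assuming
exactly one output every 10 seconds and about 500 bytes per output (the
constant names are just for illustration):

```python
INTERVAL_S = 10         # one data output every 10 seconds
BYTES_PER_OUTPUT = 500  # approximate size of one data output
DAYS = 10

# Number of outputs one series produces in the requested time range.
records = DAYS * 24 * 60 * 60 // INTERVAL_S
# Approximate payload volume for that range.
total_bytes = records * BYTES_PER_OUTPUT

print(records)      # 86400
print(total_bytes)  # 43200000
```

So a 10-day request for one series is about 86,400 records, or roughly
43 MB of payload.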

Can anyone say whether there is a better data model for this type of query?
Would it be a reasonable improvement to put all the data into a single CF,
like this:

   - single CF, key = measurement series ID, column name =
     timeuuid_of_output, column value = data output

When I request a series of 100,000 columns from this row (now it's a single
row), can the performance really be better? Is there any chance that
Cassandra will be able to read this data "en bloc" from the hard drive?
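
Here is a minimal in-memory sketch in Python of the single-CF layout I have
in mind. It assumes the columns of a row are kept sorted by column name, so
a time range becomes one contiguous slice. Integer timestamps stand in for
the timeuuid column names, and all names (wide_cf, write_wide,
read_wide_range) are hypothetical. It only illustrates the access pattern
and says nothing about what Cassandra actually does on disk.

```python
import bisect

# Hypothetical stand-in for the single CF: one row per measurement series,
# holding parallel lists of sorted column names and their values.
wide_cf = {}  # key = measurement series ID -> (column names, column values)


def write_wide(series_id, timestamp, payload):
    """Insert one column, keeping the column names in sorted order."""
    names, values = wide_cf.setdefault(series_id, ([], []))
    pos = bisect.bisect_left(names, timestamp)
    names.insert(pos, timestamp)
    values.insert(pos, payload)


def read_wide_range(series_id, start, end):
    """One row lookup, then one contiguous slice of the sorted columns."""
    names, values = wide_cf.get(series_id, ([], []))
    lo = bisect.bisect_left(names, start)   # first column >= start
    hi = bisect.bisect_right(names, end)    # one past the last column <= end
    return values[lo:hi]


# Six outputs, 10 seconds apart, at timestamps 0, 10, ..., 50.
for i in range(6):
    write_wide("series-1", i * 10, b"x" * 500)

wide_rows = read_wide_range("series-1", 10, 40)
```

In this model the request costs one row lookup plus one slice, instead of
one keyed lookup per output.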

Any advice is appreciated...

Greetings,
Roland
