Given that your current schema has ~18 small columns per row, adding a
level with supercolumns may make sense for you: the limitation of
having to deserialize a whole supercolumn at once isn't going to be a
problem when the subcolumns are that small.

20K supercolumns per row with ~18 small subcolumns each is completely
reasonable. The (super)columns within each row are kept ordered by the
comparator regardless of the partitioner, so you can still do
time-ordered slices inside a row while using the
much-easier-to-administer RandomPartitioner.
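
To make that layout concrete, here is a rough sketch of what one row
could look like, using plain Python dicts to stand in for the Thrift
structures. All of the names in it (column family settings, entity key,
field names) are illustrative guesses, not anything from your actual
schema:

    # Sketch only: one row per entity, one supercolumn per reading keyed by
    # timestamp, ~18 small subcolumns inside each supercolumn. Assumes a
    # column family defined with ColumnType="Super" and a time-based
    # comparator so supercolumns sort chronologically within the row.

    row_key = "entity-00042"        # illustrative entity ID

    row = {
        1273017600000: {            # supercolumn name: reading timestamp (ms)
            "lat": "30.2672",
            "lon": "-97.7431",
            "status": "ok",
            # ... remaining small fields ...
        },
        1273017660000: {
            "lat": "30.2675",
            "lon": "-97.7433",
            "status": "ok",
        },
    }

    # Ordering inside a row comes from the column comparator, not the
    # partitioner, which is why RandomPartitioner still gives you
    # time-ordered slices within an entity's row.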

On 2010-05-05 11:22, Denis Haskin wrote:
> David -- thanks for the thoughts.
> 
> In re: your question
>> Does the random partitioner support what you need?
> 
> I guess my answer is "I'm not sure yet." My initial thought was that
> we'd use the (or an) OrderPreservingPartitioner so that we could use
> range scans and so that rows for a given entity would be co-located
> (if I'm understanding Cassandra's storage architecture properly). But
> that may be a naive approach.
> 
> In our core data set, we have maybe 20,000 entities about which we are
> storing time-series data (and it's fairly well distributed across these
> entities). It occurs to me it's also possible to store an entity per
> row, with the time-series data as (or in?) supercolumns (and maybe it
> would make sense to break those out into column families by date
> range). I'd have to think through a little more what that might mean
> for our secondary indexing needs.
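
One way to keep rows bounded along those lines is to bucket the row key
by time period instead of (or in addition to) splitting column
families. A minimal sketch, with the key format and the month
granularity purely as assumptions:

    import time

    # Hypothetical row-key bucketing: one row per entity per month, so a row
    # holds at most ~30 days' worth of supercolumns.
    def bucketed_row_key(entity_id, epoch_millis, fmt="%Y%m"):
        """Build a key like 'entity-00042:201005' (format is illustrative)."""
        period = time.strftime(fmt, time.gmtime(epoch_millis / 1000.0))
        return "%s:%s" % (entity_id, period)

    # A time-range read then slices supercolumns within each of the few
    # monthly rows the range touches.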
> 
> Thanks,
> 
> dwh
> 
> 
> 
> On Wed, May 5, 2010 at 1:16 AM, David Strauss <da...@fourkitchens.com> wrote:
>> On 2010-05-05 04:50, Denis Haskin wrote:
>>> I've been reading everything I can get my hands on about Cassandra and
>>> it sounds like a possibly very good framework for our data needs; I'm
>>> about to take the plunge and do some prototyping, but I thought I'd
>>> see if I can get a reality check here on whether it makes sense.
>>>
>>> Our schema should be fairly simple; we may only keep our original data
>>> in Cassandra, and the rollups and analyzed results in a relational db
>>> (although this is still open for discussion).
>>
>> This is what we do on some projects. It's a particularly nice
>> strategy if the raw-to-aggregated ratio is really high or the raw data
>> is bursty or highly volatile.
>>
>> Consider Hadoop integration for your aggregation needs.
>>
>>> We have fairly small records: 120-150 bytes, in maybe 18 columns.
>>> Data is additive only; we would rarely, if ever, be deleting data.
>>
>> Cassandra loves you.
>>
>>> Our core data set will accumulate at somewhere between 14 and 27
>>> million rows per day; we'll be starting with about a year and a half
>>> of data (7.5 - 15 billion rows) and eventually would like to keep 5
>>> years online (25 to 50 billion rows).  (So that's maybe 1.3TB or so
>>> per year, data only.  Not sure about the overhead yet.)
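
(A quick back-of-envelope check on those figures, ignoring replication,
indexes, and per-column storage overhead; the inputs are just the
numbers from the paragraph above:)

    # Rough sizing sketch using the figures quoted above.
    bytes_per_row = 150
    for rows_per_day in (14e6, 27e6):
        tb_per_year = rows_per_day * bytes_per_row * 365 / 1e12
        print("%.0fM rows/day -> %.2f TB/year raw" % (rows_per_day / 1e6, tb_per_year))
    # Prints roughly 0.77 and 1.48 TB/year, so ~1.3 TB of raw data is in the
    # right ballpark before Cassandra's own overhead is added.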
>>>
>>> Ideally we'd like to also have a cluster with our complete data set,
>>> which is maybe 38 billion rows per year (we could live with less than
>>> 5 years of that).
>>>
>>> I haven't really thought through what the schema's going to be; our
>>> primary key is an entity's ID plus a timestamp.  But there are two or
>>> three other retrieval paths we'll need to support as well.
>>
>> Generally, you do multiple retrieval paths through denormalization in
>> Cassandra.
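
For what that looks like in practice, here is a minimal sketch; the
second column family, its key format, and the helper name are
illustrative assumptions rather than anything from an existing schema:

    # Hypothetical denormalized write: the same reading goes into two column
    # families, one per retrieval path, instead of relying on joins or
    # secondary indexes.
    def store_reading(batch, entity_id, epoch_millis, region, fields):
        # Path 1: time series per entity (row = entity, supercolumn = time).
        batch.setdefault("ReadingsByEntity", {}) \
             .setdefault(entity_id, {})[epoch_millis] = fields
        # Path 2: readings per region per day (row = "region:day").
        day_key = "%s:%d" % (region, epoch_millis // 86400000)
        batch.setdefault("ReadingsByRegionDay", {}) \
             .setdefault(day_key, {})[epoch_millis] = fields

    # 'batch' is just a nested dict standing in for a client batch mutation;
    # the point is that each extra retrieval path becomes an extra write.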
>>
>>> Thoughts?  Pitfalls?  Gotchas? Are we completely whacked?
>>
>> Does the random partitioner support what you need?
>>
>> --
>> David Strauss
>>   | da...@fourkitchens.com
>> Four Kitchens
>>   | http://fourkitchens.com
>>   | +1 512 454 6659 [office]
>>   | +1 512 870 8453 [direct]
>>
>>


-- 
David Strauss
   | da...@fourkitchens.com
   | +1 512 577 5827 [mobile]
Four Kitchens
   | http://fourkitchens.com
   | +1 512 454 6659 [office]
   | +1 512 870 8453 [direct]
