On Sat, Apr 24, 2010 at 12:22 AM, Andrey Stepachev <oct...@gmail.com> wrote:
> 2010/4/24 Andrew Nguyen <andrew-lists-hb...@ucsfcti.org>
>
>> Hello all,
>>
>> Each row key is of the form "PatientName-PhysiologicParameter" and each
>> column name is the timestamp of the reading.
>>
>
> With such a design in HBase (as opposed to Cassandra) you have to use row
> filters to fetch only part of the data (for example, the last year), or
> filter on the client side over a row scan.
> If the data series get big (>100 values) you will run into the intra-row
> scanning issue https://issues.apache.org/jira/browse/HBASE-1537, as I did.
> Another issue, as mentioned before, is scaling: HBase splits data by rows.
>
> You have to figure out how much data will end up in a row. If it runs to
> hundreds of values, use a compound key (patient-code-date). If rows stay
> small, (patient-code) may be easier to work with, because you can use Get
> operations with locks (if you need them); with a dated key you can't
> (because Scan doesn't yet honor locks).
This statement is happily obsolete - the 0.20.4 RC has new code that ensures
Gets and Scans never return partially updated rows. I dislike the term
'honor locks' because it implies an implementation strategy, and in this
case Gets (which are now 1-row scans) and Scans do not acquire locks to
accomplish their tasks. This is important because if you acquired a row lock
(which is exclusive) you could only have 1 read or write operation at a
time, whereas we really want 1 write operation and as many read operations
as needed.

I really like compound keys because they are a well-understood data modeling
problem. People sometimes freak out when they think about endlessly wide
rows, and having this data modeling abstraction really helps buffer the
transition from a relational DB to a non-relational datastore.

I think you can do it either way, but I prefer compound keys and tall tables
when the number of values per user is expected to be very big. For example,
if you are storing timeseries data for a monitoring system, you might want
to store one reading per row, since the number of points for a single system
might be arbitrarily large (think: 2+ years of data). If the expected data
set size per row is larger than what a single machine could conceivably
store, Cassandra would not work for you, since each row must be stored on a
single (or rather N) node(s).
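A minimal sketch of the compound-key idea (not from the thread; the patient/code names are illustrative, and a sorted `TreeMap` stands in for HBase's lexicographically ordered keyspace so no cluster is needed):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch: HBase stores rows sorted lexicographically by key. A compound
// key (patient-code-date) turns "all readings of one type for one patient
// in a date range" into a contiguous key range, i.e. a cheap scan.
public class CompoundKeySketch {
    // Zero-padded ISO dates sort lexicographically in time order.
    static String rowKey(String patient, String code, String isoDate) {
        return patient + "-" + code + "-" + isoDate;
    }

    public static void main(String[] args) {
        NavigableMap<String, Double> table = new TreeMap<>();
        table.put(rowKey("bob", "bp", "2010-01-15"), 120.0);
        table.put(rowKey("bob", "bp", "2010-03-02"), 118.0);
        table.put(rowKey("bob", "icp", "2010-02-10"), 11.0);
        table.put(rowKey("alice", "bp", "2010-01-20"), 110.0);

        // "All blood pressures for bob between two dates" becomes a
        // start/stop key pair, analogous to Scan(startRow, stopRow).
        NavigableMap<String, Double> slice =
            table.subMap(rowKey("bob", "bp", "2010-01-01"), true,
                         rowKey("bob", "bp", "2010-02-28"), true);
        System.out.println(slice.size()); // prints 1: only the 2010-01-15 reading
    }
}
```

The same range on a wide-row schema (patient-code as the key, dates as column names) would instead need a column/timestamp filter inside one row.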