On Sat, Apr 24, 2010 at 12:22 AM, Andrey Stepachev <oct...@gmail.com> wrote:
> 2010/4/24 Andrew Nguyen <andrew-lists-hb...@ucsfcti.org>
>
>> Hello all,
>>
>> Each row key is of the form "PatientName-PhysiologicParameter" and each
>> column name is the timestamp of the reading.
>>
>
> With such a design in HBase (as opposed to Cassandra) you have to use row
> filters to fetch only part of the data (for example, the last year), or
> filter on the client side over a row scan.
> If the data series get big (>100 values) you will run into the intra-row
> scanning issue https://issues.apache.org/jira/browse/HBASE-1537, as I did.
> Another issue, as mentioned before, is scaling: HBase splits data by rows.
>
> You have to figure out how much data will end up in a row. If it runs to
> hundreds of values, use a compound key (patient-code-date). If rows stay
> small, (patient-code) may be easier to work with, because you can use Get
> operations with locks (if you need them); with a dated key you can't
> (because Scan doesn't yet honor locks).
This statement is happily obsolete - the 0.20.4 RC has new code that ensures
Gets and Scans never return partially updated rows. I dislike the term
'honor locks' because it implies an implementation strategy, and in this
case Gets (which are now 1-row scans) and Scans do not acquire locks to
accomplish their tasks. This is important because if you acquired a row lock
(which is exclusive) you could only have 1 read or write operation at a
time, whereas we really want 1 write operation and as many read operations
as needed.

I really like compound keys because they are a well-understood data modeling
problem. People sometimes freak out when they think about endlessly wide
rows, and having this data modeling abstraction really helps buffer the
transition from a relational DB to a non-relational datastore.

I think you can do it either way, but I prefer compound keys and tall tables
when the number of values per user is expected to be very big. For example,
if you are storing timeseries data for a monitoring system, you might want
to store one reading per row, since the number of points for a single system
might be arbitrarily large (think: 2+ years of data). If the expected data
set size per row is larger than what a single machine could conceivably
store, Cassandra would not work for you, since each row must be stored on a
single (or rather N) node(s).
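A minimal sketch of the compound-key idea (not from the thread; the patient/code names are illustrative, and a sorted `TreeMap` stands in for HBase's lexicographically ordered keyspace so no cluster is needed):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch: HBase stores rows sorted lexicographically by key. A compound
// key (patient-code-date) turns "all readings of one type for one patient
// in a date range" into a contiguous key range, i.e. a cheap scan.
public class CompoundKeySketch {
    // Zero-padded ISO dates sort lexicographically in time order.
    static String rowKey(String patient, String code, String isoDate) {
        return patient + "-" + code + "-" + isoDate;
    }

    public static void main(String[] args) {
        NavigableMap<String, Double> table = new TreeMap<>();
        table.put(rowKey("bob", "bp", "2010-01-15"), 120.0);
        table.put(rowKey("bob", "bp", "2010-03-02"), 118.0);
        table.put(rowKey("bob", "icp", "2010-02-10"), 11.0);
        table.put(rowKey("alice", "bp", "2010-01-20"), 110.0);

        // "All blood pressures for bob between two dates" becomes a
        // start/stop key pair, analogous to Scan(startRow, stopRow).
        NavigableMap<String, Double> slice =
            table.subMap(rowKey("bob", "bp", "2010-01-01"), true,
                         rowKey("bob", "bp", "2010-02-28"), true);
        System.out.println(slice.size()); // prints 1: only the 2010-01-15 reading
    }
}
```

The same range on a wide-row schema (patient-code as the key, dates as column names) would instead need a column/timestamp filter inside one row.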