Hey,

In my case the timestamp wasn't unique, so I had to add an event id to the key. For time-series systems you of course wouldn't need an additional id. So your first thought, where you have:

<patient id><timestamp>

and then put the physiologic parameters in different columns (but in the same column family), sounds great to me. This is a good example of where a flexible schema helps, since each row can store any number of parameters, but only the ones you want.

As for HBase and multi-datacenter, my colleague JD is working on a replication system. It's in the late stages and we are hoping to get it into advanced testing soon. Practically speaking, you don't want to split a single HDFS and HBase cluster across datacenters.

On Sat, Apr 24, 2010 at 1:36 PM, Andrew Nguyen <andrew-lists-hb...@ucsfcti.org> wrote:

> Ryan,
>
> Extremely helpful, and definitely something to think about. My intuition
> says the row-oriented approach is much better for us since there's a
> (potentially) unbounded amount of data being fed into the system.
>
> In your eventId example, what was your main reason for not using eventId as a
> column name? Is it too large of a set? Or, were there other factors
> affecting your decision?
>
> I'm asking because given your advice so far, I'm considering the following
> for my key schema:
>
> <patient id><timestamp>
>
> And then having each physiologic parameter be a column. The set is fairly
> small, right now there are about 40-70 parameters, though this may increase.
> It also varies from patient to patient since they are not all hooked up to
> the same machines.
>
> The alternative is to go with what you have done with eventId and have the
> following be my schema:
>
> <patient id><timestamp><signal id>
>
> So, I'm trying to figure out what questions I need to ask in order to make
> the right decisions. I definitely think the row-oriented approach has great
> benefit here, based on what I'm learning so far, mostly from the scalability
> standpoint.
> One of the other things we're considering is splitting the
> cluster across two datacenters (one in San Francisco and one in San Diego)
> since there's really no feasible way to back up the amount of data we're
> anticipating. I haven't looked into this much for HDFS either and I'm not
> sure how this factors into the splitting for HBase.
>
> In terms of queries, most of our queries would probably be:
>
> All values for a subset of signals for a particular patient in a given date range
> All values for a subset of signals for a particular patient
> All values for all signals for a particular patient in a given date range
> All values for all signals for a particular patient
>
> These would probably be the most common though people may find new ways to
> use the data.
>
> Thanks!
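
To make the <patient id><timestamp> key and the four query shapes above concrete, here is a minimal sketch in plain Python. It is not code from this thread and does not call HBase. The fixed-width big-endian encoding, the millisecond timestamps, the function names, and the column names "hr" and "spo2" are all assumptions made for illustration. It relies only on the fact that HBase keeps rows sorted by the raw bytes of the row key, so a patient's rows sit next to each other in time order and each query becomes a contiguous key range plus an optional column filter.

```python
import struct
from typing import Dict, List, Optional, Set, Tuple

# Assumed layout (not from the thread): 8-byte unsigned big-endian patient id
# followed by 8-byte unsigned big-endian timestamp in milliseconds.
# Big-endian fixed-width fields make byte order match numeric order.
KEY_FORMAT = ">QQ"


def make_row_key(patient_id: int, timestamp_ms: int) -> bytes:
    """Build the <patient id><timestamp> row key as 16 raw bytes."""
    return struct.pack(KEY_FORMAT, patient_id, timestamp_ms)


def parse_row_key(key: bytes) -> Tuple[int, int]:
    """Split a row key back into (patient_id, timestamp_ms)."""
    return struct.unpack(KEY_FORMAT, key)


def scan_bounds(patient_id: int, start_ms: int = 0,
                stop_ms: Optional[int] = None) -> Tuple[bytes, bytes]:
    """Return (start_row, stop_row) for one patient; stop_row is exclusive."""
    start = make_row_key(patient_id, start_ms)
    if stop_ms is None:
        # No upper date: stop at the first possible key of the next patient.
        # (Would overflow for the largest 64-bit id; ignored in this sketch.)
        stop = make_row_key(patient_id + 1, 0)
    else:
        stop = make_row_key(patient_id, stop_ms)
    return start, stop


def select(rows: Dict[bytes, Dict[str, float]], patient_id: int,
           signals: Optional[Set[str]] = None, start_ms: int = 0,
           stop_ms: Optional[int] = None
           ) -> List[Tuple[Tuple[int, int], Dict[str, float]]]:
    """Toy in-memory stand-in for a range scan with a column filter."""
    start, stop = scan_bounds(patient_id, start_ms, stop_ms)
    result = []
    for key in sorted(rows):  # mimic sorted row-key order
        if start <= key < stop:
            cols = rows[key]
            if signals is not None:
                # Keep only the requested signals; rows are sparse, so a
                # signal may simply be absent from a given row.
                cols = {name: v for name, v in cols.items() if name in signals}
            result.append((parse_row_key(key), cols))
    return result


# Hypothetical sample data: one row per (patient, timestamp), one column
# per physiologic parameter, with different columns present per row.
rows = {
    make_row_key(7, 1000): {"hr": 72.0, "spo2": 98.0},
    make_row_key(7, 2000): {"hr": 75.0},
    make_row_key(8, 500): {"hr": 60.0},
}
```

With this layout, select(rows, 7) covers "all signals for a patient", adding start_ms and stop_ms covers the date-range variants, and passing signals={"hr"} covers the "subset of signals" variants. In a real HBase client the same idea would map to a scan with start and stop rows restricted to specific columns, but the exact calls depend on the client and version in use.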