Ryan,

Wouldn't storing time series data in chronological order be sub-optimal for sequential scans and range queries? Let's say there is a large chunk of data (e.g. 10M rows) representing 1 hr of recordings, stored in multiple regions on a single node/regionserver. If we run a range query for that time period, we will not utilize the entire cluster; the query will be largely IO bound and limited by a single node's read throughput. I'm thinking of randomizing the input sequence order during insertion to improve access time.
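One common way to get this kind of spread without losing range scans entirely is to prefix each row key with a small "salt" bucket. The sketch below is only an illustration under assumptions not stated in this thread: a fixed-width patient id, millisecond timestamps, and a hypothetical NUM_BUCKETS of 16. The helper names are made up, and it only builds key bytes; it does not call any HBase API.

```python
import struct
import zlib

# Hypothetical bucket count; in practice this would be chosen relative to
# the number of region servers.
NUM_BUCKETS = 16


def salted_key(patient_id: bytes, ts_millis: int) -> bytes:
    """Build <1-byte salt><patient id><8-byte big-endian timestamp>.

    Assumes patient_id has a fixed width, so keys inside one bucket
    still sort chronologically per patient.
    """
    ts = struct.pack(">Q", ts_millis)
    # Deterministic salt derived from the key contents, so a reader can
    # recompute it for point lookups.
    salt = zlib.crc32(patient_id + ts) % NUM_BUCKETS
    return bytes([salt]) + patient_id + ts


def scan_ranges(patient_id: bytes, start_millis: int, stop_millis: int):
    """Return one (start_row, stop_row) pair per bucket; stop is exclusive.

    A time-range query becomes NUM_BUCKETS smaller scans that can run in
    parallel and be merged on the client.
    """
    lo = struct.pack(">Q", start_millis)
    hi = struct.pack(">Q", stop_millis)
    return [
        (bytes([b]) + patient_id + lo, bytes([b]) + patient_id + hi)
        for b in range(NUM_BUCKETS)
    ]
```

The trade-off is that every range query fans out into NUM_BUCKETS scans whose results have to be merged and re-sorted on the client, so this helps mainly when a single region server's read throughput is the bottleneck.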
thanks
Alex

On Sat, Apr 24, 2010 at 4:45 PM, Ryan Rawson <ryano...@gmail.com> wrote:

> Hey,
>
> So in my case, timestamp wasnt unique, so I had to put in event id.
> For timeseries systems, you of course wouldnt need to have an
> additional id. So your first thought where you have:
> <patient id><timestamp>
>
> then putting physiologic parameters in different columns (But the same
> column family) sounds great to me. This is a good example of where
> flexible schema is good, since you can store any number of parameters
> per row, but only the ones you want.
>
> As for HBase and multi-datacenter, there is work underway by my
> colleague JD to write a replication system. It's in the late stages
> and we are hoping to get it into advanced testing soon. Practically
> speaking you dont want to split your HDFS and HBase cluster across a
> datacenter.
>
> On Sat, Apr 24, 2010 at 1:36 PM, Andrew Nguyen
> <andrew-lists-hb...@ucsfcti.org> wrote:
> > Ryan,
> >
> > Extremely helpful, and definitely something to think about. My intuition
> > says the row-oriented approach is much better for us since there's a
> > (potentially) unbounded amount of data being fed into the system.
> >
> > In your eventId example, what was your main reason for not using eventId
> > as a column name? Is it a too large of a set? Or, were there other factors
> > affecting your decision?
> >
> > I'm asking because given your advice so far, I'm considering the
> > following for my key schema:
> >
> > <patient id><timestamp>
> >
> > And then having each physiologic parameter be a column. The set is
> > fairly small, right now there are about 40-70 parameters, though this may
> > increase. It also varies from patient to patient since they are not all
> > hooked up to the same machines.
> >
> > The alternative is to go what you have done with eventId and have the
> > following be my schema:
> >
> > <patient id><timestamp><signal id>
> >
> > So, I'm trying to figure out what questions I need to ask in order to
> > make the right decisions. I definitely think the row-oriented approach has
> > great benefit here, based on what I'm learning so far, mostly from the
> > scalability standpoint. One of the other things we're considering is
> > splitting the cluster across two datacenters (one in San Francisco and one
> > in San Diego) since there's really no feasible way to back up the amount of
> > data we're anticipating. I haven't looked into this much for HDFS either
> > and I'm not sure how this factors into the splitting for HBase.
> >
> > In terms of queries, most of our queries would probably be:
> >
> > All values for a subset of signals for a particular patient in a given
> > date range
> > All values for a subset of signals for a particular patient
> > All values for all signals for a particular patient in a given date range
> > All values for all signals for a particular patient
> >
> > These would probably be the most common though people may find new ways
> > to use the data.
> >
> > Thanks!
> >