Re: Modeling column families

alex kamil Sat, 24 Apr 2010 14:28:57 -0700

Ryan,

wouldn't storing time series data in chronological order be sub-optimal for
sequential scans and range queries?
Let's say there is a large chunk of data (e.g. 10M rows) representing 1 hr of
recordings, stored in multiple regions on a single node/regionserver.
If we then run a range query for that time period, we will not utilize the
entire cluster: the query will be largely IO bound and limited by a single
node's read throughput.
I'm thinking of randomizing the input sequence order during insertion to
improve access time.
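
One common way to get the effect described above while keeping range scans possible is to prefix each row key with a small hash-derived "salt" bucket. The sketch below is only an illustration under assumptions that are not in this thread: the `<salt><patient id><timestamp>` layout, the bucket count of 16, and the function names are all hypothetical, and it only builds key bytes in plain Python rather than calling any HBase API.

```python
import hashlib
import struct

NUM_BUCKETS = 16  # hypothetical; would usually be sized to the cluster


def salted_key(patient_id: str, ts_millis: int, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Build <salt><patient id><timestamp>.

    The timestamp is packed big-endian so byte order matches numeric order.
    Assumes patient ids are fixed-width or otherwise prefix-free.
    """
    body = patient_id.encode("utf-8") + struct.pack(">Q", ts_millis)
    # The salt is derived from the key itself, so it is deterministic,
    # but consecutive timestamps land in different buckets.
    salt = hashlib.md5(body).digest()[0] % num_buckets
    return bytes([salt]) + body


def scan_ranges(patient_id: str, start_millis: int, end_millis: int,
                num_buckets: int = NUM_BUCKETS) -> list:
    """Return one (start_key, stop_key) pair per bucket for [start, end)."""
    pid = patient_id.encode("utf-8")
    lo = pid + struct.pack(">Q", start_millis)
    hi = pid + struct.pack(">Q", end_millis)
    return [(bytes([b]) + lo, bytes([b]) + hi) for b in range(num_buckets)]
```

The trade-off is visible in `scan_ranges`: a time-range query is no longer one contiguous scan but one scan per bucket, whose results have to be merged by the client. Whether that is a net win depends on how often large ranges are read compared with how hot the most recent region gets.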


thanks
Alex

On Sat, Apr 24, 2010 at 4:45 PM, Ryan Rawson <ryano...@gmail.com> wrote:

> Hey,
>
> So in my case, timestamp wasn't unique, so I had to put in event id.
> For timeseries systems, you of course wouldn't need to have an
> additional id.  So your first thought where you have:
> <patient id><timestamp>
>
> then putting physiologic parameters in different columns (but the same
> column family) sounds great to me.  This is a good example of where a
> flexible schema is good, since you can store any number of parameters
> per row, but only the ones you want.
>
> As for HBase and multi-datacenter, there is work underway by my
> colleague JD to write a replication system.  It's in the late stages
> and we are hoping to get it into advanced testing soon.  Practically
> speaking you don't want to split your HDFS and HBase cluster across
> datacenters.
>
> On Sat, Apr 24, 2010 at 1:36 PM, Andrew Nguyen
> <andrew-lists-hb...@ucsfcti.org> wrote:
> > Ryan,
> >
> > Extremely helpful, and definitely something to think about.  My intuition
> > says the row-oriented approach is much better for us since there's a
> > (potentially) unbounded amount of data being fed into the system.
> >
> > In your eventId example, what was your main reason for not using eventId
> > as a column name?  Is it too large a set?  Or were there other factors
> > affecting your decision?
> >
> > I'm asking because given your advice so far, I'm considering the
> > following for my key schema:
> >
> > <patient id><timestamp>
> >
> > And then having each physiologic parameter be a column.  The set is
> > fairly small; right now there are about 40-70 parameters, though this may
> > increase.  It also varies from patient to patient since they are not all
> > hooked up to the same machines.
> >
> > The alternative is to go with what you have done with eventId and have the
> > following be my schema:
> >
> > <patient id><timestamp><signal id>
> >
> > So, I'm trying to figure out what questions I need to ask in order to
> > make the right decisions.  I definitely think the row-oriented approach has
> > great benefit here, based on what I'm learning so far, mostly from the
> > scalability standpoint.  One of the other things we're considering is
> > splitting the cluster across two datacenters (one in San Francisco and one
> > in San Diego) since there's really no feasible way to back up the amount of
> > data we're anticipating.  I haven't looked into this much for HDFS either
> > and I'm not sure how this factors into the splitting for HBase.
> >
> > In terms of queries, most of our queries would probably be:
> >
> > All values for a subset of signals for a particular patient in a given
> > date range
> > All values for a subset of signals for a particular patient
> > All values for all signals for a particular patient in a given date range
> > All values for all signals for a particular patient
> >
> > These would probably be the most common, though people may find new ways
> > to use the data.
> >
> > Thanks!
>
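
As a rough illustration of how the `<patient id><timestamp>` key discussed above could serve the four query shapes listed, the sketch below models a table as an in-memory dict and scans it in row-key order. It does not use any HBase API. The fixed patient-id width, the `row_key` and `scan` helpers, and the signal names in the usage note are hypothetical choices made for the example.

```python
import struct
from bisect import bisect_left

PATIENT_ID_WIDTH = 8      # hypothetical fixed width so keys sort by patient, then time
MAX_TS = 2**64 - 1        # upper bound used when no end time is given


def row_key(patient_id: str, ts_millis: int) -> bytes:
    """Build <patient id><timestamp> with a big-endian timestamp."""
    pid = patient_id.encode("utf-8")
    if len(pid) > PATIENT_ID_WIDTH:
        raise ValueError("patient id longer than the assumed fixed width")
    return pid.ljust(PATIENT_ID_WIDTH, b"\x00") + struct.pack(">Q", ts_millis)


def scan(table: dict, patient_id: str, start: int = 0, stop: int = MAX_TS,
         signals=None) -> list:
    """Return (timestamp, {signal: value}) for one patient in [start, stop).

    `table` maps row key -> {signal name: value}, one column per signal.
    `signals` optionally restricts the columns returned.
    """
    lo, hi = row_key(patient_id, start), row_key(patient_id, stop)
    keys = sorted(table)  # stands in for the store's sorted row order
    out = []
    for key in keys[bisect_left(keys, lo):bisect_left(keys, hi)]:
        row = table[key]
        if signals is not None:
            row = {s: v for s, v in row.items() if s in signals}
        if row:  # skip rows with none of the requested signals
            ts = struct.unpack(">Q", key[PATIENT_ID_WIDTH:])[0]
            out.append((ts, row))
    return out
```

With this layout, "all signals for a patient" is `scan(table, "p1")`, a date range adds `start` and `stop`, and a subset of signals adds something like `signals={"hr"}`. All four queries reduce to one contiguous key range plus an optional column filter, which is the main appeal of the one-column-per-parameter schema.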
