Ryan,

Extremely helpful, and definitely something to think about.  My intuition says 
the row-oriented approach is much better for us since there's a (potentially) 
unbounded amount of data being fed into the system.

In your eventId example, what was your main reason for not using eventId as a 
column name?  Is the set too large?  Or were there other factors affecting 
your decision?

I'm asking because given your advice so far, I'm considering the following for 
my key schema:

<patient id><timestamp>

Each physiologic parameter would then be a column.  The set is fairly small: 
right now there are about 40-70 parameters, though this may increase.  
It also varies from patient to patient since they are not all hooked up to the 
same machines.
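
To make this first option concrete, here is a rough Python sketch of how I 
imagine the key being built.  The fixed id width, the millisecond timestamp, 
and the names (wide_row_key, "hr", "spo2") are my own assumptions for 
illustration, not anything HBase requires:

```python
import struct

PATIENT_ID_WIDTH = 12  # assumed fixed width so keys group by patient, then time


def wide_row_key(patient_id: str, ts_ms: int) -> bytes:
    """Build <patient id><timestamp>: one row per patient per instant."""
    pid = patient_id.encode("ascii")
    if len(pid) > PATIENT_ID_WIDTH:
        raise ValueError("patient id longer than the assumed fixed width")
    # Pad the id to a fixed width and pack the timestamp big-endian, so that
    # comparing keys byte by byte orders them by patient and then by time.
    return pid.rjust(PATIENT_ID_WIDTH, b"0") + struct.pack(">Q", ts_ms)


# Each row holds whichever parameters this patient's machines reported at that
# instant; patients hooked up to fewer machines simply have fewer columns.
rows = {
    wide_row_key("p42", 1_300_000_000_000): {"hr": 72.0, "spo2": 0.98},
    wide_row_key("p42", 1_300_000_001_000): {"hr": 73.0},
}
```

The point of the big-endian timestamp is only that a patient's rows end up 
contiguous and in time order.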

The alternative is to go with what you have done with eventId and use the 
following as my schema:

<patient id><timestamp><signal id>
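
A similar sketch for this second option, again under my own assumptions (a 
fixed-width patient id, millisecond timestamps, and a hypothetical 16-bit 
numeric signal id):

```python
import struct

PATIENT_ID_WIDTH = 12  # assumed fixed width, as in the first layout


def tall_row_key(patient_id: str, ts_ms: int, signal_id: int) -> bytes:
    """Build <patient id><timestamp><signal id>: one row per single reading."""
    pid = patient_id.encode("ascii")
    if len(pid) > PATIENT_ID_WIDTH:
        raise ValueError("patient id longer than the assumed fixed width")
    # ">QH" packs an 8-byte timestamp followed by a 2-byte signal id, both
    # big-endian and unpadded, so keys order by patient, time, then signal.
    return pid.rjust(PATIENT_ID_WIDTH, b"0") + struct.pack(">QH", ts_ms, signal_id)


# Two signals sampled at the same instant become two adjacent rows.
keys = [
    tall_row_key("p42", 1_300_000_000_000, 7),
    tall_row_key("p42", 1_300_000_000_000, 3),
]
keys.sort()
```

One thing I notice from writing it out: because the timestamp comes before 
the signal id, the readings for a single signal are not contiguous, so 
reading a subset of signals over a date range would still walk every row in 
that range.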

So, I'm trying to figure out what questions I need to ask in order to make the 
right decisions.  I definitely think the row-oriented approach has great 
benefit here, based on what I'm learning so far, mostly from the scalability 
standpoint.  One of the other things we're considering is splitting the cluster 
across two datacenters (one in San Francisco and one in San Diego) since 
there's really no feasible way to back up the amount of data we're 
anticipating.  I haven't looked into this much for HDFS either and I'm not sure 
how this factors into the splitting for HBase.

In terms of queries, most of our queries would probably be:

- All values for a subset of signals for a particular patient in a given date 
  range
- All values for a subset of signals for a particular patient
- All values for all signals for a particular patient in a given date range
- All values for all signals for a particular patient

These would probably be the most common though people may find new ways to use 
the data.
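
As far as I can tell, with the <patient id><timestamp> layout all four of 
these reduce to a row range plus an optional column selection.  Below is a 
small in-memory sketch of that idea; the dict stands in for a table, and 
row_key, scan_range, and query are hypothetical helpers of mine rather than 
HBase API calls:

```python
import struct

PATIENT_ID_WIDTH = 12      # assumed fixed width for patient ids
MAX_TS = 2**64 - 1         # largest timestamp an 8-byte field can hold


def row_key(patient_id: str, ts_ms: int) -> bytes:
    pid = patient_id.encode("ascii").rjust(PATIENT_ID_WIDTH, b"0")
    return pid + struct.pack(">Q", ts_ms)


def scan_range(patient_id, start_ms=0, stop_ms=MAX_TS):
    """Return (start_row, stop_row); start is inclusive, stop is exclusive."""
    return row_key(patient_id, start_ms), row_key(patient_id, stop_ms)


def query(table, patient_id, signals=None, start_ms=0, stop_ms=MAX_TS):
    """Rows for one patient, optionally limited by date range and signal set."""
    start, stop = scan_range(patient_id, start_ms, stop_ms)
    results = []
    for key in sorted(table):          # a real store keeps keys sorted already
        if start <= key < stop:
            cols = table[key]
            if signals is not None:    # column selection for a signal subset
                cols = {name: v for name, v in cols.items() if name in signals}
            results.append((key, cols))
    return results


table = {
    row_key("p1", 100): {"hr": 70.0, "spo2": 0.97},
    row_key("p1", 200): {"hr": 71.0},
    row_key("p1", 300): {"hr": 72.0, "spo2": 0.99},
    row_key("p2", 150): {"hr": 64.0},
}

everything = query(table, "p1")                                # all signals, all time
windowed = query(table, "p1", start_ms=100, stop_ms=250)       # all signals, date range
hr_only = query(table, "p1", signals={"hr"})                   # subset, all time
hr_window = query(table, "p1", {"hr"}, start_ms=100, stop_ms=250)  # subset, date range
```

My understanding is that in HBase this would correspond to a scan bounded by 
start and stop rows with only the wanted columns requested, but please 
correct me if I have that wrong.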

Thanks!
