Greetings,

I am prototyping a financial time series database on top of HBase and trying
to wrap my head around what a good design would look like.

As I understand it, I have rows, column families, columns and cells.

Since the only thing that HBase really "indexes" is row keys, it seems
natural to use the date/time as the row key.

As a simple example:

Bar data:

{
   "2009/1/17" : {
     "open":"100",
     "high":"102",
     "low":"99",
     "close":"101",
     "volume":"1000256"
   }
}


Quote data:

{
   "2009/1/17:11:23:04" : {
     "bid":"100.01",
     "ask":"100.02",
     "bidsize":"10000",
     "asksize":"100200"
   }
}
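
One detail worth sketching here: HBase keeps rows sorted by the raw bytes of
the row key, so a date key is only useful for range queries if its byte order
matches time order. The snippet below is a minimal sketch in plain Python (no
HBase client involved); the `row_key` helper and its fixed-width
`YYYYMMDDHHMMSS` layout are my own assumptions, not anything HBase prescribes.

```python
from datetime import datetime

def row_key(ts: datetime) -> bytes:
    """Hypothetical helper: encode a timestamp as a fixed-width,
    zero-padded key so that byte order equals chronological order."""
    return ts.strftime("%Y%m%d%H%M%S").encode("ascii")

# Unpadded keys in the style of "2009/1/17" do not sort chronologically:
# October lands before February because "/" sorts before "0".
unpadded = ["2009/1/17", "2009/10/2", "2009/2/3"]
unpadded_sorted = sorted(unpadded)

# Zero-padded keys sort the same way the dates do.
padded = [
    row_key(datetime(2009, 1, 17)),
    row_key(datetime(2009, 10, 2)),
    row_key(datetime(2009, 2, 3)),
]
padded_sorted = sorted(padded)

print(unpadded_sorted)
print(padded_sorted)
```

If that is right, the keys in my examples above would need zero-padding (or a
binary timestamp) before they could serve as row keys.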

But there are many other issues to think about.

In financial time series data we have small amounts of data within each
"observation" and we can have lots of observations.  We can have millions of
observations per time series (e.g. all historical trade and quote data for
a particular stock since 1993) across hundreds of thousands of individual
instruments (e.g. all stocks that have traded since 1993).

The write patterns fit HBase nicely, because it is a write once and append
pattern.  This is followed by loads of offline processes for simulating
trading models and such.  These query patterns look like "all quotes for all
stocks between the dates of 1/1/1996 and 12/31/2008."  So the querying is
typically across a date range, and we can further filter the query by
instrument types.
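
To make that query pattern concrete, here is a rough sketch of how I picture
it: a range scan over sorted row keys with an optional filter. It is plain
Python over an in-memory list, not HBase client code. The composite key layout
(timestamp, then "|", then symbol), the `scan` function, and the sample quotes
are all made up for illustration, and I am assuming an inclusive start key and
an exclusive stop key.

```python
from bisect import bisect_left

# Made-up sample rows, kept sorted by key the way HBase stores them.
# Hypothetical key layout: fixed-width timestamp + "|" + symbol.
rows = sorted(
    [
        (b"19951231160000|IBM", {"bid": "90.10", "ask": "90.12"}),
        (b"19960102093000|IBM", {"bid": "91.00", "ask": "91.02"}),
        (b"19960102093000|MSFT", {"bid": "22.50", "ask": "22.52"}),
        (b"20081231155959|IBM", {"bid": "84.10", "ask": "84.12"}),
        (b"20090102093000|IBM", {"bid": "85.00", "ask": "85.03"}),
    ],
    key=lambda row: row[0],
)
keys = [key for key, _ in rows]

def scan(start: bytes, stop: bytes, symbols=None):
    """Yield rows with start <= key < stop, optionally keeping only
    the given symbols (the filter is applied after the range lookup)."""
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, stop)
    for key, value in rows[lo:hi]:
        symbol = key.split(b"|", 1)[1].decode("ascii")
        if symbols is None or symbol in symbols:
            yield key, value

# "All quotes for all stocks between 1/1/1996 and 12/31/2008."
all_quotes = list(scan(b"19960101", b"20090101"))

# The same date range, filtered down to one instrument.
ibm_only = list(scan(b"19960101", b"20090101", symbols={"IBM"}))
```

With a timestamp-first key like this, the date range is one contiguous slice
but the instrument filter has to touch every row in it; a table per time
series would turn that around.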

So I am not sure what makes sense for efficiency because I do not understand
HBase well enough yet.

What kinds of mixes of rows, column families, and columns should I be
thinking about?

Does my simplistic approach make any sense?  That would mean each row is a
key-value pair where the key is the date/time and the value is the
"observation."  I suppose this leads to a "table per time series" model.
Does that make sense or is there overhead to having lots of tables?
