If the row key is the date/time and the data arrives already sorted by date/time, then during a load or insert only one region (and therefore only one node) is receiving new data at any moment. Load performance will be poor.
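
One common way around this is to put something other than the time at the front of the key, so that sequential data lands in different regions. Below is a minimal sketch in Python, assuming a key layout of bucket|symbol|timestamp. The bucket count, the `|` separator, and the function names are all made up for illustration and are not part of any HBase API. Timestamps need to be zero-padded (e.g. 20090117112304) so that string order matches time order.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical value; would be tuned to the cluster size


def bucket_for(symbol: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a symbol to a stable bucket (md5 is used only to spread keys)."""
    digest = hashlib.md5(symbol.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


def make_row_key(symbol: str, timestamp: str) -> str:
    """Build '<bucket>|<symbol>|<timestamp>'.

    The timestamp must be zero-padded (YYYYMMDDHHMMSS) so that keys for one
    symbol sort in time order.
    """
    return f"{bucket_for(symbol):02d}|{symbol}|{timestamp}"


def scan_range(symbol: str, start: str, stop: str) -> tuple[str, str]:
    """Return (start_key, stop_key) covering one symbol's date range.

    The stop key is meant to be exclusive.
    """
    prefix = f"{bucket_for(symbol):02d}|{symbol}|"
    return prefix + start, prefix + stop


if __name__ == "__main__":
    print(make_row_key("IBM", "20090117112304"))
    print(scan_range("IBM", "20090101000000", "20090201000000"))
```

With this layout a date-range read for one instrument is a single scan between the two keys from scan_range. A query across all instruments needs one such scan per instrument, which is the price paid for spreading the writes.
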
On Wed, Apr 1, 2009 at 11:08 AM, zsongbo <[email protected]> wrote:

> If the rowkey is date/time and the data is originally sequential by
> date/time, then when loading/inserting data into the table, only one
> region (the one node) is active to receive new data. The load
> performance will be poor.
>
> On Wed, Apr 1, 2009 at 10:25 AM, Bradford Cross <[email protected]> wrote:
>
>> Greetings,
>>
>> I am prototyping a financial time series database on top of HBase and
>> trying to wrap my head around what a good design would look like.
>>
>> As I understand it, I have rows, column families, columns and cells.
>>
>> Since the only thing that HBase really "indexes" is row keys, it seems
>> natural in a way to represent the row keys as the date/time.
>>
>> As a simple example:
>>
>> Bar data:
>>
>> {
>>   "2009/1/17" : {
>>     "open":"100",
>>     "high":"102",
>>     "low":"99",
>>     "close":"101",
>>     "volume":"1000256"
>>   }
>> }
>>
>> Quote data:
>>
>> {
>>   "2009/1/17:11:23:04" : {
>>     "bid":"100.01",
>>     "ask":"100.02",
>>     "bidsize":"10000",
>>     "asksize":"100200"
>>   }
>> }
>>
>> But there are many other issues to think about.
>>
>> In financial time series data we have small amounts of data within each
>> "observation" and we can have lots of observations. We can have millions
>> of observations per time series (e.g. all historical trade and quote
>> data for a particular stock since 1993) across hundreds of thousands of
>> individual instruments (e.g. across all stocks that have traded since
>> 1993).
>>
>> The write patterns fit HBase nicely, because it is a write-once-and-append
>> pattern. This is followed by loads of offline processes for simulating
>> trading models and such. These query patterns look like "all quotes for
>> all stocks between the dates of 1/1/1996 and 12/31/2008." So the querying
>> is typically across a date range, and we can further filter the query by
>> instrument types.
>>
>> So I am not sure what makes sense for efficiency because I do not
>> understand HBase well enough yet.
>>
>> What kinds of mixes of rows, column families, and columns should I be
>> thinking about?
>>
>> Does my simplistic approach make any sense? That would mean each row is a
>> key-value pair where the key is the date/time and the value is the
>> "observation." I suppose this leads to a "table per time series" model.
>> Does that make sense or is there overhead to having lots of tables?
