How many columns will you have? Until we fix https://issues.apache.org/jira/browse/HBASE-867, you are limited in the number of columns you can have.
St.Ack
On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross <[email protected]> wrote:

> Based on reading the HBase architecture wiki, I have changed my thinking due
> to the "Column Family Centric Storage."
>
> HBase stores column families physically close on disk, so the items in a
> given column family should have roughly the same read/write characteristics
> and contain similar data. Although at a conceptual level, tables may be
> viewed as a sparse set of rows, physically they are stored on a per-column
> family basis. This is an important consideration for schema and application
> designers to keep in mind.
>
> This leads me to the thought of keeping an entire time series inside a
> single column family.
>
> Options:
>
> Row key is a ticker symbol:
> - Hijack the timestamp to be the time of each observation. Use a column
>   family to hold all the data, and a column for each property of each
>   observation.
> - Don't hijack the timestamp, just ignore it. Use a column family for all
>   the data, an individual column for the date/time of the observation, and
>   individual columns for each property of each observation.
>
> Thoughts?
>
> On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross <[email protected]> wrote:
>
> > Greetings,
> >
> > I am prototyping a financial time series database on top of HBase and
> > trying to wrap my head around what a good design would look like.
> >
> > As I understand it, I have rows, column families, columns and cells.
> >
> > Since the only thing that HBase really "indexes" is row keys, it seems
> > natural in a way to represent the row keys as the date/time.
> >
> > As a simple example:
> >
> > Bar data:
> >
> > {
> >   "2009/1/17" : {
> >     "open":"100",
> >     "high":"102",
> >     "low":"99",
> >     "close":"101",
> >     "volume":"1000256"
> >   }
> > }
> >
> > Quote data:
> >
> > {
> >   "2009/1/17:11:23:04" : {
> >     "bid":"100.01",
> >     "ask":"100.02",
> >     "bidsize":"10000",
> >     "asksize":"100200"
> >   }
> > }
> >
> > But there are many other issues to think about.
> >
> > In financial time series data we have small amounts of data within each
> > "observation" and we can have lots of observations. We can have millions of
> > observations per time series (e.g. all historical trade and quote data for
> > a particular stock since 1993) across hundreds of thousands of individual
> > instruments (e.g. across all stocks that have traded since 1993).
> >
> > The write patterns fit HBase nicely, because it is a write-once-and-append
> > pattern. This is followed by loads of offline processes for simulating
> > trading models and such. These query patterns look like "all quotes for all
> > stocks between the dates of 1/1/1996 and 12/31/2008." So the querying is
> > typically across a date range, and we can further filter the query by
> > instrument types.
> >
> > So I am not sure what makes sense for efficiency because I do not
> > understand HBase well enough yet.
> >
> > What kinds of mixes of rows, column families, and columns should I be
> > thinking about?
> >
> > Does my simplistic approach make any sense? That would mean each row is a
> > key-value pair where the key is the date/time and the value is the
> > "observation." I suppose this leads to a "table per time series" model.
> > Does that make sense or is there overhead to having lots of tables?
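
A note on the date/time row keys in the quoted example: HBase orders rows by comparing row-key bytes lexicographically, so an unpadded format such as "2009/1/17" would not keep observations in chronological order, and a date-range query could not be served as one contiguous scan. The Python sketch below only illustrates that point; it is not HBase client code. The helpers row_key and range_scan are hypothetical, the fixed-width YYYYMMDDHHMMSS layout is an assumed choice rather than anything proposed in the thread, and a sorted list stands in for the table.

```python
from bisect import bisect_left
from datetime import datetime


def row_key(ts: datetime) -> bytes:
    """Hypothetical helper: fixed-width, zero-padded key so byte order equals time order."""
    return ts.strftime("%Y%m%d%H%M%S").encode("ascii")


def range_scan(sorted_keys, start: datetime, stop: datetime):
    """Hypothetical stand-in for a start-row/stop-row scan: keys in [start, stop)."""
    lo = bisect_left(sorted_keys, row_key(start))
    hi = bisect_left(sorted_keys, row_key(stop))
    return sorted_keys[lo:hi]


# Unpadded keys in the style of the example: byte order is not time order,
# because "/" sorts before "0" and "10" sorts before "2".
unpadded = sorted([b"2009/1/17", b"2009/10/1", b"2009/2/3"])

# Padded keys: sorting the bytes also sorts the observations by time.
observations = [
    datetime(2009, 1, 17, 11, 23, 4),
    datetime(2009, 10, 1, 9, 30, 0),
    datetime(2009, 2, 3, 16, 0, 0),
]
keys = sorted(row_key(t) for t in observations)

# A date-range query becomes one contiguous slice of the sorted keys.
first_two_months = range_scan(keys, datetime(2009, 1, 1), datetime(2009, 3, 1))
```

With the unpadded format, the October key sorts ahead of the February key; with the padded format, the January and February observations sit next to each other and fall out of a single range slice.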
