I should also state that, apart from the HBase inadequacy mentioned below, your schema looks good (HBase should be able to carry this schema type w/o sweat -- hopefully in 0.20.0). St.Ack
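
A rough sketch of the layout being endorsed here -- ticker symbol as row key, a single column family with one column per property, and the observation time "hijacked" into the cell timestamp, as proposed further down the thread -- written against the current HBase Java client rather than the 0.19/0.20-era API under discussion; the table name "quotes" and the family name "d" are illustrative only:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class QuoteWriter {

  // Single column family holding the whole series; the family name "d"
  // and the table name "quotes" are made up for this sketch.
  private static final byte[] FAMILY = Bytes.toBytes("d");

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("quotes"))) {

      // Row key is the ticker symbol; the observation time becomes the
      // cell timestamp; one column per property, values stored as strings
      // to mirror the JSON example in the thread.
      long observationTime = 1232191384000L; // 2009/1/17 11:23:04 UTC, in millis
      Put put = new Put(Bytes.toBytes("GOOG"));
      put.addColumn(FAMILY, Bytes.toBytes("bid"),     observationTime, Bytes.toBytes("100.01"));
      put.addColumn(FAMILY, Bytes.toBytes("ask"),     observationTime, Bytes.toBytes("100.02"));
      put.addColumn(FAMILY, Bytes.toBytes("bidsize"), observationTime, Bytes.toBytes("10000"));
      put.addColumn(FAMILY, Bytes.toBytes("asksize"), observationTime, Bytes.toBytes("100200"));
      table.put(put);
    }
  }
}

Note that keeping a whole series in cell versions like this only works if the column family's VERSIONS setting is raised far above the default (which retains only a handful of versions per cell); otherwise older observations are dropped at flush/compaction time.
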
On Thu, Apr 2, 2009 at 9:12 AM, stack <[email protected]> wrote:

> How many columns will you have? Until we fix
> https://issues.apache.org/jira/browse/HBASE-867, you are limited as
> regards the number of columns you can have.
> St.Ack
>
>
> On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross <[email protected]> wrote:
>
>> Based on reading the HBase architecture wiki, I have changed my
>> thinking due to the "Column Family Centric Storage."
>>
>> HBase stores column families physically close on disk, so the items in
>> a given column family should have roughly the same read/write
>> characteristics and contain similar data. Although at a conceptual
>> level, tables may be viewed as a sparse set of rows, physically they
>> are stored on a per-column-family basis. This is an important
>> consideration for schema and application designers to keep in mind.
>>
>> This leads me to the thought of keeping an entire time series inside a
>> single column family.
>>
>> Options:
>>
>> Row key is a ticker symbol:
>> - Hijack the timestamp to be the time of each observation. Use a
>> column family to hold all the data, and a column for each property of
>> each observation.
>> - Don't hijack the timestamp, just ignore it. Use a column family for
>> all the data, an individual column for the date/time of the
>> observation, and individual columns for each property of each
>> observation.
>>
>> Thoughts?
>>
>> On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross <[email protected]> wrote:
>>
>> > Greetings,
>> >
>> > I am prototyping a financial time series database on top of HBase
>> > and trying to wrap my head around what a good design would look like.
>> >
>> > As I understand it, I have rows, column families, columns and cells.
>> >
>> > Since the only thing that HBase really "indexes" is row keys, it
>> > seems natural in a way to represent the row keys as the date/time.
>> >
>> > As a simple example:
>> >
>> > Bar data:
>> >
>> > {
>> >   "2009/1/17" : {
>> >     "open":"100",
>> >     "high":"102",
>> >     "low":"99",
>> >     "close":"101",
>> >     "volume":"1000256"
>> >   }
>> > }
>> >
>> > Quote data:
>> >
>> > {
>> >   "2009/1/17:11:23:04" : {
>> >     "bid":"100.01",
>> >     "ask":"100.02",
>> >     "bidsize":"10000",
>> >     "asksize":"100200"
>> >   }
>> > }
>> >
>> > But there are many other issues to think about.
>> >
>> > In financial time series data we have small amounts of data within
>> > each "observation" and we can have lots of observations. We can have
>> > millions of observations per time series (f.ex. all historical trade
>> > and quote data for a particular stock since 1993) across hundreds of
>> > thousands of individual instruments (f.ex. across all stocks that
>> > have traded since 1993).
>> >
>> > The write patterns fit HBase nicely, because it is a write-once-and-
>> > append pattern. This is followed by loads of offline processes for
>> > simulating trading models and such. These query patterns look like
>> > "all quotes for all stocks between the dates of 1/1/1996 and
>> > 12/31/2008." So the querying is typically across a date range, and
>> > we can further filter the query by instrument types.
>> >
>> > So I am not sure what makes sense for efficiency because I do not
>> > understand HBase well enough yet.
>> >
>> > What kinds of mixes of rows, column families, and columns should I
>> > be thinking about?
>> >
>> > Does my simplistic approach make any sense? That would mean each row
>> > is a key-value pair where the key is the date/time and the value is
>> > the "observation." I suppose this leads to a "table per time series"
>> > model. Does that make sense, or is there overhead to having lots of
>> > tables?
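
For the query pattern described in the thread above -- e.g. "all quotes for all stocks between the dates of 1/1/1996 and 12/31/2008" -- a date-range read against that same layout might look like the following sketch. It again assumes an HBase 2.x client, the illustrative "quotes" table and "d" family from the earlier sketch, and a VERSIONS setting high enough to retain every observation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class QuoteRangeScan {

  private static final byte[] FAMILY = Bytes.toBytes("d"); // illustrative family name

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("quotes"))) {

      // Every row is one ticker, so scan all rows but restrict the cell
      // timestamps (the observation times) to the date range of interest.
      long start = 820454400000L;   // 1996-01-01 00:00:00 UTC
      long end   = 1230768000000L;  // 2009-01-01 00:00:00 UTC (upper bound is exclusive)
      Scan scan = new Scan()
          .addFamily(FAMILY)         // the single family holding the series
          .setTimeRange(start, end)
          .readAllVersions();        // every observation in range, not just the newest

      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          String ticker = Bytes.toString(row.getRow());
          for (Cell cell : row.rawCells()) {
            System.out.printf("%s %s=%s @ %d%n",
                ticker,
                Bytes.toString(CellUtil.cloneQualifier(cell)),
                Bytes.toString(CellUtil.cloneValue(cell)),
                cell.getTimestamp());
          }
        }
      }
    }
  }
}

The scan touches every ticker row but restricts cells to the requested time window; narrowing it to particular instruments would amount to adding a row-key range or filter on top of the same scan.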
