Cool, so the schema I am leaning toward is: row key is a ticker symbol, hijack the timestamp to be the time of each observation, use one column family to hold all the data, and use a column for each property of each observation.
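To make that layout concrete, here is a rough in-memory sketch in Python. It is not HBase client code: the family name `bar`, the ticker `XYZ`, the `put` and `versions` helpers, and the choice of epoch milliseconds for the timestamp are all made up for illustration, and the 2009/1/16 observation is invented sample data. It only shows the shape of the data (ticker as row key, one column per property, observation time as the cell version) and a newest-first read over a date range.

```python
from collections import defaultdict
from datetime import datetime, timezone

# table[row_key][column][timestamp_ms] = value
# row key = ticker symbol, column = "family:property",
# cell version timestamp = time of the observation (the "hijacked" timestamp)
table = defaultdict(lambda: defaultdict(dict))


def to_millis(year, month, day, hour=0, minute=0, second=0):
    """Observation time as epoch milliseconds (UTC), used as the cell version."""
    dt = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def put(ticker, ts_ms, observation, family="bar"):
    """Store one observation: one column per property, all sharing one timestamp."""
    for prop, value in observation.items():
        table[ticker][f"{family}:{prop}"][ts_ms] = value


def versions(ticker, column, start_ms, end_ms):
    """Return (timestamp, value) pairs with start_ms <= ts < end_ms, newest first."""
    cells = table[ticker][column]
    in_range = ((ts, v) for ts, v in cells.items() if start_ms <= ts < end_ms)
    return sorted(in_range, reverse=True)


# Hypothetical sample data: two daily bars for one instrument.
put("XYZ", to_millis(2009, 1, 16), {"open": "99", "close": "100"})
put("XYZ", to_millis(2009, 1, 17), {"open": "100", "high": "102", "low": "99",
                                    "close": "101", "volume": "1000256"})

# All closing prices for January 2009, newest observation first.
closes = versions("XYZ", "bar:close", to_millis(2009, 1, 1), to_millis(2009, 2, 1))
```

In this sketch every property of an observation shares the same timestamp, so a date-range read is a filter on cell versions within a single row rather than a scan across row keys.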
Since HBase sorts timestamps in descending order, hijacking them seems to make sense. It is fairly intuitive, and it puts the timestamps to use; otherwise I would just ignore them and dump all the data, including the date/time of each observation, into columns.

Are there any performance implications or other downsides to hijacking the timestamps like this that I should be aware of?

On Thu, Apr 2, 2009 at 12:13 AM, stack <[email protected]> wrote:

> I should also state that apart from the hbase inadequacy, your schema looks
> good (hbase should be able to carry this schema-type w/o sweat -- hopefully
> 0.20.0).
> St.Ack
>
> On Thu, Apr 2, 2009 at 9:12 AM, stack <[email protected]> wrote:
>
> > How many columns will you have?  Until we fix
> > https://issues.apache.org/jira/browse/HBASE-867, you are limited regards
> > the number of columns you can have.
> > St.Ack
> >
> > On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross <[email protected]> wrote:
> >
> >> Based on reading the hbase architecture wiki, I have changed my thinking
> >> due to the "Column Family Centric Storage."
> >>
> >> HBase stores column families physically close on disk, so the items in a
> >> given column family should have roughly the same read/write characteristics
> >> and contain similar data. Although at a conceptual level, tables may be
> >> viewed as a sparse set of rows, physically they are stored on a per-column
> >> family basis. This is an important consideration for schema and application
> >> designers to keep in mind.
> >>
> >> This leads me to the thought of keeping an entire time series inside a
> >> single column family.
> >>
> >> Options:
> >>
> >> Row key is a ticker symbol:
> >> - hijack time stamp to be the time of each observation. Use a column family
> >> to hold all the data, and a column for each property of each observation.
> >> - don't hijack the time stamp, just ignore it. Use a column family for all
> >> the data, and use an individual column for the date/time of the observation,
> >> and individual columns for each property of each observation.
> >>
> >> thoughts?
> >>
> >> On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross <[email protected]> wrote:
> >>
> >> > Greetings,
> >> >
> >> > I am prototyping a financial time series database on top of HBase and
> >> > trying to wrap my head around what a good design would look like.
> >> >
> >> > As I understand it, I have rows, column families, columns and cells.
> >> >
> >> > Since the only thing that HBase really "indexes" is row keys, it seems
> >> > natural in a way to represent the row keys as the date/time.
> >> >
> >> > As a simple example:
> >> >
> >> > Bar data:
> >> >
> >> > {
> >> >   "2009/1/17" : {
> >> >     "open":"100",
> >> >     "high":"102",
> >> >     "low":"99",
> >> >     "close":"101",
> >> >     "volume":"1000256"
> >> >   }
> >> > }
> >> >
> >> > Quote data:
> >> >
> >> > {
> >> >   "2009/1/17:11:23:04" : {
> >> >     "bid":"100.01",
> >> >     "ask":"100.02",
> >> >     "bidsize":"10000",
> >> >     "asksize":"100200"
> >> >   }
> >> > }
> >> >
> >> > But there are many other issues to think about.
> >> >
> >> > In financial time series data we have small amounts of data within each
> >> > "observation" and we can have lots of observations. We can have millions of
> >> > observations per time series (f.ex. all historical trade and quote data for
> >> > a particular stock since 1993) across hundreds of thousands of individual
> >> > instruments (f.ex. across all stocks that have traded since 1993.)
> >> >
> >> > The write patterns fit HBase nicely, because it is a write once and append
> >> > pattern. This is followed by loads of offline processes for simulating
> >> > trading models and such. These query patterns look like "all quotes for all
> >> > stocks between the dates of 1/1/1996 and 12/31/2008."
> >> > So the querying is typically across a date range, and we can further
> >> > filter the query by instrument types.
> >> >
> >> > So I am not sure what makes sense for efficiency because I do not
> >> > understand HBase well enough yet.
> >> >
> >> > What kinds of mixes of rows, column families, and columns should I be
> >> > thinking about?
> >> >
> >> > Does my simplistic approach make any sense? That would mean each row is a
> >> > key-value pair where the key is the date/time and the value is the
> >> > "observation." I suppose this leads to a "table per time series" model.
> >> > Does that make sense or is there overhead to having lots of tables?
