I should also state that, apart from the HBase inadequacy mentioned below, your schema looks good (HBase should be able to carry this schema type w/o sweat -- hopefully in 0.20.0). St.Ack
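
A rough sketch of the layout being endorsed here -- ticker symbol as row key, a single column family with one column per property, and the observation time "hijacked" into the cell timestamp, as proposed further down the thread -- written against the current HBase Java client rather than the 0.19/0.20-era API under discussion; the table name "quotes" and the family name "d" are illustrative only:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class QuoteWriter {

  // Single column family holding the whole series; the family name "d"
  // and the table name "quotes" are made up for this sketch.
  private static final byte[] FAMILY = Bytes.toBytes("d");

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("quotes"))) {

      // Row key is the ticker symbol; the observation time becomes the
      // cell timestamp; one column per property, values stored as strings
      // to mirror the JSON example in the thread.
      long observationTime = 1232191384000L; // 2009/1/17 11:23:04 UTC, in millis
      Put put = new Put(Bytes.toBytes("GOOG"));
      put.addColumn(FAMILY, Bytes.toBytes("bid"),     observationTime, Bytes.toBytes("100.01"));
      put.addColumn(FAMILY, Bytes.toBytes("ask"),     observationTime, Bytes.toBytes("100.02"));
      put.addColumn(FAMILY, Bytes.toBytes("bidsize"), observationTime, Bytes.toBytes("10000"));
      put.addColumn(FAMILY, Bytes.toBytes("asksize"), observationTime, Bytes.toBytes("100200"));
      table.put(put);
    }
  }
}

Note that keeping a whole series in cell versions like this only works if the column family's VERSIONS setting is raised far above the default (which retains only a handful of versions per cell); otherwise older observations are dropped at flush/compaction time.
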
On Thu, Apr 2, 2009 at 9:12 AM, stack <[email protected]> wrote:

> How many columns will you have? Until we fix
> https://issues.apache.org/jira/browse/HBASE-867, you are limited as
> regards the number of columns you can have.
> St.Ack
>
>
> On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross <[email protected]> wrote:
>
>> Based on reading the HBase architecture wiki, I have changed my
>> thinking due to the "Column Family Centric Storage."
>>
>> HBase stores column families physically close on disk, so the items in
>> a given column family should have roughly the same read/write
>> characteristics and contain similar data. Although at a conceptual
>> level, tables may be viewed as a sparse set of rows, physically they
>> are stored on a per-column-family basis. This is an important
>> consideration for schema and application designers to keep in mind.
>>
>> This leads me to the thought of keeping an entire time series inside a
>> single column family.
>>
>> Options:
>>
>> Row key is a ticker symbol:
>> - Hijack the timestamp to be the time of each observation. Use a
>> column family to hold all the data, and a column for each property of
>> each observation.
>> - Don't hijack the timestamp, just ignore it. Use a column family for
>> all the data, an individual column for the date/time of the
>> observation, and individual columns for each property of each
>> observation.
>>
>> Thoughts?
>>
>> On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross <[email protected]> wrote:
>>
>> > Greetings,
>> >
>> > I am prototyping a financial time series database on top of HBase
>> > and trying to wrap my head around what a good design would look like.
>> >
>> > As I understand it, I have rows, column families, columns and cells.
>> >
>> > Since the only thing that HBase really "indexes" is row keys, it
>> > seems natural in a way to represent the row keys as the date/time.
>> >
>> > As a simple example:
>> >
>> > Bar data:
>> >
>> > {
>> >   "2009/1/17" : {
>> >     "open":"100",
>> >     "high":"102",
>> >     "low":"99",
>> >     "close":"101",
>> >     "volume":"1000256"
>> >   }
>> > }
>> >
>> > Quote data:
>> >
>> > {
>> >   "2009/1/17:11:23:04" : {
>> >     "bid":"100.01",
>> >     "ask":"100.02",
>> >     "bidsize":"10000",
>> >     "asksize":"100200"
>> >   }
>> > }
>> >
>> > But there are many other issues to think about.
>> >
>> > In financial time series data we have small amounts of data within
>> > each "observation" and we can have lots of observations. We can have
>> > millions of observations per time series (f.ex. all historical trade
>> > and quote data for a particular stock since 1993) across hundreds of
>> > thousands of individual instruments (f.ex. across all stocks that
>> > have traded since 1993).
>> >
>> > The write patterns fit HBase nicely, because it is a write-once-and-
>> > append pattern. This is followed by loads of offline processes for
>> > simulating trading models and such. These query patterns look like
>> > "all quotes for all stocks between the dates of 1/1/1996 and
>> > 12/31/2008." So the querying is typically across a date range, and
>> > we can further filter the query by instrument types.
>> >
>> > So I am not sure what makes sense for efficiency because I do not
>> > understand HBase well enough yet.
>> >
>> > What kinds of mixes of rows, column families, and columns should I
>> > be thinking about?
>> >
>> > Does my simplistic approach make any sense? That would mean each row
>> > is a key-value pair where the key is the date/time and the value is
>> > the "observation." I suppose this leads to a "table per time series"
>> > model. Does that make sense, or is there overhead to having lots of
>> > tables?
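
For the query pattern described in the thread above -- e.g. "all quotes for all stocks between the dates of 1/1/1996 and 12/31/2008" -- a date-range read against that same layout might look like the following sketch. It again assumes an HBase 2.x client, the illustrative "quotes" table and "d" family from the earlier sketch, and a VERSIONS setting high enough to retain every observation:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class QuoteRangeScan {

  private static final byte[] FAMILY = Bytes.toBytes("d"); // illustrative family name

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("quotes"))) {

      // Every row is one ticker, so scan all rows but restrict the cell
      // timestamps (the observation times) to the date range of interest.
      long start = 820454400000L;   // 1996-01-01 00:00:00 UTC
      long end   = 1230768000000L;  // 2009-01-01 00:00:00 UTC (upper bound is exclusive)
      Scan scan = new Scan()
          .addFamily(FAMILY)         // the single family holding the series
          .setTimeRange(start, end)
          .readAllVersions();        // every observation in range, not just the newest

      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          String ticker = Bytes.toString(row.getRow());
          for (Cell cell : row.rawCells()) {
            System.out.printf("%s %s=%s @ %d%n",
                ticker,
                Bytes.toString(CellUtil.cloneQualifier(cell)),
                Bytes.toString(CellUtil.cloneValue(cell)),
                cell.getTimestamp());
          }
        }
      }
    }
  }
}

The scan touches every ticker row but restricts cells to the requested time window; narrowing it to particular instruments would amount to adding a row-key range or filter on top of the same scan.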
