Are there technical limitations to the number of different timestamps
per cell? If it's the case that you're going to be dealing with tens
of thousands to millions of entries all in one cell, perhaps you
should check that to make sure it's a reasonable use case. The
examples in the HBase docs number the timestamps in single digits, and
I don't recall any mention of very large numbers.
An alternative layout might be to append the timestamp to the
instrument for the row key. So you might have something like:
YHOO.20090402154723012345 -> ...
YHOO.20090402154723023456 -> ...
This way, if you're appending to the database in the order your quotes
come in, you aren't hitting the hotspot previously mentioned. You also
get to index into the table by instrument if you need to. The downside
here is that if you have to read in data for *all* instruments for a
specific day, there doesn't seem to be a trivial way of accomplishing
that. You could, of course, maintain a separate database that tells
you what the entire universe of instruments is per day.
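As a rough sketch of composing such keys (the make_row_key helper and
the zero-padded microsecond format here are my own illustration, not
part of any HBase client API):

```python
from datetime import datetime

def make_row_key(instrument: str, ts: datetime) -> str:
    # Compose INSTRUMENT.YYYYMMDDHHMMSSffffff so that keys for one
    # instrument sort together, and chronologically within the
    # instrument (HBase keeps rows sorted lexicographically by key bytes).
    return f"{instrument}.{ts.strftime('%Y%m%d%H%M%S%f')}"

key = make_row_key("YHOO", datetime(2009, 4, 2, 15, 47, 23, 12345))
# -> "YHOO.20090402154723012345"
```

A scan over the prefix "YHOO." would then return all quotes for that
instrument in time order.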
You could also use a hybrid approach, where perhaps you have a row key
that matches a single day, like YHOO.20090402, and then have X number
of cells with timestamps according to the time at which that quote
came in on that day.
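A sketch of that hybrid keying (again just illustrative helpers; the
epoch-milliseconds cell-version convention is an assumption on my part):

```python
from datetime import datetime

def day_row_key(instrument: str, ts: datetime) -> str:
    # One row per instrument per day, e.g. "YHOO.20090402".
    return f"{instrument}.{ts.strftime('%Y%m%d')}"

def cell_version(ts: datetime) -> int:
    # Use the quote's arrival time (as epoch milliseconds) as the cell
    # timestamp, so each quote that day lands in its own cell version.
    return int(ts.timestamp() * 1000)

row = day_row_key("YHOO", datetime(2009, 4, 2, 15, 47, 23))
# -> "YHOO.20090402"
```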
At any rate -- if your access pattern is structured predictably (like,
primarily reading straight through and picked by coarse grain
properties, such as instrument and day), you might be better served
storing the files directly in HDFS and just not bothering with HBase
at all.
Wes
On Apr 2, 2009, at 11:41 AM, Bradford Cross wrote:
Cool, so the schema I am leaning toward is:
- hijack the time stamp to be the time of each observation. Use a
column family to hold all the data, and a column for each property of
each observation.
Since HBase sorts the timestamps descending, it seems like hijacking
the timestamps makes sense. Any performance implications of this that
I should be aware of?
Hijacking the time stamps seems to be fairly intuitive, and leverages
the time stamps, which I otherwise would not really care about if I
just ignored timestamps and dumped all data, including the date/time
of observations, into columns.
Are there any downsides to hijacking the timestamps like this?
On Thu, Apr 2, 2009 at 12:13 AM, stack <[email protected]> wrote:
I should also state that apart from the hbase inadequacy, your schema
looks good (hbase should be able to carry this schema-type w/o sweat
-- hopefully 0.20.0).
St.Ack
On Thu, Apr 2, 2009 at 9:12 AM, stack <[email protected]> wrote:
How many columns will you have? Until we fix
https://issues.apache.org/jira/browse/HBASE-867, you are limited in
the number of columns you can have.
St.Ack
On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross
<[email protected]> wrote:
Based on reading the hbase architecture wiki, I have changed my
thinking due to the "Column Family Centric Storage."
HBase stores column families physically close on disk, so the items in
a given column family should have roughly the same read/write
characteristics and contain similar data. Although at a conceptual
level, tables may be viewed as a sparse set of rows, physically they
are stored on a per-column family basis. This is an important
consideration for schema and application designers to keep in mind.
This leads me to the thought of keeping an entire time series inside a
single column family.
Options:
Row key is a ticker symbol:
- hijack the time stamp to be the time of each observation. Use a
column family to hold all the data, and a column for each property of
each observation.
- don't hijack the time stamp, just ignore it. Use a column family for
all the data, and use an individual column for the date/time of the
observation, and individual columns for each property of each
observation.
thoughts?
On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross
<[email protected]> wrote:
Greetings,
I am prototyping a financial time series database on top of HBase and
trying to wrap my head around what a good design would look like.
As I understand it, I have rows, column families, columns and cells.
Since the only thing that HBase really "indexes" is row keys, it seems
natural in a way to represent the row keys as the date/time.
As a simple example:
Bar data:
{
"2009/1/17" : {
"open":"100",
"high":"102",
"low":"99",
"close":"101"
"volume":"1000256"
}
}
Quote data:
{
"2009/1/17:11:23:04" : {
"bid":"100.01",
"ask":"100.02",
"bidsize":"10000",
"asksize":"100200"
}
}
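That layout could be mocked up in plain Python dictionaries like this
(a toy in-memory model only, not HBase API; the "quote" family name is
my own choice):

```python
# One row per observation, keyed by date/time; each property of the
# observation becomes a column in a single "quote" column family.
table = {}

def put_quote(ts_key, bid, ask, bidsize, asksize):
    table[ts_key] = {
        "quote:bid": bid,
        "quote:ask": ask,
        "quote:bidsize": bidsize,
        "quote:asksize": asksize,
    }

put_quote("2009/1/17:11:23:04", "100.01", "100.02", "10000", "100200")
```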
But there are many other issues to think about.
In financial time series data we have small amounts of data within
each "observation" and we can have lots of observations. We can have
millions of observations per time series (f.ex. all historical trade
and quote data for a particular stock since 1993) across hundreds of
thousands of individual instruments (f.ex. across all stocks that have
traded since 1993.)
The write patterns fit HBase nicely, because it is a write once and
append pattern. This is followed by loads of offline processes for
simulating trading models and such. These query patterns look like
"all quotes for all stocks between the dates of 1/1/1996 and
12/31/2008." So the querying is typically across a date range, and we
can further filter the query by instrument types.
So I am not sure what makes sense for efficiency because I do not
understand HBase well enough yet.
What kinds of mixes of rows, column families, and columns should I be
thinking about?
Does my simplistic approach make any sense? That would mean each row
is a key-value pair where the key is the date/time and the value is
the "observation." I suppose this leads to a "table per time series"
model. Does that make sense or is there overhead to having lots of
tables?