Are there technical limitations on the number of distinct timestamps per cell? If you're going to be dealing with tens of thousands to millions of entries all in one cell, perhaps you should check that to make sure it's a reasonable use case. The examples in the HBase docs number the timestamps in single digits, and I don't recall any mention of very large numbers.

An alternative layout might be to append the timestamp to the instrument for the row key. So you might have something like:

YHOO.20090402154723012345 -> ...
YHOO.20090402154723023456 -> ...

This way, if you're appending to the database in the order your quotes come in, you aren't hitting the hotspot mentioned previously. You also get to index into the table by instrument if you need to. The downside here is that if you have to read in data for *all* instruments for a specific day, there doesn't seem to be a trivial way of accomplishing that. You could, of course, maintain a separate database that tells you what the entire universe of instruments is per day.
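To sketch why this key layout works (illustrative code, not the HBase client API; class and method names here are made up): HBase sorts rows lexicographically by byte, so as long as the appended timestamp is a fixed-width, zero-padded string, byte order equals time order, and a prefix scan recovers one instrument's series. A plain sorted map stands in for the row space:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class RowKeySketch {
    // Compose an "INSTRUMENT.timestamp" row key. The timestamp must be a
    // fixed-width, zero-padded string so that lexicographic (byte) order
    // matches chronological order -- the same order HBase keeps rows in.
    static String rowKey(String instrument, String timestamp) {
        return instrument + "." + timestamp;
    }

    public static void main(String[] args) {
        // A TreeMap stands in for HBase's lexicographically sorted row space.
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put(rowKey("YHOO", "20090402154723012345"), "quote 1");
        rows.put(rowKey("YHOO", "20090402154723023456"), "quote 2");
        rows.put(rowKey("GOOG", "20090402154723000001"), "quote 3");

        // "Index into the table by instrument": a prefix scan over YHOO.*
        // ('/' is the byte right after '.', so it bounds the prefix).
        NavigableMap<String, String> yhoo = rows.subMap("YHOO.", true, "YHOO/", false);
        System.out.println(yhoo.firstKey()); // earliest YHOO quote sorts first
    }
}
```

In HBase itself the equivalent would be a scan with a start row of the instrument prefix, but the ordering argument is the same.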

You could also use a hybrid approach, where perhaps you have a row key that matches a single day, like YHOO.20090402, and then have X number of cells with timestamps according to the time at which that quote came in on that day.

At any rate -- if your access pattern is structured predictably (e.g., primarily reading straight through, selected by coarse-grained properties such as instrument and day), you might be better served storing the files directly in HDFS and not bothering with HBase at all.


Wes


On Apr 2, 2009, at 11:41 AM, Bradford Cross wrote:

Cool, so the schema I am leaning toward is:

- hijack the timestamp to be the time of each observation. Use a column family to hold all the data, and a column for each property of each observation.

Since HBase sorts the timestamps descending, it seems like hijacking the timestamps makes sense. Any performance implications of this that I should be aware of?

Hijacking the timestamps seems fairly intuitive, and it leverages the timestamps, which I otherwise would not really care about if I just ignored them and dumped all the data, including the date/time of observations, into columns.

Are there any downsides to hijacking the timestamps like this?
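If it helps to picture the "hijacked timestamp" layout: each cell holds many versions keyed by the observation's own time, and HBase keeps versions newest-first. The sketch below models that with a reverse-ordered map (this is a sketch of the semantics, not HBase client code; names are illustrative):

```java
import java.util.Comparator;
import java.util.TreeMap;

public class VersionedCellSketch {
    // Return the newest version of a cell, mirroring how an HBase read
    // without an explicit timestamp returns the most recent version.
    static String latest(TreeMap<Long, String> versions) {
        return versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        // One cell (say row "YHOO", column "data:bid") whose versions are
        // keyed by each observation's own time -- the "hijacked" timestamp.
        // The reversed comparator models HBase's newest-first version order.
        TreeMap<Long, String> bid = new TreeMap<>(Comparator.reverseOrder());
        bid.put(20090402154723012L, "100.01");
        bid.put(20090402154723023L, "100.03");

        System.out.println(latest(bid)); // the most recent observation
    }
}
```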



On Thu, Apr 2, 2009 at 12:13 AM, stack <[email protected]> wrote:

I should also state that apart from the hbase inadequacy, your schema looks good (hbase should be able to carry this schema-type w/o sweat -- hopefully 0.20.0).
St.Ack

On Thu, Apr 2, 2009 at 9:12 AM, stack <[email protected]> wrote:

How many columns will you have?  Until we fix https://issues.apache.org/jira/browse/HBASE-867, you are limited as regards the number of columns you can have.
St.Ack


On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross <[email protected]> wrote:

Based on reading the HBase architecture wiki, I have changed my thinking due to the "Column Family Centric Storage":

HBase stores column families physically close on disk, so the items in a given column family should have roughly the same read/write characteristics and contain similar data. Although at a conceptual level tables may be viewed as a sparse set of rows, physically they are stored on a per-column-family basis. This is an important consideration for schema and application designers to keep in mind.

This leads me to the thought of keeping an entire time series inside a single column family.

Options:

Row key is a ticker symbol:
- hijack the timestamp to be the time of each observation. Use a column family to hold all the data, and a column for each property of each observation.
- don't hijack the timestamp, just ignore it. Use a column family for all the data, with an individual column for the date/time of the observation and individual columns for each property of each observation.

thoughts?

On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross
<[email protected]>wrote:

Greetings,

I am prototyping a financial time series database on top of HBase and trying to wrap my head around what a good design would look like.

As I understand it, I have rows, column families, columns and cells.

Since the only thing that HBase really "indexes" is row keys, it seems natural in a way to represent the row keys as the date/time.

As a simple example:

Bar data:

{
  "2009/1/17" : {
    "open":"100",
    "high":"102",
    "low":"99",
    "close":"101",
    "volume":"1000256"
  }
}


Quote data:

{
  "2009/1/17:11:23:04" : {
    "bid":"100.01",
    "ask":"100.02",
    "bidsize":"10000",
    "asksize":"100200"
  }
}

But there are many other issues to think about.

In financial time series data we have small amounts of data within each "observation", and we can have lots of observations. We can have millions of observations per time series (f.ex. all historical trade and quote data for a particular stock since 1993) across hundreds of thousands of individual instruments (f.ex. across all stocks that have traded since 1993).

The write patterns fit HBase nicely, because it is a write-once-and-append pattern. This is followed by loads of offline processes for simulating trading models and such. These query patterns look like "all quotes for all stocks between the dates of 1/1/1996 and 12/31/2008." So the querying is typically across a date range, and we can further filter the query by instrument type.
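To make the range-query shape concrete (an illustrative sketch using a sorted map, not the HBase client API; the class and method names are hypothetical): with date/time-leading row keys, "all quotes for all stocks between two dates" becomes one contiguous scan between a start row and a stop row, whereas instrument-leading keys would need one such scan per instrument.

```java
import java.util.TreeMap;

public class DateRangeSketch {
    // Count rows whose keys fall in [from, to) -- a stand-in for an HBase
    // scan bounded by a start row and a stop row.
    static int countInRange(TreeMap<String, String> rows, String from, String to) {
        return rows.subMap(from, true, to, false).size();
    }

    public static void main(String[] args) {
        // Date-leading row keys, as in the simple bar/quote examples above;
        // the TreeMap models HBase's lexicographically sorted row space.
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put("19951230.YHOO", "quote");
        rows.put("19960115.YHOO", "quote");
        rows.put("20080610.MSFT", "quote");
        rows.put("20090101.GOOG", "quote");

        // One contiguous range scan covers the whole query window.
        System.out.println(countInRange(rows, "19960101", "20090101")); // 2
    }
}
```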

So I am not sure what makes sense for efficiency, because I do not understand HBase well enough yet.

What kinds of mixes of rows, column families, and columns should I be
thinking about?

Does my simplistic approach make any sense? That would mean each row is a key-value pair where the key is the date/time and the value is the "observation." I suppose this leads to a "table per time series" model. Does that make sense, or is there overhead to having lots of tables?





