Are there technical limitations to the number of different timestamps
per cell? If it's the case that you're going to be dealing with tens
of thousands to millions of entries all in one cell, perhaps you
should check that to make sure it's a reasonable use case. The
examples in the HBase docs number the timestamps in single digits, and
I don't recall any mention of very large numbers.
An alternative layout might be to append the timestamp to the
instrument for the row key. So you might have something like:
YHOO.20090402154723012345 -> ...
YHOO.20090402154723023456 -> ...
This way, if you're appending to the database in the order your quotes
come in, you aren't hitting the hotspot previously mentioned. You also
get to index into the table by instrument if you need to. The downside
here is that if you have to read in data for *all* instruments for a
specific day, there doesn't seem to be a trivial way of accomplishing
that. You could, of course, maintain a separate database that tells
you what the entire universe of instruments is per day.
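As a rough sketch of composing such keys (the make_row_key helper and
the zero-padded microsecond format here are my own illustration, not
part of any HBase client API):

```python
from datetime import datetime

def make_row_key(instrument: str, ts: datetime) -> str:
    # Compose INSTRUMENT.YYYYMMDDHHMMSSffffff so that keys for one
    # instrument sort together, and chronologically within the
    # instrument (HBase keeps rows sorted lexicographically by key bytes).
    return f"{instrument}.{ts.strftime('%Y%m%d%H%M%S%f')}"

key = make_row_key("YHOO", datetime(2009, 4, 2, 15, 47, 23, 12345))
# -> "YHOO.20090402154723012345"
```

A scan over the prefix "YHOO." would then return all quotes for that
instrument in time order.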
You could also use a hybrid approach, where perhaps you have a row key
that matches a single day, like YHOO.20090402, and then have X number
of cells with timestamps according to the time at which that quote
came in on that day.
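A sketch of that hybrid keying (again just illustrative helpers; the
epoch-milliseconds cell-version convention is an assumption on my part):

```python
from datetime import datetime

def day_row_key(instrument: str, ts: datetime) -> str:
    # One row per instrument per day, e.g. "YHOO.20090402".
    return f"{instrument}.{ts.strftime('%Y%m%d')}"

def cell_version(ts: datetime) -> int:
    # Use the quote's arrival time (as epoch milliseconds) as the cell
    # timestamp, so each quote that day lands in its own cell version.
    return int(ts.timestamp() * 1000)

row = day_row_key("YHOO", datetime(2009, 4, 2, 15, 47, 23))
# -> "YHOO.20090402"
```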
At any rate -- if your access pattern is structured predictably (like,
primarily reading straight through and picked by coarse grain
properties, such as instrument and day), you might be better served
storing the files directly in HDFS and just not bothering with HBase
at all.
Wes
On Apr 2, 2009, at 11:41 AM, Bradford Cross wrote:
Cool, so the schema I am leaning toward is:
- hijack the time stamp to be the time of each observation. Use a
column family to hold all the data, and a column for each property of
each observation.
Since HBase sorts the timestamps descending, it seems like hijacking
the timestamps makes sense. Any performance implications of this that
I should be aware of?
Hijacking the time stamps seems to be fairly intuitive, and leverages
the time stamps, which I otherwise would not really care about if I
just ignored timestamps and dumped all data, including the date/time
of observations, into columns.
Are there any downsides to hijacking the timestamps like this?
On Thu, Apr 2, 2009 at 12:13 AM, stack <[email protected]> wrote:
I should also state that apart from the hbase inadequacy, your schema
looks good (hbase should be able to carry this schema-type w/o sweat
-- hopefully 0.20.0).
St.Ack
On Thu, Apr 2, 2009 at 9:12 AM, stack <[email protected]> wrote:
How many columns will you have? Until we fix
https://issues.apache.org/jira/browse/HBASE-867, you are limited in
the number of columns you can have.
St.Ack
On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross
<[email protected]> wrote:
Based on reading the hbase architecture wiki, I have changed my
thinking due to the "Column Family Centric Storage."
HBase stores column families physically close on disk, so the items in
a given column family should have roughly the same read/write
characteristics and contain similar data. Although at a conceptual
level, tables may be viewed as a sparse set of rows, physically they
are stored on a per-column family basis. This is an important
consideration for schema and application designers to keep in mind.
This leads me to the thought of keeping an entire time series inside a
single column family.
Options:
Row key is a ticker symbol:
- hijack the time stamp to be the time of each observation. Use a
column family to hold all the data, and a column for each property of
each observation.
- don't hijack the time stamp, just ignore it. Use a column family for
all the data, and use an individual column for the date/time of the
observation, and individual columns for each property of each
observation.
thoughts?
On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross
<[email protected]> wrote:
Greetings,
I am prototyping a financial time series database on top of HBase and
trying to wrap my head around what a good design would look like.
As I understand it, I have rows, column families, columns and cells.
Since the only thing that HBase really "indexes" is row keys, it seems
natural in a way to represent the row keys as the date/time.
As a simple example:
Bar data:
{
"2009/1/17" : {
"open":"100",
"high":"102",
"low":"99",
"close":"101"
"volume":"1000256"
}
}
Quote data:
{
"2009/1/17:11:23:04" : {
"bid":"100.01",
"ask":"100.02",
"bidsize":"10000",
"asksize":"100200"
}
}
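That layout could be mocked up in plain Python dictionaries like this
(a toy in-memory model only, not HBase API; the "quote" family name is
my own choice):

```python
# One row per observation, keyed by date/time; each property of the
# observation becomes a column in a single "quote" column family.
table = {}

def put_quote(ts_key, bid, ask, bidsize, asksize):
    table[ts_key] = {
        "quote:bid": bid,
        "quote:ask": ask,
        "quote:bidsize": bidsize,
        "quote:asksize": asksize,
    }

put_quote("2009/1/17:11:23:04", "100.01", "100.02", "10000", "100200")
```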
But there are many other issues to think about.
In financial time series data we have small amounts of data within
each "observation" and we can have lots of observations. We can have
millions of observations per time series (f.ex. all historical trade
and quote data for a particular stock since 1993) across hundreds of
thousands of individual instruments (f.ex. across all stocks that have
traded since 1993.)
The write patterns fit HBase nicely, because it is a write once and
append pattern. This is followed by loads of offline processes for
simulating trading models and such. These query patterns look like
"all quotes for all stocks between the dates of 1/1/1996 and
12/31/2008." So the querying is typically across a date range, and we
can further filter the query by instrument types.
So I am not sure what makes sense for efficiency because I do not
understand HBase well enough yet.
What kinds of mixes of rows, column families, and columns should I be
thinking about?
Does my simplistic approach make any sense? That would mean each row
is a key-value pair where the key is the date/time and the value is
the "observation." I suppose this leads to a "table per time series"
model. Does that make sense or is there overhead to having lots of
tables?