Hello,

We've got a data warehouse application exposed through a web front end. We're a DB-backed site evaluating HBase. We have a typical OLTP-type application and run a lot of ETL to load a data warehouse that serves our front end. We've outgrown the hardware and are having massive scaling issues. After evaluating HBase we've discovered some things and wanted to get some input. I'm sure others have run into this, and perhaps there are other technologies or ideas for solving these problems.
First, the map/reduce side of Hadoop will solve all our ETL issues and allow that to scale. It appears we can insert our data in a more unstructured manner to save space, and loading looks okay. We're now focusing primarily on fast website access.

HBase seems able to serve up requests quite quickly, but only very basic access requests; for anything more complex I think we're stuck. For example, a simple DW-type query would be to generate a top-N result over some fact data for a variable date range, ordered descending. In DB terms, that means querying across the date range, summing the fact data grouped by the desired keys, then sorting the results. On our database, these queries are not scaling: both disk access and sort time are too long, and the front end is suffering.

Our hope was to distribute this, but that's not really what HBase does. Ideally, if we created map-reduce jobs to generate the results of these queries and pre-store them, we could then retrieve them very quickly. However, that's essentially a materialized view, and we have the same issue on a database: for every date range and every combination of dimensions, storing every result grows exponentially and becomes impossible.

So we were looking at ideas to let HBase distribute the query. A map-reduce job is ideal, but far too slow for a real-time web request; it's almost like we need a lighter version of it. We just want to return a small amount of data, and if the job runs into issues, who cares; the user can resubmit. The objective is low latency.

That means we need to spread the work around. In HBase, anything under one key range is all stored together, so we've started structuring our keys to hash them out across a wide range and spread the data out. We've been testing having many threads each perform a lookup to find the ranges to query, then all hit them at once. The issue remains that the only work the nodes are doing is retrieving the keys.
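To make the key-spreading idea concrete, here's a rough sketch in Python (not HBase client code; `scan_bucket` is a made-up stand-in for whatever per-bucket scan we'd actually run, and the bucket count and key layout are just illustrative). It salts each row key with a hash bucket, fans out one scan per bucket in parallel, pre-aggregates each bucket's rows, and merges the small per-bucket top-N lists. Note that even here all the aggregation still happens on the client side, which is exactly the inefficiency we're running into:

```python
import hashlib
import heapq
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 16  # how widely we spread the keys (illustrative)

def salted_key(date, dim_key):
    """Prefix the natural key with a hash bucket so a date-range
    scan doesn't all land on one region."""
    bucket = int(hashlib.md5(dim_key.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}:{date}:{dim_key}"

def partial_top_n(rows, n):
    """Sum one bucket's (dim_key, fact_value) rows grouped by key and
    keep only that bucket's local top N. Because the salt is derived
    from dim_key, each key's rows live entirely in one bucket, so the
    per-bucket top N is exact."""
    totals = defaultdict(int)
    for dim_key, value in rows:
        totals[dim_key] += value
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

def top_n(scan_bucket, n):
    """Fan out one scan per bucket in parallel, then merge the small
    partial top-N lists on the client."""
    with ThreadPoolExecutor(max_workers=NUM_BUCKETS) as pool:
        partials = list(pool.map(lambda b: partial_top_n(scan_bucket(b), n),
                                 range(NUM_BUCKETS)))
    merged = [pair for partial in partials for pair in partial]
    return heapq.nlargest(n, merged, key=lambda kv: kv[1])
```

The merge step only ever sees NUM_BUCKETS x N rows instead of the full scan result, but the raw rows still cross the wire to the client before being summed, which is the part we'd like the region servers to do.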
The nodes do no sorting and no pruning; it's all on the client to pull the data back, merge it, and sort. This is inefficient, as the client may need to pull millions of rows just to get the top 10. Ideally that work would happen on the nodes.

Obviously we're shoe-horning some of what HBase does to fit our needs. But I'm curious how others have solved this issue, and whether there's any work anywhere, or possible future versions, that might address some of these distributed queries.

Thanks

--
View this message in context: http://www.nabble.com/Evaluation-of-HBase-Hadoop-and-some-fundamental-questions-tp21602482p21602482.html
Sent from the HBase User mailing list archive at Nabble.com.
