Hello,

We've got a data warehouse application exposed through a web front end. We're a DB-backed site evaluating HBase. We have a typical OLTP-type application and run a lot of ETL to load a data warehouse that serves our front end. We've outgrown the hardware and are having massive scaling issues. After evaluating HBase we've discovered some things and wanted to get some input. I'm sure others have run into this, and perhaps there are other technologies or ideas for solving these problems.
First, the map/reduce side of Hadoop will solve all our ETL issues and allow that to scale. It appears we can insert our data in a more unstructured manner to save space, and loading looks okay. We're now focusing primarily on fast website access.

HBase seems able to serve up requests quite quickly, but only very basic access requests; for anything more complex I think we're stuck. For example, a simple DW-type query would be to generate a top-N result over some fact data for a variable date range, ordered descending. In DB terms, that means querying across the date range, summing the fact data grouped by the desired keys, then sorting the results. On our database, these queries are not scaling: both disk access and sort time are too long, and the front end is suffering.

Our hope was to distribute this, but that's not really what HBase does. Ideally, if we created map-reduce jobs to generate the results of these queries and pre-store them, we could then retrieve them very quickly. However, that's essentially a materialized view, and we have the same issue on a database: for every date range and every combination of dimensions, storing every result grows exponentially and becomes impossible.

So we were looking at ideas to let HBase distribute the query. A map-reduce job is ideal, but far too slow for a real-time web request; it's almost like we need a lighter version of it. We just want to return a small amount of data, and if the job runs into issues, who cares; the user can resubmit. The objective is low latency.

That means we need to spread the work around. In HBase, anything under one key range is all stored together, so we've started structuring our keys to hash them out across a wide range and spread the data out. We've been testing having many threads each perform a lookup to find the ranges to query, then all hit them at once. The issue remains that the only work the nodes are doing is retrieving the keys.
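To make the key-spreading idea concrete, here's a rough sketch in Python (not HBase client code; `scan_bucket` is a made-up stand-in for whatever per-bucket scan we'd actually run, and the bucket count and key layout are just illustrative). It salts each row key with a hash bucket, fans out one scan per bucket in parallel, pre-aggregates each bucket's rows, and merges the small per-bucket top-N lists. Note that even here all the aggregation still happens on the client side, which is exactly the inefficiency we're running into:

```python
import hashlib
import heapq
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 16  # how widely we spread the keys (illustrative)

def salted_key(date, dim_key):
    """Prefix the natural key with a hash bucket so a date-range
    scan doesn't all land on one region."""
    bucket = int(hashlib.md5(dim_key.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}:{date}:{dim_key}"

def partial_top_n(rows, n):
    """Sum one bucket's (dim_key, fact_value) rows grouped by key and
    keep only that bucket's local top N. Because the salt is derived
    from dim_key, each key's rows live entirely in one bucket, so the
    per-bucket top N is exact."""
    totals = defaultdict(int)
    for dim_key, value in rows:
        totals[dim_key] += value
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

def top_n(scan_bucket, n):
    """Fan out one scan per bucket in parallel, then merge the small
    partial top-N lists on the client."""
    with ThreadPoolExecutor(max_workers=NUM_BUCKETS) as pool:
        partials = list(pool.map(lambda b: partial_top_n(scan_bucket(b), n),
                                 range(NUM_BUCKETS)))
    merged = [pair for partial in partials for pair in partial]
    return heapq.nlargest(n, merged, key=lambda kv: kv[1])
```

The merge step only ever sees NUM_BUCKETS x N rows instead of the full scan result, but the raw rows still cross the wire to the client before being summed, which is the part we'd like the region servers to do.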
The nodes do no sorting and no pruning; it's all on the client to pull the data back, merge it, and sort. This is inefficient, as the client may need to pull millions of rows just to get the top 10. Ideally that work would happen on the nodes.

Obviously we're shoe-horning some of what HBase does to fit our needs. But I'm curious how others have solved this issue, and whether there's any work anywhere, or possible future versions, that might address some of these distributed queries.

Thanks

--
View this message in context: http://www.nabble.com/Evaluation-of-HBase-Hadoop-and-some-fundamental-questions-tp21602482p21602482.html
Sent from the HBase User mailing list archive at Nabble.com.
