Yes, it seems that the fundamental 'differentness' of HDFS/MapReduce is that they're not very well suited to random access -- I was hoping HBase had found a way 'around' that, but of course that 'differentness' is a fundamental strength of the HDFS way of doing things.
Where things have gotten murky is that our data is very simple -- we just have a lot of it. And we don't need to do a *lot* of random access to our data -- it really doesn't feel like an RDBMS situation. Perhaps if we made an index out of a hash of each of our data values, and did some 'normalization', that could be the key. Or maybe the metadata is not going to be as large as I thought... hrm.

I appreciate the input, and hope more people will chime in :)

On Wed, Feb 25, 2009 at 10:18 PM, Ryan Rawson <[email protected]> wrote:

> Hey,
>
> You have to be clear about what hbase does and does not do. HBase is just
> not a relational database - its "weakness" is its strength.
>
> In general, you can only access rows in key order. Keys are stored
> lexicographically sorted, however. There aren't declarative secondary
> indexes (minus the lucene thing, but that isn't an index). You have to put
> all these pieces together to build a system.
>
> But, you get scalability, and reasonable performance, and in 0.20 you will
> get really good performance (fast enough to serve websites hopefully).
>
> In general you need to make sure your row-key sorts data in the order you
> want to query by. You can do something like this:
>
> <user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>
>
> to store events in reverse chronological order by user.
>
> If you want another access method, you need to use a map-reduce and build a
> secondary index.
>
> I don't know if this exactly answers your question, but hopefully it should
> give you more of an idea of what hbase does and does not do.
>
> -ryan
>
> On Wed, Feb 25, 2009 at 9:02 PM, Bradford Stephens <[email protected]> wrote:
>
> > Greetings,
> >
> > I'm in charge of the data analysis and collection platform at my company,
> > and we're basing a large part of our core analysis platform on Hadoop,
> > Nutch, and Lucene -- it's a delight to use. However, we're going to be
> > wanting some on-demand "web-scale" business intelligence, and I'm
> > wondering if HBase is the right solution -- my research hasn't given me
> > any conclusions.
> >
> > Our data set is pretty simple -- a bunch of XML documents which have been
> > parsed from HTML pages, and some associated data (Author Name, Post Date,
> > Influence, etc). What we would like to be able to do is have our end
> > users do real-time (< 10 seconds) OLAP-type analysis on this, and have it
> > presented on a webpage. For example, queries like ("All authors for the
> > past two weeks who have used these keywords in the post bodies and what
> > their influence score is"). I imagine we'll have several terabytes of
> > data to go through, and we won't be able to do much pre-population of
> > results.
> >
> > Is HBase low-latency enough that we can scale-out to solve these sorts of
> > problems?
> >
> > Cheers,
> > Bradford
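For anyone following along, here is a minimal sketch of the reverse-chronological row key Ryan describes above. It is plain Java with no HBase client calls, so it only illustrates the sort order. The class name, the `rowKey` helper, the `|` separator, and the zero-padded decimal encoding are all assumptions made for illustration; they are not anything prescribed by HBase. The padding is there because keys sort lexicographically, so the reversed timestamp needs a fixed width to compare correctly as text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class ReverseTimeRowKey {

    // Hypothetical helper: builds "<user>|<Long.MAX_VALUE - timestamp>|<event id>".
    // Assumes the user name never contains '|' and the timestamp is non-negative.
    static String rowKey(String user, long timestampMillis, String eventId) {
        if (timestampMillis < 0) {
            throw new IllegalArgumentException("timestamp must be non-negative");
        }
        long reversed = Long.MAX_VALUE - timestampMillis;
        // Long.MAX_VALUE has 19 decimal digits, so pad to 19 to keep the width fixed.
        String padded = String.format(Locale.ROOT, "%019d", reversed);
        return user + "|" + padded + "|" + eventId;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        List<String> keys = new ArrayList<>();
        keys.add(rowKey("alice", now - 60_000L, "login"));
        keys.add(rowKey("alice", now, "post"));
        keys.add(rowKey("bob", now - 1_000L, "login"));

        // Sorting the strings mimics the lexicographic key order of the table:
        // each user's rows are contiguous, newest event first.
        Collections.sort(keys);
        for (String key : keys) {
            System.out.println(key);
        }
    }
}
```

With keys shaped like this, "the most recent N events for a user" becomes a short scan starting at the `<user>|` prefix, which is the kind of access the key order supports directly. Anything else, such as the keyword query in the original message, would still need a separately built index.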
