Re: HBase and Web-Scale BI

Bradford Stephens Thu, 26 Feb 2009 10:37:51 -0800

Yes, it seems that the fundamental 'differentness' of HDFS/MapReduce is that
they're not very well suited to random access -- I was hoping HBase had
found a way 'around' that, but of course that 'differentness' is a
fundamental strength of the HDFS way of doing things.


Where things have gotten murky is that our data is very simple -- we just
have a lot of it. And we don't need to do a *lot* of random access to our
data -- it really doesn't feel like an RDBMS situation.

Perhaps if we made an index out of a hash of each of our data values, and
did some 'normalization', that could be the key. Or maybe the metadata is
not going to be as large as I thought... hrm.

I appreciate the input, and hope more people will chime in :)

On Wed, Feb 25, 2009 at 10:18 PM, Ryan Rawson <[email protected]> wrote:

> Hey,
>
> You have to be clear about what hbase does and does not do.  HBase is just
> not a relational database - its "weakness" is its strength.
>
> In general, you can only access rows in key order.  Keys are stored
> lexicographically sorted however.  There aren't declarative secondary
> indexes (minus the lucene thing, but that isn't an index).  You have to put
> all these pieces together to build a system.
>
> But, you get scalability, and reasonable performance, and in 0.20 you will
> get really good performance (fast enough to serve websites hopefully).
>
> In general you need to make sure your row-key sorts data in the order you
> want to query by.  You can do something like this:
>
> <user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>
>
> to store events in reverse chronological order by users.
>
> If you want another access method, you need to use a map-reduce and build a
> secondary index.
>
> I don't know if this exactly answers your question, but hopefully it should
> give you more of an idea of what hbase does and does not do.
>
> -ryan
>
>
>
>
>
> On Wed, Feb 25, 2009 at 9:02 PM, Bradford Stephens <
> [email protected]> wrote:
>
> > Greetings,
> >
> > I'm in charge of the data analysis and collection platform at my company,
> > and we're basing a large part of our core analysis platform on Hadoop,
> > Nutch, and Lucene -- it's a delight to use. However, we're going to be
> > wanting some on-demand "web-scale" business intelligence, and I'm
> > wondering if HBase is the right solution -- my research hasn't given me
> > any conclusions.
> >
> > Our data set is pretty simple -- a bunch of XML documents which have been
> > parsed from HTML pages, and some associated data (Author Name, Post Date,
> > Influence, etc). What we would like to be able to do is have our end
> > users do real-time (< 10 seconds) OLAP-type analysis on this, and have it
> > presented on a webpage. For example, queries like ("All authors for the
> > past two weeks who have used these keywords in the post bodies and what
> > their influence score is"). I imagine we'll have several terabytes of
> > data to go through, and we won't be able to do much pre-population of
> > results.
> >
> > Is HBase low-latency enough that we can scale-out to solve these sorts of
> > problems?
> >
> > Cheers,
> > Bradford
> >
>
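
The row-key layout Ryan describes above, <user> <Long.MAX_VALUE -
System.currentTimeMillis()> <event id>, can be sketched roughly as follows.
This is only an illustration under stated assumptions, not HBase client code:
a java.util.TreeMap stands in for HBase's lexicographically sorted rows, and
the class name, the rowKey and eventsForUser helpers, the '|' separator, and
the zero-padded decimal encoding are all hypothetical choices made for the
example. It also assumes user ids never contain '|'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class ReverseTimeRowKey {
    // Hypothetical helper: builds "<user>|<Long.MAX_VALUE - ts>|<eventId>".
    static String rowKey(String user, long timestampMillis, String eventId) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        // Zero-pad to 19 digits (the width of Long.MAX_VALUE) so that
        // lexicographic string order matches numeric order.
        return String.format("%s|%019d|%s", user, reversed, eventId);
    }

    // Stand-in for a prefix scan: all rows for one user, in key order.
    // Assumes user ids do not contain '|'.
    static List<String> eventsForUser(TreeMap<String, String> table, String user) {
        String start = user + "|";  // first possible key for this user (inclusive)
        String stop = user + "}";   // '}' is the character after '|' (exclusive)
        return new ArrayList<>(table.subMap(start, stop).values());
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("alice", 1000L, "e1"), "oldest");
        table.put(rowKey("alice", 3000L, "e3"), "newest");
        table.put(rowKey("alice", 2000L, "e2"), "middle");
        table.put(rowKey("bob", 2500L, "e9"), "other user");
        // Newer events get smaller reversed timestamps, so they sort first.
        System.out.println(eventsForUser(table, "alice"));
    }
}
```

Because the reversed timestamp shrinks as time advances, a scan that starts at
the user's prefix returns that user's most recent events first, which is the
"reverse chronological order by users" behaviour described in the reply.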
