Bradford,

Many of us probably have some input, but it's really difficult to help
without more detail.

Can you be more specific about the layout of the data and the queries you'd
want to run?

HBase is efficient at scanning (as is HDFS), and it is also efficient at
random access by row key.  If you need to fetch based on column names
or values, then HBase will not be efficient without some form of secondary
indexing (additional tables in HBase or something external like Lucene).
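
To make that concrete, here is a rough sketch of the two access paths. It is
a toy model only: plain java.util.TreeMap instances stand in for HBase tables
(both keep row keys sorted), no HBase client API is used, and every name in
it (RowKeySketch, byKeyword, the "/" separator, the keyword field) is made up
for illustration. The row key follows the
<user> <Long.MAX_VALUE - timestamp> <event id> shape suggested in the quoted
message below, and the second map plays the role of an "additional table"
used as a secondary index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Toy model only: TreeMaps stand in for HBase tables (sorted by row key).
// No HBase client API is used; every name here is hypothetical.
final class RowKeySketch {
    // Primary "table": <user>/<reversed timestamp>/<event id> -> keyword.
    final TreeMap<String, String> events = new TreeMap<>();
    // Secondary index "table": <keyword>/<primary row key> -> primary row key.
    final TreeMap<String, String> byKeyword = new TreeMap<>();

    // Zero-pad the reversed timestamp to 19 digits so that string order
    // matches numeric order; newer events then sort first within a user.
    static String rowKey(String user, long tsMillis, String eventId) {
        return String.format(Locale.ROOT, "%s/%019d/%s",
                user, Long.MAX_VALUE - tsMillis, eventId);
    }

    // Write the event and its index entry together.
    void put(String user, long tsMillis, String eventId, String keyword) {
        String key = rowKey(user, tsMillis, eventId);
        events.put(key, keyword);
        byKeyword.put(keyword + "/" + key, key);
    }

    // Cheap path: a range scan over one user's contiguous row keys.
    List<String> eventsForUser(String user) {
        String prefix = user + "/";
        return new ArrayList<>(
                events.subMap(prefix, prefix + Character.MAX_VALUE).keySet());
    }

    // Fetch by value: scan the index, which yields primary row keys that
    // can then be looked up one by one in the primary table.
    List<String> eventsForKeyword(String keyword) {
        String prefix = keyword + "/";
        return new ArrayList<>(
                byKeyword.subMap(prefix, prefix + Character.MAX_VALUE).values());
    }

    public static void main(String[] args) {
        RowKeySketch t = new RowKeySketch();
        t.put("alice", 1000L, "e1", "hbase");
        t.put("alice", 2000L, "e2", "lucene");
        t.put("bob", 1500L, "e3", "hbase");
        System.out.println(t.eventsForUser("alice"));    // alice's events, newest first
        System.out.println(t.eventsForKeyword("hbase")); // primary keys via the index
    }
}
```

Without the second map, answering "which events mention this keyword" would
mean scanning every row, which is the inefficient case described above.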

JG 

> -----Original Message-----
> From: Bradford Stephens [mailto:[email protected]]
> Sent: Thursday, February 26, 2009 10:37 AM
> To: [email protected]
> Subject: Re: HBase and Web-Scale BI
> 
> Yes, it seems that the fundamental 'differentness' of HDFS/MapReduce is
> that they're not very well suited to random access -- I was hoping HBase
> had found a way 'around' that, but of course that 'differentness' is a
> fundamental strength of the HDFS way of doing things.
> 
> Where things have gotten murky is that our data is very simple -- we just
> have a lot of it. And we don't need to do a *lot* of random access to our
> data -- it really doesn't feel like an RDBMS situation.
> 
> Perhaps if we made an index out of a hash of each of our data values, and
> did some 'normalization', that could be the key. Or maybe the metadata is
> not going to be as large as I thought... hrm.
> 
> I appreciate the input, and hope more people will chime in :)
> 
> On Wed, Feb 25, 2009 at 10:18 PM, Ryan Rawson <[email protected]>
> wrote:
> 
> > Hey,
> >
> > You have to be clear about what HBase does and does not do.  HBase is
> > just not a relational database - its "weakness" is its strength.
> >
> > In general, you can only access rows in key order.  Keys are stored
> > lexicographically sorted, however.  There aren't declarative secondary
> > indexes (minus the Lucene thing, but that isn't an index).  You have to
> > put all these pieces together to build a system.
> >
> > But, you get scalability, and reasonable performance, and in 0.20 you
> > will get really good performance (fast enough to serve websites,
> > hopefully).
> >
> > In general you need to make sure your row-key sorts data in the order
> > you want to query by.  You can do something like this:
> >
> > <user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>
> >
> > to store each user's events in reverse chronological order.
> >
> > If you want another access method, you need to use a MapReduce job to
> > build a secondary index.
> >
> > I don't know if this exactly answers your question, but hopefully it
> > gives you more of an idea of what HBase does and does not do.
> >
> > -ryan
> >
> >
> >
> >
> >
> > On Wed, Feb 25, 2009 at 9:02 PM, Bradford Stephens <
> > [email protected]> wrote:
> >
> > > Greetings,
> > >
> > > I'm in charge of the data analysis and collection platform at my
> > > company, and we're basing a large part of our core analysis platform
> > > on Hadoop, Nutch, and Lucene -- it's a delight to use. However, we're
> > > going to be wanting some on-demand "web-scale" business intelligence,
> > > and I'm wondering if HBase is the right solution -- my research hasn't
> > > given me any conclusions.
> > >
> > > Our data set is pretty simple -- a bunch of XML documents which have
> > > been parsed from HTML pages, and some associated data (Author Name,
> > > Post Date, Influence, etc). What we would like to be able to do is have
> > > our end users do real-time (< 10 seconds) OLAP-type analysis on this,
> > > and have it presented on a webpage. For example, queries like ("All
> > > authors for the past two weeks who have used these keywords in the post
> > > bodies and what their influence score is"). I imagine we'll have
> > > several terabytes of data to go through, and we won't be able to do
> > > much pre-population of results.
> > >
> > > Is HBase low-latency enough that we can scale-out to solve these sorts
> > > of problems?
> > >
> > > Cheers,
> > > Bradford
> > >
> >
