Hey there, I've been working with Hadoop for about a year now, and have recently been tasked with building our new metadata storage and analysis platform. I'm looking for your advice on what I should research, and whether HBase is right for our use cases.
Currently, we're collecting documents onto our Hadoop cluster and indexing them with Lucene (and Katta). Documents have attributes like a create date, author, body text, domain, etc. We're looking at 20TB of data to start with, growing by a few dozen a day. I'm researching the best way to provide BI on top of this data that our customers can "slice and dice." HBase has some appealing characteristics, but I'm not sure it's *quite* what we need, since latency is an issue. Lucene has great indexing, but we're also going to be adding metadata constantly and making schema changes.

Here's a use case: a customer searches for a keyword in our web UI and gets back a list of a few hundred thousand documents. The customer then selects a few authors at random from those documents over a certain date range (say, 4 months) and wants a count of documents per author. A few hours later, those documents are tagged with some more metadata, say, the PageRank of the parent domain, and the customer can use that data as part of their queries as well. We'd like a response time of around 10 seconds. I don't care much about storage space, so denormalization is totally fine.

Is this a problem we can tackle with HBase or another open source distributed database? A company called "Vertica" claims to be able to do this, but I wasn't very impressed with their architecture. "Greenplum" also looks interesting, but I haven't researched them much yet. Thanks for all your help!
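
P.S. To make the access pattern concrete, here's a rough sketch of what I imagine if we went the denormalized HBase route. Everything in it is hypothetical: the "docs_by_author" table, the "meta" column family, and the <author>|<yyyyMMdd>|<docId> row-key layout are just placeholders I made up to illustrate the per-author count and the later metadata tagging, not anything we've built.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class AuthorCountSketch {

    // Hypothetical column family holding document attributes.
    private static final byte[] META = Bytes.toBytes("meta");

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical denormalized index table keyed <author>|<yyyyMMdd>|<docId>.
        HTable table = new HTable(conf, "docs_by_author");

        // Count documents per author over a 4-month window by scanning the
        // contiguous row-key range for each author the customer picked.
        String[] authors = {"alice", "bob"};  // chosen from the Lucene result set
        Map<String, Long> counts = new HashMap<String, Long>();

        for (String author : authors) {
            Scan scan = new Scan(
                Bytes.toBytes(author + "|20100101"),   // start of date range
                Bytes.toBytes(author + "|20100501"));  // end of date range (exclusive)
            scan.addFamily(META);

            ResultScanner scanner = table.getScanner(scan);
            long count = 0;
            for (Result r : scanner) {
                count++;
            }
            scanner.close();
            counts.put(author, count);
        }
        System.out.println(counts);

        // Later enrichment: tagging an existing row with a new attribute
        // (e.g. parent-domain PageRank) is just a Put to a new qualifier,
        // no schema migration needed.
        Put put = new Put(Bytes.toBytes("alice|20100215|doc-42"));
        put.add(META, Bytes.toBytes("pagerank"), Bytes.toBytes("0.73"));
        table.put(put);

        table.close();
    }
}

Does scanning per-author key ranges like this sound like a reasonable way to hit the ~10 second target, or would you structure it differently?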
