Hey there, I've been working with Hadoop for about a year now, and have recently been tasked with building our new metadata storage and analysis platform. I'm looking for your advice on what I should research, and whether HBase is right for our use cases.
Currently, we're collecting documents onto our Hadoop cluster and indexing them with Lucene (and Katta). Documents have attributes like a create date, author, body text, domain, etc. We're looking at 20TB of data to start with, growing by a few dozen a day. I'm researching the best way to provide BI on top of this data that our customers can "slice and dice." HBase has some appealing characteristics, but I'm not sure it's *quite* what we need, since latency is an issue. Lucene has great indexing, but we're also going to be adding metadata constantly and making schema changes.

Here's a use case: a customer searches for a keyword in our web UI and gets back a list of a few hundred thousand documents. The customer then selects a few authors at random from those documents over a certain date range (say, 4 months) and wants a count of documents per author. A few hours later, those documents are tagged with some more metadata, say, the PageRank of the parent domain, and the customer can use that data as part of their queries as well. We'd like a response time of around 10 seconds. I don't care much about storage space, so denormalization is totally fine.

Is this a problem we can tackle with HBase or another open source distributed database? A company called "Vertica" claims to be able to do this, but I wasn't very impressed with their architecture. "Greenplum" also looks interesting, but I haven't researched them much yet. Thanks for all your help!
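
P.S. To make the access pattern concrete, here's a rough sketch of what I imagine if we went the denormalized HBase route. Everything in it is hypothetical: the "docs_by_author" table, the "meta" column family, and the <author>|<yyyyMMdd>|<docId> row-key layout are just placeholders I made up to illustrate the per-author count and the later metadata tagging, not anything we've built.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class AuthorCountSketch {

    // Hypothetical column family holding document attributes.
    private static final byte[] META = Bytes.toBytes("meta");

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical denormalized index table keyed <author>|<yyyyMMdd>|<docId>.
        HTable table = new HTable(conf, "docs_by_author");

        // Count documents per author over a 4-month window by scanning the
        // contiguous row-key range for each author the customer picked.
        String[] authors = {"alice", "bob"};  // chosen from the Lucene result set
        Map<String, Long> counts = new HashMap<String, Long>();

        for (String author : authors) {
            Scan scan = new Scan(
                Bytes.toBytes(author + "|20100101"),   // start of date range
                Bytes.toBytes(author + "|20100501"));  // end of date range (exclusive)
            scan.addFamily(META);

            ResultScanner scanner = table.getScanner(scan);
            long count = 0;
            for (Result r : scanner) {
                count++;
            }
            scanner.close();
            counts.put(author, count);
        }
        System.out.println(counts);

        // Later enrichment: tagging an existing row with a new attribute
        // (e.g. parent-domain PageRank) is just a Put to a new qualifier,
        // no schema migration needed.
        Put put = new Put(Bytes.toBytes("alice|20100215|doc-42"));
        put.add(META, Bytes.toBytes("pagerank"), Bytes.toBytes("0.73"));
        table.put(put);

        table.close();
    }
}

Does scanning per-author key ranges like this sound like a reasonable way to hit the ~10 second target, or would you structure it differently?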
