If you are only doing analytics on 1k-100k record sets, would it be feasible to use the HBase client directly, perform a filtered scan, and do the analytics in memory using Java Collections? It depends on the number of dimensions you need, but 100k rows of a few Integers is not absurd to hold in memory, especially if you use some of the optimized Java collection classes.

A colleague of mine did some research into Collection implementations and blogged some findings: http://gbif.blogspot.com/2009/06/profiling-memory-usage-of-various.html We ended up settling on the Trove implementation, and I quite regularly hold 2-million-record HashMaps with the likes of -Xmx4G to -Xmx8G depending on what I am doing.
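To make that concrete, here is a rough sketch of the in-memory approach. It is not a definitive implementation: the table name "docs", the column family "tf", and the int-encoded frequency values are assumptions about your schema, I use simple per-row Gets rather than a filtered scan to keep it short, the client calls are the classic HTable API, and the import path is for Trove 2.x (Trove 3 moved the class to gnu.trove.map.hash).

import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

import gnu.trove.TObjectIntHashMap;  // Trove 2.x; Trove 3: gnu.trove.map.hash.TObjectIntHashMap

public class TermFrequencyAggregator {

  /**
   * Fetch the rows for the user's doc ids and sum the term frequencies in
   * memory. Assumes a hypothetical table "docs" with a family "tf" whose
   * qualifiers are terms and whose values are int-encoded frequencies.
   */
  public static TObjectIntHashMap<String> aggregate(Set<String> docIds)
      throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "docs");
    TObjectIntHashMap<String> totals = new TObjectIntHashMap<String>();
    try {
      for (String docId : docIds) {
        // Point read of one document's term-frequency row.
        Get get = new Get(Bytes.toBytes(docId));
        get.addFamily(Bytes.toBytes("tf"));
        Result result = table.get(get);
        if (result.isEmpty()) {
          continue;
        }
        // Accumulate per-term counts across the whole doc id set.
        for (KeyValue kv : result.raw()) {
          String term = Bytes.toString(kv.getQualifier());
          int freq = Bytes.toInt(kv.getValue());
          totals.adjustOrPutValue(term, freq, freq);
        }
      }
    } finally {
      table.close();
    }
    return totals;
  }
}

Even at 100k doc ids the resulting map is only as large as your term vocabulary, so the usual data-mining steps (top-N terms, co-occurrence counts, etc.) can then run over plain in-memory collections without a MapReduce job at all.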
Just an idea,
Tim

On Sun, Sep 19, 2010 at 4:19 AM, jason <urg...@gmail.com> wrote:
> Hi All,
>
> I am very new to Hadoop and HBase, and here is my first question, as I
> was not able to get a definite answer from various sources.
>
> Could you help me understand if what I am planning to do is going to
> fly with HBase?
>
> Let's say I have a table with around 100M rows (which may grow to 1B+
> in the future) and 1k+ columns.
> Each row represents a text document and the columns are the doc's term
> frequencies (that is, a String column id and an Integer value).
> Each document is identified by its natural document id (also a String).
>
> When a user submits a set of random doc ids (with the set size
> anywhere between 1k and 100k), the system will run a MapReduce job to
> further analyze the frequency information on the given subset (usual
> data mining stuff).
>
> My understanding is that MapReduce is designed to perform a full table scan.
> But I have a concern that scanning through 100M rows (potentially
> 1B+) every time just to analyze a user's much smaller subset is not
> really efficient...
> But what are my options?
> Should I use a scanner at all? Will it be faster with more region
> servers thrown into the cluster?
> Do I have to preprocess the input row ids and dispatch them manually to
> region servers, or can HBase do that for me?