If you are only doing 1k-100k record set analytics, would it be
feasible to use the HBase client directly, perform a filtered scan,
and do the analytics in memory using Java Collections?  It depends on
the number of dimensions you need, but 100k rows of a few Integers is
not absurd to hold in memory, especially if you use some of the
optimized Java collection classes.  A colleague of mine did some
research into Collection implementations and blogged the findings:
http://gbif.blogspot.com/2009/06/profiling-memory-usage-of-various.html
We ended up settling on the Trove implementation, and I quite
regularly hold 2 million record HashMaps with the likes of -Xmx4g to
-Xmx8g, depending on what I am doing.
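
Just to illustrate the idea (an untested sketch -- the "docs" table,
the "tf" column family and the value encoding are made up, so adjust
for your own schema and your HBase/Trove versions), something along
these lines with the plain client and a Trove map:

import gnu.trove.TObjectIntHashMap;  // Trove 2.x; 3.x moved it to gnu.trove.map.hash
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TermFrequencyAggregator {

  // docIds: the 1k-100k row keys the user submitted
  public static TObjectIntHashMap<String> aggregate(Iterable<String> docIds)
      throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();  // 0.20.x style
    HTable table = new HTable(conf, "docs");             // hypothetical table name
    byte[] family = Bytes.toBytes("tf");                 // hypothetical column family

    // term -> summed frequency across the requested documents
    TObjectIntHashMap<String> totals = new TObjectIntHashMap<String>();

    for (String docId : docIds) {
      Get get = new Get(Bytes.toBytes(docId));
      get.addFamily(family);
      Result result = table.get(get);
      for (KeyValue kv : result.raw()) {      // one KeyValue per term column
        String term = Bytes.toString(kv.getQualifier());
        int tf = Bytes.toInt(kv.getValue());  // assumes values stored as 4-byte ints
        int current = totals.containsKey(term) ? totals.get(term) : 0;
        totals.put(term, current + tf);
      }
    }
    return totals;
  }
}

Whether that beats firing up a MapReduce job depends on how wide your
rows really are -- 1k+ columns per row is a fair amount of data per
Get -- but for 1k-100k ids it is worth benchmarking before building
anything more elaborate.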

Just an idea,
Tim


On Sun, Sep 19, 2010 at 4:19 AM, jason <urg...@gmail.com> wrote:
> Hi All,
>
> I am very new to Hadoop and HBase, and here is my first question, as
> I was not able to get a definite answer from various sources.
>
> Could you help me understand if what I am planning to do is going to
> fly with HBase?
>
> Let's say I have a table with around 100M rows (which may grow to
> 1B+ in the future) and 1k+ columns.
> Each row represents a text document, and the columns are the
> document's term frequencies (that is, a String column id and an
> Integer value).
> Each document is identified by its natural document id (also a String).
>
> When a user submits a set of random doc ids (with the set size
> anywhere between 1k and 100k), the system will run a MapReduce job
> to further analyze frequency information on that subset (the usual
> data mining stuff).
>
> My understanding is that MapReduce is designed to perform a full table scan.
> But I have a concern that scanning through 100M rows (potentially
> 1B+) every time to analyze a user's much smaller subset is not
> really efficient...
> But what are my options?
> Should I use a scanner at all? Will it be faster with more region
> servers thrown into the cluster?
> Do I have to preprocess the input row ids and dispatch them manually
> to the region servers, or can HBase do that for me?
>
