Hey Luke,

A couple comments inline below:

On Tue, Aug 3, 2010 at 8:40 AM, Luke Forehand <luke.foreh...@networkedinsights.com> wrote:

> Thanks to the help of people on this mailing list and Cloudera, our team
> has managed to get our 3 data node cluster with HBase running like a top.
> Our import rate is now around 3 GB per job, which takes about 10 minutes.
> This is great.  Now we are trying to tackle reading.
>
> With our current setup, a map reduce job with 24 mappers performing a
> full table scan of ~150 million records takes ~1 hour.  This won't work
> for our use case, because not only are we continuing to add more data to
> this table, but we are asking many more questions in a day.  To increase
> performance, the first thought was to use a secondary index table, and do
> range scans of the secondary index table, iteratively performing GET
> operations on the master table.
>
> In testing, the average GET operation took 37 milliseconds.  At that rate
> with 24 mappers it would take ~1.5 hours to scan 3 million rows.  This
> still seems like a lot of time.  37 milliseconds per GET is nice for
> "real time" access from a client, but not during massive GETs of data in
> a map reduce job.
>
>
The above is true if you assume you can only do one get at a time. In fact,
you can probably pipeline gets, and there's actually a patch in the works
for multiget support - HBASE-1845. I don't think it's being actively worked
on at the moment, though, so you'll have to do it somewhat manually. I'd
recommend multithreading in each mapper: feed the keys coming off the scan
into a small thread pool executor that performs the gets, which should buy
you some parallelism. Otherwise you'll find the mappers spend most of their
time waiting on the network rather than doing work.


> My question is, does it make sense to use secondary index tables in a map
> reduce job of this scale?  Should we not be using HBase for input in
> these map reduce jobs and go with raw SequenceFile?  Do we simply need
> more nodes?
>
>
It highly depends on the selectivity - if your secondary index lets you cut
out a very large percentage of the records, then you'll be saving time for
sure. If not, then you've just turned your sequential IO (read: fast) into
random IO (read: slow). A few random IOs beat a full sequential scan, but a
full sequential scan beats a large number of random IOs.
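
To put rough numbers on it from the figures in your message (a
back-of-the-envelope estimate, ignoring any pipelining): 24 mappers at 37
ms per GET is about 24 / 0.037 = ~650 gets/sec total, while your full scan
covered 150 million rows in an hour, or roughly 42,000 rows/sec. So the
index only wins if it narrows you down to fewer than about 1/65 of the
rows - a bit over 2 million of the 150 million. Your 3-million-row case
sits just past that break-even point, which lines up with the ~1.5 hour
estimate you got.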

One thing that no one has raised yet is whether you're using the Filter API.
If you're not already using Filters to apply a server side predicate, I'd
recommend looking into it. This will allow you to reduce the amount of
network traffic between the mappers and the region servers, and should
improve performance noticeably.
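
For example, something along these lines (a sketch against the 0.20 client
API - the family, qualifier, value, table name, and MyMapper are all
placeholders for whatever your actual predicate and job look like):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobSetup {
  public static void setupJob(Job job) throws IOException {
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qual"));
    scan.setCaching(500);  // fetch many rows per RPC rather than one

    // The comparison runs on the region server, so rows that fail the
    // predicate never cross the network to the mapper.
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("cf"), Bytes.toBytes("qual"),
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("some-value"));
    filter.setFilterIfMissing(true);  // also skip rows lacking the column
    scan.setFilter(filter);

    TableMapReduceUtil.initTableMapperJob("mastertable", scan,
        MyMapper.class, ImmutableBytesWritable.class, Result.class, job);
  }
}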

In the long run, I have some plans to implement bitmap indexes in HBase,
which would provide for very fast filter predicates during scans. It's not
on the road map for the next several months, unfortunately - probably
post-0.90.

Thanks
-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera
