HBase Read Performance - Multiget vs TableInputFormat Job

Jon Bender Sun, 05 Feb 2012 20:57:02 -0800

Hi,

I have a question about batch read performance in HBase.  I run a
nightly job that extracts the HBase rows added during the previous day
(currently upwards of 300k new rows).  The rows are spread fairly evenly
over the key range, so we will inevitably have to read from most, if not
all, regions to retrieve this data, and these reads will not be
sequential across rows.


The two alternatives I am exploring are

   1. Running a TableInputFormat MR job that filters for data added in the
   past day (Scan on the internal timestamp range of the cells)
   2. Using a batched get (multiGet) with a list of the rows that were
   written the previous day, most likely using a number of HBase client
   processes to read this data out in parallel.
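
To make the two options concrete, here is a minimal sketch of the client-side plumbing each one would need, written with the Java standard library only. The class and method names (RowKeyBatcher, partition, previousDayRange) are hypothetical helpers, not HBase APIs. The assumption is that the time range would be passed to Scan.setTimeRange(min, max), which treats the range as half-open, and that each batch of keys would be turned into a List<Get> and handed to the client's get(List<Get>) call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the HBase API.
final class RowKeyBatcher {
    private RowKeyBatcher() {}

    /**
     * Option 2: split the row keys written the previous day into consecutive
     * batches of at most batchSize keys. Each batch could then be converted
     * to a List<Get> and sent in one multi-get call, with batches spread
     * across several client processes or threads.
     */
    static List<List<String>> partition(List<String> rowKeys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < rowKeys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rowKeys.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(rowKeys.subList(start, end)));
        }
        return batches;
    }

    /**
     * Option 1: compute the half-open [start, end) range, in epoch
     * milliseconds, covering the 24 hours before the given midnight.
     * The two values are what a timestamp-range Scan would be given.
     */
    static long[] previousDayRange(long midnightMillis) {
        long dayMillis = 24L * 60 * 60 * 1000;
        return new long[] { midnightMillis - dayMillis, midnightMillis };
    }
}
```

The batch size is a tuning knob I have not measured: larger batches mean fewer round trips, but each call then fans out to more regions at once.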

Does anyone have any recommendations on which approach to take?  I haven't
used the new MultiGet operations so I figured I'd ask the pros before
diving in.

Cheers,
Jon
