On Sun, Feb 5, 2012 at 8:56 PM, Jon Bender <[email protected]> wrote:
> The two alternatives I am exploring are
>
>   1. Running a TableInputFormat MR job that filters for data added in the
>   past day (Scan on the internal timestamp range of the cells)

You'll touch all your data when you do this.

What percentage of total data is the 300k new rows?

>   2. Using a batched get (multiGet) with a list of the rows that were written
>   the previous day, most likely using a number of HBase client processes to
>   read this data out in parallel.
>

If you have the list of the 300k, this could work.  You could write a
mapreduce job that divides the 300k among map tasks and, in each mapper,
runs a client that does a multiget (it'll sort the gets by region for you).
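
A rough sketch of the splitting idea, in plain Python rather than an actual
mapreduce job: the key list is cut into fixed-size batches (each batch stands
in for one map task's input) and every batch is fetched with one batched get.
The names chunked, fetch_in_batches and batch_get are made up for
illustration; batch_get is a placeholder for whatever multiget call your
HBase client exposes, and the batch size of 1000 is an arbitrary assumption.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, bytes]


def chunked(row_keys: List[bytes], chunk_size: int) -> Iterator[List[bytes]]:
    """Yield consecutive slices of row_keys, each at most chunk_size long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(row_keys), chunk_size):
        yield row_keys[start:start + chunk_size]


def fetch_in_batches(
    row_keys: List[bytes],
    batch_get: Callable[[List[bytes]], Dict[bytes, Row]],
    chunk_size: int = 1000,
) -> Dict[bytes, Row]:
    """Fetch all rows by issuing one batched get per chunk of keys.

    batch_get is a hypothetical stand-in for the client's multiget call.
    In a real job each chunk would be handed to a separate mapper instead
    of being processed in this sequential loop.
    """
    results: Dict[bytes, Row] = {}
    for batch in chunked(row_keys, chunk_size):
        results.update(batch_get(batch))
    return results
```

With 300k keys and a batch size of 1000 this gives 300 batches; how many of
those to hand to each mapper is a tuning choice that depends on the cluster.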

St.Ack
