On Sun, Feb 5, 2012 at 8:56 PM, Jon Bender <[email protected]> wrote:
> The two alternatives I am exploring are
>
> 1. Running a TableInputFormat MR job that filters for data added in the
> past day (Scan on the internal timestamp range of the cells)

You'll touch all your data when you do this. What percentage of the total data is the 300k new rows?

> 2. Using a batched get (multiGet) with a list of the rows that were written
> the previous day, most likely using a number of HBase client processes to
> read this data out in parallel.
>

If you have the list of the 300k, this could work. You could write a MapReduce job that divides the 300k row keys among the map tasks, and in each mapper run a client that does a multiget (it'll sort the gets by region for you).

St.Ack
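
A rough sketch of the key-splitting idea above, in Python for brevity. It is not HBase code: `split_for_mappers`, `batches`, and `run_mapper` are hypothetical helper names, and `multiget` is a stand-in callable for whatever batched get the client offers (in the Java client that would be `HTable.get(List<Get>)`). The batch size of 1000 is an arbitrary assumption. Sorting the key list before splitting is optional, but since regions are key ranges it would likely keep each mapper's slice within fewer regions.

```python
from typing import Callable, Dict, Iterator, List


def split_for_mappers(row_keys: List[str], num_mappers: int) -> List[List[str]]:
    """Divide the row keys into num_mappers contiguous, near-equal slices."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    size, extra = divmod(len(row_keys), num_mappers)
    slices, start = [], 0
    for i in range(num_mappers):
        # The first `extra` slices take one additional key each.
        end = start + size + (1 if i < extra else 0)
        slices.append(row_keys[start:end])
        start = end
    return slices


def batches(keys: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive multiget-sized batches from one mapper's slice."""
    for i in range(0, len(keys), batch_size):
        yield keys[i:i + batch_size]


def run_mapper(
    keys: List[str],
    multiget: Callable[[List[str]], Dict[str, str]],  # hypothetical client call
    batch_size: int = 1000,  # assumed value, tune for your cluster
) -> Dict[str, str]:
    """What a single mapper would do: one batched get per batch of keys."""
    results: Dict[str, str] = {}
    for batch in batches(keys, batch_size):
        results.update(multiget(batch))
    return results


if __name__ == "__main__":
    # Stand-in for the table: an in-memory dict instead of a real cluster.
    store = {f"row{i:06d}": f"value{i}" for i in range(10)}
    keys = sorted(store)
    for mapper_id, mapper_keys in enumerate(split_for_mappers(keys, 3)):
        rows = run_mapper(mapper_keys, lambda b: {k: store[k] for k in b}, batch_size=2)
        print(mapper_id, len(rows))
```

In a real job each slice would be handed to one map task (for example, one input file of keys per mapper), and the lambda would be replaced by the actual client call.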
