Hi, I've got a question about batch read performance in HBase. I have a nightly job that extracts the rows added to HBase during the previous day (currently around 300k new rows per day, and growing). The rows are spread fairly evenly over the key range, so we will inevitably have to read from most, if not all, regions to retrieve this data, and the reads will not be sequential across rows.
The two alternatives I am exploring are:

1. Running a TableInputFormat MR job that filters for data added in the past day (a Scan restricted to the internal timestamp range of the cells).
2. Using a batched get (multiGet) with a list of the rows that were written the previous day, most likely with a number of HBase client processes reading this data out in parallel.

Does anyone have any recommendations on which approach to take? I haven't used the new MultiGet operations, so I figured I'd ask the pros before diving in.

Cheers,
Jon
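
For reference, here is a minimal sketch of the client-side bookkeeping the two options would need. It uses plain Java with no HBase dependency. Everything in it is an assumption for illustration: the class and method names are hypothetical, the "previous day" is taken to be a UTC calendar day, and the batch size is arbitrary. For option 1, the window would presumably be passed to Scan.setTimeRange(min, max), which treats min as inclusive and max as exclusive. For option 2, each batch of row keys would presumably become one List<Get> handed to the table's get(List<Get>) call by one of the parallel clients.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; not part of HBase.
final class NightlyExtractSketch {

    // Option 1: compute {startInclusive, endExclusive} in epoch millis for the
    // UTC day before `today`. Assumption: cell timestamps are write times in millis.
    static long[] previousDayWindow(LocalDate today) {
        long end = today.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        long start = today.minusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        return new long[] {start, end};
    }

    // Option 2: split the list of row keys written yesterday into fixed-size
    // batches, one batch per multi-get request.
    static <T> List<List<T>> partition(List<T> keys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            int end = Math.min(i + batchSize, keys.size());
            batches.add(new ArrayList<>(keys.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        long[] window = previousDayWindow(LocalDate.now(ZoneOffset.UTC));
        System.out.println("scan window: [" + window[0] + ", " + window[1] + ")");

        // Made-up row keys, only to show the batching.
        List<String> keys = Arrays.asList("row-a", "row-b", "row-c", "row-d", "row-e");
        System.out.println("batches: " + partition(keys, 2));
    }
}
```

Note that the sketch does not address the actual question of which approach is faster; it only shows the inputs each one would be driven by.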
