Looking at the code, it seems possible to do this server side within the multi invocation: we could group the Gets by region and do a single scan per region. We could also add some heuristics if necessary...
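As a rough illustration of the grouping idea (plain Java, not actual HBase code): the sketch below sorts the requested row keys, buckets them by the start key of the region that would contain them, and then applies a simple per-region heuristic. Everything here is an assumption made for the example: the class and method names are hypothetical, row keys are modelled as ASCII strings rather than byte[], the region start keys are passed in by hand, and the per-region row estimate stands in for the statistics/histograms that the thread below says HBase does not have. The 2% threshold is taken from the 1-2% figure quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of HBase.
final class MultiGetPlanner {
    // Assumption: upper end of the 1-2% selectivity figure from the thread.
    static final double SCAN_SELECTIVITY_THRESHOLD = 0.02;

    /**
     * Groups row keys by the start key of the region containing them.
     * The first region is expected to have the empty string as its start key.
     */
    static Map<String, List<String>> groupByRegion(List<String> rowKeys,
                                                   List<String> regionStartKeys) {
        TreeSet<String> starts = new TreeSet<>(regionStartKeys);
        List<String> sorted = new ArrayList<>(rowKeys);
        Collections.sort(sorted);

        TreeMap<String, List<String>> byRegion = new TreeMap<>();
        for (String row : sorted) {
            // Greatest region start key that is <= the row key.
            String start = starts.floor(row);
            if (start == null) {
                throw new IllegalArgumentException("no region covers row " + row);
            }
            byRegion.computeIfAbsent(start, k -> new ArrayList<>()).add(row);
        }
        return byRegion;
    }

    /**
     * Returns true if one scan+filter looks cheaper than individual Gets.
     * Without a row estimate (<= 0) it falls back to Gets.
     */
    static boolean preferScan(int requestedRows, long estimatedRegionRows) {
        if (estimatedRegionRows <= 0) {
            return false;
        }
        return requestedRows >= SCAN_SELECTIVITY_THRESHOLD * estimatedRegionRows;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = groupByRegion(
                Arrays.asList("row7", "row1", "row5"),
                Arrays.asList("", "row4"));
        long estimatedRowsPerRegion = 60; // made-up statistic for the demo
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            String plan = preferScan(e.getValue().size(), estimatedRowsPerRegion)
                    ? "scan+filter" : "multi get";
            System.out.println("region starting at '" + e.getKey() + "': "
                    + e.getValue() + " -> " + plan);
        }
    }
}
```

The point of keeping the decision per region is that a batch can be sparse over the table as a whole and still dense within one region, so a table-wide ratio alone could pick the wrong plan.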
On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl <la...@apache.org> wrote:

> I should qualify that statement, actually.
>
> I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> returned.
>
> As James Taylor pointed out to me privately: A fairer comparison would
> have been to run a scan with a filter that lets x% of the rows pass (i.e.
> the selectivity of the scan would be x%) and compare that to a multi Get of
> the same x% of the rows.
>
> There we found that a Scan+Filter is more efficient than issuing multi
> Gets if x is >= 1-2%.
>
> Or in other words, translating many Gets into a Scan+Filter is beneficial
> if the Scan would return at least 1-2% of the rows to the client.
> For example:
> if you are looking for fewer than 10-20k rows in 1m rows, using multi Gets
> is likely more efficient.
> if you are looking for more than 10-20k rows in 1m rows, using a
> Scan+Filter is likely more efficient.
>
> Of course this is predicated on whether you have an efficient way to
> represent the rows you are looking for in a filter, so that would probably
> shift this slightly more towards Gets (just imagine a Filter that has to
> encode 100k random row keys to be matched; since Filters are instantiated
> per store there is another natural limit there).
>
> As I said below, the crux of the matter is having some histograms of your
> data, so that such a decision could be made automatically.
>
> -- Lars
>
> ________________________________
> From: lars hofhansl <la...@apache.org>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Monday, February 18, 2013 5:48 PM
> Subject: Re: Optimizing Multi Gets in hbase
>
> As it happens, we did some tests around this last week.
> Turns out doing Gets in batches instead of a scan still gives you 1/3 of
> the performance.
>
> I.e. when you have a table with, say, 10m rows and scanning takes N
> seconds, then calling 10m Gets in batches of 1000 takes ~3N, which is
> pretty impressive.
>
> Now, this is with all data in the cache!
> When the data is not in the cache and the Gets are random it is many
> orders of magnitude slower, as the Gets are sprayed all over the disk. In
> that case sorting the Gets and issuing scans would indeed be much more
> efficient.
>
> The Gets in a batch are already sorted on the client, but as N. says it is
> hard to determine when to turn many Gets into a Scan with filters
> automatically. Without statistics/histograms I'd even wager a guess that
> it would be impossible to do.
> Imagine you issue 10000 random Gets, but your table has 10bn rows; in that
> case it is almost certain that the Gets are faster than a scan.
> Now imagine the Gets only cover a small key range. With statistics we could
> tell whether it would be beneficial to turn this into a scan.
>
> It's not that hard to add statistics to HBase. Would do it as part of the
> compactions, and record the histograms in some table.
>
> You can always do that yourself. If you suspect you are touching most rows
> in a table/region, just issue a scan with an appropriate filter (may have to
> implement your own filter, though). Maybe we could add a version of
> RowFilter that matches against multiple keys.
>
> -- Lars
>
> ________________________________
> From: Varun Sharma <va...@pinterest.com>
> To: user@hbase.apache.org
> Sent: Monday, February 18, 2013 1:57 AM
> Subject: Optimizing Multi Gets in hbase
>
> Hi,
>
> I am trying to do batched get(s) on a cluster. Here is the code:
>
> List<Get> gets = ...
> // Prepare my gets with the rows i need
> myHTable.get(gets);
>
> I have two questions about the above scenario:
> i) Is this the most optimal way to do this?
> ii) I have a feeling that if there are multiple gets in this case, on the
> same region, then each one of those shall instantiate separate scan(s) over
> the region even though a single scan is sufficient. Am I mistaken here?
>
> Thanks
> Varun
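
On the suggestion above of a RowFilter variant that matches against multiple keys: the core of such a filter could look roughly like the sketch below. This is plain Java and deliberately does not implement the HBase Filter interface; the class and method names are hypothetical, and row keys are modelled as strings instead of byte[]. It only shows the matching logic: keep the wanted keys in a sorted set, test membership for each row the scan visits, and look up the next wanted key so a scan could skip ahead instead of visiting every row in between.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.TreeSet;

// Hypothetical matcher, not an HBase Filter implementation.
final class MultiRowKeyMatcher {
    private final TreeSet<String> wanted;

    MultiRowKeyMatcher(Collection<String> rowKeys) {
        this.wanted = new TreeSet<>(rowKeys);
    }

    /** True if the scanned row is one of the requested rows. */
    boolean matches(String rowKey) {
        return wanted.contains(rowKey);
    }

    /**
     * Smallest requested key that is >= the given row key, or null if the
     * scan has passed the last requested key and can stop.
     */
    String nextWantedAtOrAfter(String rowKey) {
        return wanted.ceiling(rowKey);
    }

    public static void main(String[] args) {
        MultiRowKeyMatcher m =
                new MultiRowKeyMatcher(Arrays.asList("row3", "row1", "row9"));
        for (String scanned : Arrays.asList("row1", "row2", "row3")) {
            System.out.println(scanned + " matches=" + m.matches(scanned)
                    + " next=" + m.nextWantedAtOrAfter(scanned));
        }
    }
}
```

As the thread notes, the set of keys still has to be shipped to and held by the server, so this only pays off while the key list stays reasonably small.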