I should qualify that statement, actually.

I was comparing scanning 1m KVs to getting 1m KVs when all KVs are returned.

As James Taylor pointed out to me privately: a fairer comparison would have 
been to run a scan with a filter that lets x% of the rows pass (i.e. the 
selectivity of the scan would be x%) and compare that to a multi Get of the 
same x% of the rows.

There we found that a Scan+Filter is more efficient than issuing multi Gets if 
x is >= 1-2%.


Or in other words, translating many Gets into a Scan+Filter is beneficial if 
the Scan would return at least 1-2% of the rows to the client.
For example:
- If you are looking for fewer than 10-20k rows in 1m rows, using multi Gets is 
likely more efficient.
- If you are looking for more than 10-20k rows in 1m rows, using a Scan+Filter is 
likely more efficient.
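
To make that concrete in code, here is a minimal sketch of that decision rule. 
The cutoff reflects the 1-2% figure above; GetOrScan, fetch(), 
SELECTIVITY_CUTOFF, and makeRowFilter() are names I made up for illustration, 
not existing HBase API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.Filter;

public class GetOrScan {
  // ~1-2% cutoff, per the measurements above
  static final double SELECTIVITY_CUTOFF = 0.015;

  static List<Result> fetch(HTable table, List<byte[]> rowKeys,
      long tableRowCount) throws IOException {
    List<Result> results = new ArrayList<Result>(rowKeys.size());
    if ((double) rowKeys.size() / tableRowCount < SELECTIVITY_CUTOFF) {
      // Low selectivity: a batched multi Get is likely cheaper.
      List<Get> gets = new ArrayList<Get>(rowKeys.size());
      for (byte[] row : rowKeys) {
        gets.add(new Get(row));
      }
      for (Result r : table.get(gets)) {
        results.add(r);
      }
    } else {
      // High selectivity: one Scan, filtered down to the requested rows.
      Scan scan = new Scan();
      scan.setFilter(makeRowFilter(rowKeys)); // hypothetical filter factory
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          results.add(r);
        }
      } finally {
        scanner.close();
      }
    }
    return results;
  }

  // Hypothetical helper: build a Filter that matches exactly these row keys
  // (e.g. a custom RowFilter variant, as discussed further down the thread).
  static Filter makeRowFilter(List<byte[]> rowKeys) {
    throw new UnsupportedOperationException("sketch only");
  }
}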


Of course this is predicated on having an efficient way to represent the rows 
you are looking for in a filter, so in practice the balance probably shifts 
slightly more towards Gets (just imagine a Filter that has to encode 100k 
random row keys to be matched; and since Filters are instantiated per store, 
there is another natural limit there).


As I said below, the crux of the matter is having some histograms of your data, 
so that such a decision could be made automatically.


-- Lars



________________________________
 From: lars hofhansl <la...@apache.org>
To: "user@hbase.apache.org" <user@hbase.apache.org> 
Sent: Monday, February 18, 2013 5:48 PM
Subject: Re: Optimizing Multi Gets in hbase
 
As it happens, we ran some tests around this last week.
It turns out that doing Gets in batches instead of a scan still gives you about 
1/3 of the scan's performance.

I.e. when you have a table with, say, 10m rows and scanning it takes N seconds, 
then calling 10m Gets in batches of 1000 takes ~3N, which is pretty impressive.
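
For reference, "Gets in batches" here just means chunking the keys and calling 
HTable.get(List<Get>) per chunk; a quick sketch (table and rowKeys assumed to 
be set up already):

final int BATCH = 1000;
for (int i = 0; i < rowKeys.size(); i += BATCH) {
  int end = Math.min(i + BATCH, rowKeys.size());
  List<Get> batch = new ArrayList<Get>(end - i);
  for (byte[] row : rowKeys.subList(i, end)) {
    batch.add(new Get(row));
  }
  Result[] results = table.get(batch); // one batched round trip
  // ... process results
}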

Now, this is with all data in the cache!
When the data is not in the cache and the Gets are random it is many orders of 
magnitude slower, as the Gets are sprayed all over the disk. In that case 
sorting the Gets and issuing scans would indeed be much more efficient.


The Gets in a batch are already sorted on the client, but as N. says it is hard 
to determine automatically when to turn many Gets into a Scan with filters. 
Without statistics/histograms I'd even wager a guess that it would be 
impossible to do.
Imagine you issue 10000 random Gets, but your table has 10bn rows; in that case 
it is almost certain that the Gets are faster than a scan.
Now imagine the Gets only cover a small key range. With statistics we could 
tell whether it would be beneficial to turn this into a scan.

It's not that hard to add statistics to HBase. We could do it as part of 
compactions and record the histograms in some table.


You can always do that yourself. If you suspect you are touching most rows in a 
table/region, just issue a scan with an appropriate filter (you may have to 
implement your own filter, though). Maybe we could add a version of RowFilter 
that matches against multiple keys.
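
As a rough sketch of such a filter against the 0.94-era Filter API (a real 
implementation would also need a no-arg constructor and the Writable 
write/readFields methods so it can be shipped to the region servers):

import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

// Lets only rows whose key is in the given set pass; all others are skipped.
public class MultiRowFilter extends FilterBase {
  private final Set<byte[]> rows = new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);

  public MultiRowFilter(Set<byte[]> rowKeys) {
    rows.addAll(rowKeys);
  }

  @Override
  public boolean filterRowKey(byte[] buffer, int offset, int length) {
    // Returning true means: filter this row out.
    return !rows.contains(Arrays.copyOfRange(buffer, offset, offset + length));
  }
}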


-- Lars



________________________________
From: Varun Sharma <va...@pinterest.com>
To: user@hbase.apache.org 
Sent: Monday, February 18, 2013 1:57 AM
Subject: Optimizing Multi Gets in hbase

Hi,

I am trying to do batched get(s) on a cluster. Here is the code:

List<Get> gets = ...
// Prepare my Gets with the rows I need
Result[] results = myHTable.get(gets);

I have two questions about the above scenario:
i) Is this the most optimal way to do this?
ii) I have a feeling that if there are multiple Gets in this case on the
same region, then each one of those will instantiate a separate scan over
the region, even though a single scan is sufficient. Am I mistaken here?

Thanks
Varun
