The other suggestion sounds better to me: modify the multi call to run the
Get(s) with this new filter, or just initiate a scan covering all the
Get(s). The client already groups the multi calls by region server and only
calls the relevant region servers, so that would avoid calling all region
servers. This might be an interesting benchmark to run.
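
To make the grouping step concrete, here is a rough sketch of bucketing
sorted row keys by the region that owns them. It is only an illustration:
RegionGrouping and groupByRegion are made-up names, not HBase API, and row
keys are modelled as Strings rather than the byte[] keys HBase compares
lexicographically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of the HBase client API.
final class RegionGrouping {
    // regionStartKeys holds the start key of every region; the first region
    // is assumed to start at the empty string.
    static TreeMap<String, List<String>> groupByRegion(TreeSet<String> regionStartKeys,
                                                       List<String> rowKeys) {
        TreeMap<String, List<String>> groups = new TreeMap<>();
        // Copying into a TreeSet sorts the row keys and drops duplicates.
        for (String row : new TreeSet<>(rowKeys)) {
            // The owning region is the one with the greatest start key <= row.
            String region = regionStartKeys.floor(row);
            if (region == null) {
                throw new IllegalArgumentException("no region covers row " + row);
            }
            groups.computeIfAbsent(region, k -> new ArrayList<>()).add(row);
        }
        return groups;
    }
}
```

Each resulting bucket could then be served by one scan (or one filtered
scan) against just that region instead of one seek per Get.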

On Tue, Feb 19, 2013 at 9:28 AM, Nicolas Liochon <nkey...@gmail.com> wrote:

> Imho, the easiest thing to do would be to write a filter.
> You need to order the rows, then you can use hints to navigate to the next
> row (SEEK_NEXT_USING_HINT).
> The main drawback I see is that the filter will be invoked on all region
> servers, including the ones that don't need it. But this would also mean
> you have a very specific query pattern (which could be the case, I just
> don't know), and you can still use the startRow / stopRow of the scan, and
> create multiple scans if necessary. I'm also interested in Lars' opinion on
> this.
>
> Nicolas
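
The hint-based navigation described above can be modelled without any HBase
classes. The sketch below is a simplification under stated assumptions:
stored rows and wanted rows are plain in-memory sorted sets of Strings, and
SeekHintSketch / scanWithHints are invented names. In a real filter the
"seek" would be the SEEK_NEXT_USING_HINT return code plus a next-key hint;
here it is just a ceiling() lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustration only: models the seek-hint idea with sorted sets,
// not with the HBase Filter API.
final class SeekHintSketch {
    // Returns the wanted rows that exist in storedRows, jumping between
    // seek targets instead of visiting every stored row.
    static List<String> scanWithHints(TreeSet<String> storedRows, TreeSet<String> wantedRows) {
        List<String> found = new ArrayList<>();
        if (wantedRows.isEmpty()) {
            return found;
        }
        // Start at the first wanted row (what startRow would do for a scan).
        String current = storedRows.ceiling(wantedRows.first());
        while (current != null) {
            // Hint: the smallest wanted row >= the row we are positioned on.
            String hint = wantedRows.ceiling(current);
            if (hint == null) {
                break; // past the last wanted row, nothing left to find
            }
            if (hint.equals(current)) {
                found.add(current); // positioned on a wanted row: include it
                String nextWanted = wantedRows.higher(current);
                if (nextWanted == null) {
                    break;
                }
                current = storedRows.ceiling(nextWanted);
            } else {
                current = storedRows.ceiling(hint); // seek forward to the hint
            }
        }
        return found;
    }
}
```

With stored rows Row1, Row2, Row4, Row5 and wanted rows Row1 through Row5,
the missing Row3 is skipped by seeking straight to Row4.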
>
>
>
> On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > I have another question, if I am running a scan wrapped around multiple
> > rows in the same region, in the following way:
> >
> > Scan scan = new Scan(getWithMultipleRowsInSameRegion);
> >
> > Now, how does execution occur? Is it just a sequential scan across the
> > entire region, or does it seek to the HFile blocks containing the actual
> > values? What I mean is, let's say the multi get is on the following rows:
> >
> > Row1 : HFileBlock1
> > Row2 : HFileBlock20
> > Row3 : Does not exist
> > Row4 : HFileBlock25
> > Row5 : HFileBlock100
> >
> > The efficient way to do this would be to determine the correct blocks
> > using the index and then search within the blocks for, say, Row1. Then seek
> > to HFileBlock20 and look for Row2. Eliminate Row3, and then keep
> > seeking to and searching within HFileBlocks as needed.
> >
> > I am wondering if a scan wrapped around a Get with multiple rows would do
> > the same ?
> >
> > Thanks
> > Varun
> >
> > On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon <nkey...@gmail.com>
> > wrote:
> >
> > > Looking at the code, it seems possible to do this server side within the
> > > multi invocation: we could group the gets by region, and do a single scan.
> > > We could also add some heuristics if necessary...
> > >
> > >
> > >
> > > On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl <la...@apache.org>
> > > wrote:
> > >
> > > > I should qualify that statement, actually.
> > > >
> > > > I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> > > > returned.
> > > >
> > > > As James Taylor pointed out to me privately: A fairer comparison would
> > > > have been to run a scan with a filter that lets x% of the rows pass (i.e.
> > > > the selectivity of the scan would be x%) and compare that to a multi Get
> > > > of the same x% of the rows.
> > > >
> > > > There we found that a Scan+Filter is more efficient than issuing multi
> > > > Gets if x is >= 1-2%.
> > > >
> > > >
> > > > Or in other words, translating many Gets into a Scan+Filter is beneficial
> > > > if the Scan would return at least 1-2% of the rows to the client.
> > > > For example:
> > > > if you are looking for less than 10-20k rows in 1m rows, using multi Gets
> > > > is likely more efficient.
> > > > if you are looking for more than 10-20k rows in 1m rows, using a
> > > > Scan+Filter is likely more efficient.
> > > >
> > > >
> > > > Of course this is predicated on whether you have an efficient way to
> > > > represent the rows you are looking for in a filter, so that would probably
> > > > shift this slightly more towards Gets (just imagine a Filter that has to
> > > > encode 100k random row keys to be matched; since Filters are instantiated
> > > > per store there is another natural limit there).
> > > >
> > > >
> > > > As I said below, the crux of the matter is having some histograms of your
> > > > data, so that such a decision could be made automatically.
> > > >
> > > >
> > > > -- Lars
> > > >
> > > >
> > > >
> > > > ________________________________
> > > > From: lars hofhansl <la...@apache.org>
> > > > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > > Sent: Monday, February 18, 2013 5:48 PM
> > > > Subject: Re: Optimizing Multi Gets in hbase
> > > >
> > > > As it happens, we did some tests around this last week.
> > > > Turns out doing Gets in batches instead of a scan still gives you 1/3 of
> > > > the performance.
> > > >
> > > > I.e. when you have a table with, say, 10m rows and scanning takes N
> > > > seconds, then calling 10m Gets in batches of 1000 takes ~3N, which is
> > > > pretty impressive.
> > > >
> > > > Now, this is with all data in the cache!
> > > > When the data is not in the cache and the Gets are random it is many
> > > > orders of magnitude slower, as the Gets are sprayed all over the disk. In
> > > > that case sorting the Gets and issuing scans would indeed be much more
> > > > efficient.
> > > >
> > > >
> > > > The Gets in a batch are already sorted on the client, but as N. says it is
> > > > hard to determine when to turn many Gets into a Scan with filters
> > > > automatically. Without statistics/histograms I'd even wager a guess that
> > > > it would be impossible to do.
> > > > Imagine you issue 10000 random Gets, but your table has 10bn rows; in that
> > > > case it is almost certain that the Gets are faster than a scan.
> > > > Now imagine the Gets only cover a small key range. With statistics we could
> > > > tell whether it would be beneficial to turn this into a scan.
> > > >
> > > > It's not that hard to add statistics to HBase. Would do it as part of the
> > > > compactions, and record the histograms in some table.
> > > >
> > > >
> > > > You can always do that yourself. If you suspect you are touching most rows
> > > > in a table/region, just issue a scan with an appropriate filter (may have
> > > > to implement your own filter, though). Maybe we could add a version of
> > > > RowFilter that matches against multiple keys.
> > > >
> > > >
> > > > -- Lars
> > > >
> > > >
> > > >
> > > > ________________________________
> > > > From: Varun Sharma <va...@pinterest.com>
> > > > To: user@hbase.apache.org
> > > > Sent: Monday, February 18, 2013 1:57 AM
> > > > Subject: Optimizing Multi Gets in hbase
> > > >
> > > > Hi,
> > > >
> > > > I am trying to do batched get(s) on a cluster. Here is the code:
> > > >
> > > > List<Get> gets = ...
> > > > // Prepare my gets with the rows i need
> > > > myHTable.get(gets);
> > > >
> > > > I have two questions about the above scenario:
> > > > i) Is this the most optimal way to do this ?
> > > > ii) I have a feeling that if there are multiple gets in this case, on the
> > > > same region, then each one of those shall instantiate separate scan(s) over
> > > > the region even though a single scan is sufficient. Am I mistaken here?
> > > >
> > > > Thanks
> > > > Varun
> > > >
> > >
> >
>
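
For the benchmark, Lars's rule of thumb quoted above (Scan+Filter starts to
win once roughly 1-2% of the rows are returned) could be written down as a
tiny decision helper. This is only a sketch under assumptions:
GetOrScanHeuristic and preferScan are made-up names, the threshold is a
parameter rather than a measured constant, and rowsInRange stands for a
row-count estimate that would have to come from statistics HBase does not
keep today.

```java
// Hypothetical decision helper, not part of HBase.
final class GetOrScanHeuristic {
    // requestedRows: number of rows the multi Get asks for.
    // rowsInRange:   estimated rows between the first and last requested key
    //                (assumed to come from some external statistics).
    // threshold:     selectivity at which a scan is preferred, e.g. 0.01-0.02.
    static boolean preferScan(long requestedRows, long rowsInRange, double threshold) {
        if (rowsInRange <= 0) {
            return false; // no usable estimate: stay with multi Gets
        }
        double selectivity = (double) requestedRows / rowsInRange;
        return selectivity >= threshold;
    }
}
```

For example, 50k requested rows out of an estimated 1m is 5% selectivity, so
a 2% threshold picks the scan; 10000 random Gets against 10bn rows stays
with Gets.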
