Re: scan.setTimeRange performance

Eugeny Morozov Mon, 24 Sep 2012 23:53:24 -0700

Hi, Jean-Daniel, thanks for the reply.

I've found the reason, and it's quite simple to understand. I don't know why
I missed it.
The processing was slow because the specified time range was too narrow.


So, first the Region Server selects the HFiles that it will scan.
Then it reads them (or just one HFile, as in my case), trying to find the
first 10 to 50 values that fall into the given time range. If the time range
is narrow, the Region Server must read the HFile almost completely. In
contrast, when there is no time range, it just returns the first 10 to 50
values.

That's the difference.
That's actually the reason I'm sure that time range is working correctly =)
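
To make the effect concrete, here is a rough plain-Java sketch. It is not HBase code: the class TimeRangeScanSketch, its methods, and the synthetic timestamp layout are all made up for illustration. It only counts how many cells a scanner would have to look at before it has collected the requested number of matches, assuming the half-open [min, max) range that scan.setTimeRange uses.

```java
class TimeRangeScanSketch {

    /**
     * Hypothetical model of one store file: cells in row order, where the
     * timestamp of cell i is (i % 1000), so timestamps are spread evenly.
     */
    static long[] syntheticFile(int cellCount) {
        long[] timestamps = new long[cellCount];
        for (int i = 0; i < cellCount; i++) {
            timestamps[i] = i % 1000;
        }
        return timestamps;
    }

    /**
     * Returns how many cells are examined before `limit` cells with a
     * timestamp in [minTs, maxTs) have been found, or the whole file
     * has been read.
     */
    static int cellsExamined(long[] timestamps, long minTs, long maxTs, int limit) {
        int examined = 0;
        int matched = 0;
        for (long ts : timestamps) {
            if (matched >= limit) {
                break; // enough values collected, stop reading
            }
            examined++;
            if (ts >= minTs && ts < maxTs) {
                matched++;
            }
        }
        return examined;
    }

    public static void main(String[] args) {
        long[] file = syntheticFile(10_000);

        // No effective time range: the first 10 cells all match.
        System.out.println(cellsExamined(file, 0L, Long.MAX_VALUE, 10)); // prints 10

        // Narrow range: only every 1000th cell matches, so almost the
        // whole file is read to collect 10 values.
        System.out.println(cellsExamined(file, 990L, 991L, 10)); // prints 9991
    }
}
```

In this toy model the narrow range forces 9991 of the 10000 cells to be examined for the same 10 results, which matches the slowdown described above in spirit; the real numbers depend on how timestamps are distributed in the HFile.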

On Mon, Sep 24, 2012 at 11:15 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> Hi Eugeny,
>
> The mailing list stripped your attachment (as it often does) so you
> might want to put it on a public web server.
>
> I don't have much to contribute except to point to a recent
> conversation that you can find here:
> http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/28722
>
> Hope this helps,
>
> J-D
>
> On Fri, Sep 21, 2012 at 5:03 AM, Eugeny Morozov
> <emoro...@griddynamics.com> wrote:
> > Hello!
> >
> > It is known, and I saw it in the code, that the time range set by
> > scan.setTimeRange is used to filter out HFiles for the further scan.
> > This means that the time taken by the following scanner.next must be
> > almost zero if I set the time range far away in the future. I am sure
> > that I do not have HFiles that fall into the set time range period.
> >
> > But - and here is the question - surprisingly, scanning with the time
> > range set takes far longer than without it.
> >
> > My results are following:
> > Use range [false]. Time spent (avg): [0]
> > Use range [true]. Time spent (avg): [525]
> >
> > There are KeyValues listed, when time range is not used.
> >
> > The code is following:
> >     public static void run(boolean useRange, HTable table) throws Exception {
> >         Scan scan = new Scan().addFamily( family ).setCaching( -1 ).setCacheBlocks( false );
> >         scan.setStartRow( random start row );
> >         if (useRange) scan.setTimeRange(1348114401600L, 1348114401700L);
> >
> >         ResultScanner scanner = table.getScanner(scan);
> >         // There were a bunch of measurements, where N was from 10 to 50
> >         for(int i = 0 ; i < N; i++) {
> >             long time = System.currentTimeMillis();
> >             result = scanner.next();
> >             sum += (System.currentTimeMillis() - time) / N;
> >         }
> >     }
> >
> > Of course such measurements include all sorts of noise like network
> > overhead, etc., but I'm using a virtual machine on my own box, and at the
> > time I take the measurements there is no other activity on either my own
> > box or this virtual machine, so such noise is minimal.
> >
> > Also I've used YourKit to trace and sample the running
> > HRegionServer, but didn't find anything suspicious. Though I didn't
> > look at heap and GC perf. The tracing is attached.
> >
> > So, the question is why is it so slow when time range is set and so fast
> > without it?
> > --
> > Evgeny Morozov
> > Developer Grid Dynamics
> > Skype: morozov.evgeny
> > www.griddynamics.com
> > emoro...@griddynamics.com
>



-- 
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
emoro...@griddynamics.com
