Right, that's the problem with the current setting.  If we change the
setting so that the buffer is measured in bytes, then I think there is a
decent 'one size fits all' setting, like 1MB.  You'd still want to adjust it
in some cases, but I think it would be a lot better by default.
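
Roughly what I have in mind, as a hypothetical API (nothing like this
exists today; the method name is made up purely for illustration):

  // hypothetical: buffer scan results until ~1MB is cached,
  // instead of a fixed number of rows
  Scan scan = new Scan();
  scan.setCachingBytes(1024 * 1024);  // not a real method today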

Dave

On Fri, Nov 20, 2009 at 4:36 PM, Ryan Rawson <[email protected]> wrote:

> The problem with this setting is there is no good 'one size fits all'
> value.  If we set it to 1, we do an RPC for every row, clearly not
> efficient for small rows.  If we set it to something as seemingly
> innocuous as 5 or 10, then map reduces which do a significant amount
> of processing on a row can cause the scanner to time out.  The client
> code will also give up if it's been more than 60 seconds since the
> scanner was last used; it's possible this code might need to be
> adjusted so we can resume scanning.
>
> I personally set it to anywhere between 1000-5000 for high performance
> jobs on small rows.
>
> The only factor is "can you process the cached chunk of rows in <
> 60s?".  Set the value as large as possible without violating this and
> you'll achieve max perf.
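>
> E.g. for those jobs, something like this (just a sketch with a made-up
> table name; caching can be set on the HTable, or per-Scan):
>
>   HBaseConfiguration conf = new HBaseConfiguration();
>   HTable table = new HTable(conf, "mytable");
>   table.setScannerCaching(2000);   // rows fetched per next() RPC
>
>   // or on an individual scan:
>   Scan scan = new Scan();
>   scan.setCaching(2000);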
>
> -ryan
>
> On Fri, Nov 20, 2009 at 4:20 PM, Dave Latham <[email protected]> wrote:
> > Thanks for your thoughts.  It's true you can configure the scan buffer
> > rows on an HTable or Scan instance, but I think there's something to be
> > said to try to work as well as we can out of the box.
> >
> > It would add a bit more complication, but not much.  To track the idea
> > and see what it would look like, I made an issue and attached a
> > proposed patch.
> >
> > Dave
> >
> > On Fri, Nov 20, 2009 at 1:55 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >
> >> And on the Scan as I wrote in my answer, which is really really
> >> convenient.
> >>
> >> Not convinced on using bytes as a value for caching... It would also
> >> be more complicated.
> >>
> >> J-D
> >>
> >> > On Fri, Nov 20, 2009 at 1:45 PM, Ryan Rawson <[email protected]> wrote:
> >> > You can set it on a per-HTable basis.  HTable.setScannerCaching(int);
> >> >
> >> >
> >> >
> >> > On Fri, Nov 20, 2009 at 1:43 PM, Dave Latham <[email protected]> wrote:
> >> >> I have some tables with large rows and some tables with very small
> >> >> rows, so I keep my default scanner caching at 1 row, but have to
> >> >> remember to set it higher when scanning tables with smaller rows.  It
> >> >> would be nice to have a default that did something reasonable across
> >> >> tables.
> >> >>
> >> >> Would it make sense to set scanner caching as a count of bytes rather
> >> >> than a count of rows?  That would make it similar to the write buffer
> >> >> for batches of puts that get flushed based on size rather than a fixed
> >> >> number of Puts.  Then there could be some default value which should
> >> >> provide decent performance out of the box.
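> >> >>
> >> >> For comparison, the write buffer is already sized in bytes; a sketch
> >> >> of that knob (table name is just an example):
> >> >>
> >> >>   HTable table = new HTable(new HBaseConfiguration(), "mytable");
> >> >>   table.setAutoFlush(false);                  // buffer Puts client-side
> >> >>   table.setWriteBufferSize(2 * 1024 * 1024);  // flush at ~2MB, not at N Puts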
> >> >>
> >> >> Dave
> >> >>
> >> >> On Fri, Nov 20, 2009 at 12:35 PM, Gary Helmling <[email protected]> wrote:
> >> >>
> >> >>> To set this per scan you should be able to do:
> >> >>>
> >> >>> Scan s = new Scan();
> >> >>> s.setCaching(...);
> >> >>>
> >> >>> (I think this works anyway)
> >> >>>
> >> >>>
> >> >>> The other thing that I've found useful is using a PageFilter on
> >> >>> scans:
> >> >>>
> >> >>>
> >> >>> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/filter/PageFilter.html
> >> >>>
> >> >>> I believe this is applied independently on each region server (?) so
> >> >>> you still need to do your own counting while iterating the results,
> >> >>> but it can be used to early out on the server side separately from
> >> >>> the scanner caching value.
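> >> >>>
> >> >>> A sketch of that combination (assuming the 0.20 client API, an
> >> >>> existing HTable named table, and an example limit of 10):
> >> >>>
> >> >>>   Scan s = new Scan();
> >> >>>   s.setFilter(new PageFilter(10));   // early out per region server
> >> >>>   ResultScanner scanner = table.getScanner(s);
> >> >>>   int count = 0;
> >> >>>   for (Result r : scanner) {
> >> >>>     if (++count > 10) break;         // enforce the global limit here
> >> >>>     // process r
> >> >>>   }
> >> >>>   scanner.close();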
> >> >>>
> >> >>> --gh
> >> >>>
> >> >>> On Fri, Nov 20, 2009 at 3:04 PM, stack <[email protected]> wrote:
> >> >>>
> >> >>> > There is this in the configuration:
> >> >>> >
> >> >>> >  <property>
> >> >>> >    <name>hbase.client.scanner.caching</name>
> >> >>> >    <value>1</value>
> >> >>> >    <description>Number of rows that will be fetched when calling
> >> >>> >    next on a scanner if it is not served from memory. Higher caching
> >> >>> >    values will enable faster scanners but will eat up more memory
> >> >>> >    and some calls of next may take longer and longer times when the
> >> >>> >    cache is empty.
> >> >>> >    </description>
> >> >>> >  </property>
> >> >>> >
> >> >>> >
> >> >>> > Being able to do it per Scan sounds like something we should add.
> >> >>> >
> >> >>> > St.Ack
> >> >>> >
> >> >>> >
> >> >>> > On Fri, Nov 20, 2009 at 11:43 AM, Adam Silberstein
> >> >>> > <[email protected]> wrote:
> >> >>> >
> >> >>> > > Hi,
> >> >>> > > Is there a way to specify a limit on the number of returned records
> >> >>> > > for a scan?  I don't see any way to do this when building the scan.
> >> >>> > > If there is, that would be great.  If not, what about when
> >> >>> > > iterating over the result?  If I exit the loop when I reach my
> >> >>> > > limit, will that approximate this clause?  I guess my real question
> >> >>> > > is about how scan is implemented in the client.  I.e. how many
> >> >>> > > records are returned from HBase at a time as I iterate through the
> >> >>> > > scan result?  If I want 1,000 records and 100 get returned at a
> >> >>> > > time, then I'm in good shape.  On the other hand, if I want 10
> >> >>> > > records and get 100 at a time, it's a bit wasteful, though the
> >> >>> > > waste is bounded.
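> >> >>> > >
> >> >>> > > I.e. would something like this be a reasonable way to fake a limit
> >> >>> > > (just a sketch, with a made-up limit of 1,000 and an existing
> >> >>> > > HTable named table):
> >> >>> > >
> >> >>> > >   ResultScanner scanner = table.getScanner(new Scan());
> >> >>> > >   int count = 0;
> >> >>> > >   for (Result r : scanner) {
> >> >>> > >     if (++count > 1000) break;  // early exit once I have enough
> >> >>> > >     // process r
> >> >>> > >   }
> >> >>> > >   scanner.close();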
> >> >>> > >
> >> >>> > > Thanks,
> >> >>> > > Adam
> >> >>> > >
> >> >>> >
> >> >>>
> >> >>
> >> >
> >>
> >
>
