Right, that's the problem with the current setting. If we change the setting so that the buffer is measured in bytes, then I think there is a decent 'one size fits all' setting, like 1MB. You'd still want to adjust it in some cases, but I think it would be a lot better by default.
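As a rough illustration of what that would mean on the client side: buffer rows until an estimated byte budget is hit, then hand the batch to the caller. The class below is a hypothetical sketch, not the actual patch; the per-row size estimate would be whatever the client can compute cheaply (summing KeyValue lengths, say).

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: buffer rows until an estimated byte budget is
// reached ("cache ~1MB of rows") instead of a fixed row count.
public class ByteBoundedRowBuffer<R> {
  private final long maxBytes;               // e.g. 1024 * 1024 for a 1MB default
  private final List<R> rows = new ArrayList<R>();
  private long bufferedBytes = 0;

  public ByteBoundedRowBuffer(long maxBytes) {
    this.maxBytes = maxBytes;
  }

  // Add one row with its estimated size; returns true once the buffer is
  // "full" and should be shipped to the caller as a single batch.
  public boolean add(R row, long estimatedBytes) {
    rows.add(row);
    bufferedBytes += estimatedBytes;
    return bufferedBytes >= maxBytes;
  }

  // Hand the buffered rows back and reset for the next fetch.
  public List<R> drain() {
    List<R> batch = new ArrayList<R>(rows);
    rows.clear();
    bufferedBytes = 0;
    return batch;
  }
}

With a 1MB budget, 100-byte rows would come back thousands at a time while 500KB rows would come back two or three at a time, which is the kind of behaviour a single fixed row count can't give every table.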
Dave

On Fri, Nov 20, 2009 at 4:36 PM, Ryan Rawson <[email protected]> wrote:
> The problem with this setting is there is no good 'one size fits all'
> value. If we set it to 1, we do an RPC for every row, clearly not
> efficient for small rows. If we set it to something as seemingly
> innocuous as 5 or 10, then map reduces which do a significant amount
> of processing on a row can cause the scanner to time out. The client
> code will also give up if it's been more than 60 seconds since the
> scanner was last used; it's possible this code might need to be
> adjusted so we can resume scanning.
>
> I personally set it to anywhere between 1000-5000 for high performance
> jobs on small rows.
>
> The only factor is "can you process the cached chunk of rows in <
> 60s". Set the value as large as possible to not violate this and
> you'll achieve max perf.
>
> -ryan
>
> On Fri, Nov 20, 2009 at 4:20 PM, Dave Latham <[email protected]> wrote:
> > Thanks for your thoughts. It's true you can configure the scan buffer
> > rows on an HTable or Scan instance, but I think there's something to
> > be said to try to work as well as we can out of the box.
> >
> > It would be more complication, but not by much. To track the idea and
> > see what it would look like, I made an issue and attached a proposed
> > patch.
> >
> > Dave
> >
> > On Fri, Nov 20, 2009 at 1:55 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >> And on the Scan as I wrote in my answer, which is really really
> >> convenient.
> >>
> >> Not convinced on using bytes as a value for caching... It would
> >> also be more complicated.
> >>
> >> J-D
> >>
> >> On Fri, Nov 20, 2009 at 1:45 PM, Ryan Rawson <[email protected]> wrote:
> >> > You can set it on a per-HTable basis: HTable.setScannerCaching(int);
> >> >
> >> > On Fri, Nov 20, 2009 at 1:43 PM, Dave Latham <[email protected]> wrote:
> >> >> I have some tables with large rows and some tables with very small
> >> >> rows, so I keep my default scanner caching at 1 row, but have to
> >> >> remember to set it higher when scanning tables with smaller rows.
> >> >> It would be nice to have a default that did something reasonable
> >> >> across tables.
> >> >>
> >> >> Would it make sense to set scanner caching as a count of bytes
> >> >> rather than a count of rows? That would make it similar to the
> >> >> write buffer for batches of puts that get flushed based on size
> >> >> rather than a fixed number of Puts. Then there could be some
> >> >> default value which should provide decent performance out of the
> >> >> box.
> >> >>
> >> >> Dave
> >> >>
> >> >> On Fri, Nov 20, 2009 at 12:35 PM, Gary Helmling <[email protected]> wrote:
> >> >>> To set this per scan you should be able to do:
> >> >>>
> >> >>> Scan s = new Scan()
> >> >>> s.setCaching(...)
> >> >>>
> >> >>> (I think this works anyway)
> >> >>>
> >> >>> The other thing that I've found useful is using a PageFilter on scans:
> >> >>> http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/filter/PageFilter.html
> >> >>>
> >> >>> I believe this is applied independently on each region server (?)
> >> >>> so you still need to do your own counting in iterating the
> >> >>> results, but it can be used to early out on the server side
> >> >>> separately from the scanner caching value.
> >> >>>
> >> >>> --gh
> >> >>>
> >> >>> On Fri, Nov 20, 2009 at 3:04 PM, stack <[email protected]> wrote:
> >> >>> > There is this in the configuration:
> >> >>> >
> >> >>> > <property>
> >> >>> >   <name>hbase.client.scanner.caching</name>
> >> >>> >   <value>1</value>
> >> >>> >   <description>Number of rows that will be fetched when calling next
> >> >>> >   on a scanner if it is not served from memory. Higher caching values
> >> >>> >   will enable faster scanners but will eat up more memory and some
> >> >>> >   calls of next may take longer and longer times when the cache is
> >> >>> >   empty.
> >> >>> >   </description>
> >> >>> > </property>
> >> >>> >
> >> >>> > Being able to do it per Scan sounds like something we should add.
> >> >>> >
> >> >>> > St.Ack
> >> >>> >
> >> >>> > On Fri, Nov 20, 2009 at 11:43 AM, Adam Silberstein <[email protected]> wrote:
> >> >>> > > Hi,
> >> >>> > > Is there a way to specify a limit on the number of returned
> >> >>> > > records for scan? I don't see any way to do this when building
> >> >>> > > the scan. If there is, that would be great. If not, what about
> >> >>> > > when iterating over the result? If I exit the loop when I reach
> >> >>> > > my limit, will that approximate this clause? I guess my real
> >> >>> > > question is about how scan is implemented in the client. I.e.
> >> >>> > > how many records are returned from HBase at a time as I iterate
> >> >>> > > through the scan result? If I want 1,000 records and 100 get
> >> >>> > > returned at a time, then I'm in good shape. On the other hand,
> >> >>> > > if I want 10 records and get 100 at a time, it's a bit wasteful,
> >> >>> > > though the waste is bounded.
> >> >>> > >
> >> >>> > > Thanks,
> >> >>> > > Adam
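For reference, here is roughly what the knobs mentioned above look like in client code. This is a sketch against the 0.20-era API as cited in the thread (HTable.setScannerCaching, Scan.setCaching, hbase.client.scanner.caching); the table name and column family are placeholders, so treat it as illustrative rather than definitive.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScannerCachingExample {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    // The cluster-wide default comes from hbase.client.scanner.caching
    // in hbase-site.xml (1 row unless you change it).

    HTable table = new HTable(conf, "mytable");  // "mytable" is a placeholder
    table.setScannerCaching(1000);               // per-HTable, as Ryan suggested

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("cf"));         // placeholder column family
    scan.setCaching(1000);                       // per-Scan, as Gary suggested

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // Process the row here. As Ryan notes, each cached batch has to be
        // consumed within the ~60 second scanner timeout or the scanner
        // may expire, so size the caching value to your per-row work.
      }
    } finally {
      scanner.close();
    }
  }
}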
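And a sketch of Gary's PageFilter suggestion combined with the client-side counting Adam asked about. Again 0.20-era API with placeholder names; since PageFilter is applied independently on each region server, the client still has to enforce the final limit itself.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;

public class ScanLimitExample {
  public static void main(String[] args) throws Exception {
    final int limit = 10;

    Scan scan = new Scan();
    scan.setFilter(new PageFilter(limit));  // early-out on each region server
    scan.setCaching(limit);                 // fetch no more than the limit per RPC

    HTable table = new HTable(new HBaseConfiguration(), "mytable");  // placeholder
    ResultScanner scanner = table.getScanner(scan);
    try {
      int seen = 0;
      for (Result row : scanner) {
        // PageFilter is per-region, so rows from several regions can still
        // add up past the limit -- count and break on the client as well.
        if (++seen > limit) {
          break;
        }
        // process row ...
      }
    } finally {
      scanner.close();
    }
  }
}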
