rmdmattingly opened a new pull request, #5713:
URL: https://github.com/apache/hbase/pull/5713

   ## Background
   
   https://issues.apache.org/jira/browse/HBASE-28385
   
   Let's say you're running a table scan with a throttle of 100MB/sec per 
RegionServer. Ideally your scans are going to pull down large results, often 
containing hundreds or thousands of blocks.
   
   Each scan is estimated as costing a single block of read capacity. If your 
quota is already exhausted, the server computes the backoff required for that 
estimated consumption (1 block) to become available. This will often be ~1ms, 
so your retries are effectively immediate.
   
   It will routinely take much longer than 1ms for 100MB of IO to become 
available in the given configuration, so your retries are destined to fail. 
At worst this can saturate your server's RPC layer, and at best it erroneously 
exhausts the client's retries.
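
To make the mismatch concrete, here is a rough sketch of the arithmetic. It assumes a 64KB block and a quota that refills linearly from empty; `BackoffSketch` and `waitIntervalMs` are hypothetical names for illustration, not the quota code in HBase.

```java
final class BackoffSketch {
    // Hypothetical helper: milliseconds until `neededBytes` of quota has
    // refilled at `bytesPerSecond`, assuming the quota starts out empty.
    static long waitIntervalMs(long neededBytes, long bytesPerSecond) {
        return (long) Math.ceil(neededBytes * 1000.0 / bytesPerSecond);
    }

    public static void main(String[] args) {
        long limit = 100L * 1024 * 1024;     // 100MB/sec throttle
        long oneBlock = 64L * 1024;          // assumed 64KB block
        long actualScan = 100L * 1024 * 1024; // what a large scan may really read

        // Backoff computed from the 1-block estimate.
        System.out.println(waitIntervalMs(oneBlock, limit) + " ms");
        // Backoff that would actually be needed for the full scan.
        System.out.println(waitIntervalMs(actualScan, limit) + " ms");
    }
}
```

Under these assumptions the estimated backoff and the needed backoff differ by roughly three orders of magnitude, which is why the retries arrive far too early.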
   
   ## Proposal
   
   This makes two major changes:
   1. It introduces a minimum waitInterval of 100ms. Throttling with a 
near-zero backoff is not useful. If we’ve reached the point of quota saturation, 
the backoff we return should never fall below this minimum.
   2. It introduces a more complex estimation of scan workloads. Specifically:
       * We continue to estimate initial scan calls very optimistically at, or 
near, 1 block of IO.
       * We begin tracking the max block bytes scanned (`maxBlockBytesScanned`) 
by a single `Scanner#next` call for each scanner in the `RegionScannerHolder`.
       * Keep in mind that there is already a server-configured maxResultSize 
for scanners, and a call sequence number which increments with each 
`Scanner#next` call, beginning with 0.
       * With all of these inputs, we estimate scan workload to be 
`Math.min(maxResultSize, nextSeqNum*maxBlockBytesScanned)`
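
The two changes above could be sketched roughly as follows. This is an illustration of the formula, not the code in this PR: `ScanEstimateSketch`, `estimateScanBytes`, `clampWaitInterval`, and the `blockSize` parameter are hypothetical, and falling back to one block when the product is zero is an assumption drawn from the "at, or near, 1 block" wording.

```java
final class ScanEstimateSketch {
    // Minimum backoff from change 1 (100ms).
    static final long MIN_WAIT_INTERVAL_MS = 100L;

    // Estimate the read cost of the next Scanner#next call.
    // Assumption: when there is no history yet (nextSeqNum == 0 or nothing
    // scanned so far), fall back to a single block.
    static long estimateScanBytes(long maxResultSize, long nextSeqNum,
                                  long maxBlockBytesScanned, long blockSize) {
        long observed = nextSeqNum * maxBlockBytesScanned;
        if (observed <= 0) {
            return blockSize;
        }
        return Math.min(maxResultSize, observed);
    }

    // Never return a backoff below the minimum waitInterval.
    static long clampWaitInterval(long computedMs) {
        return Math.max(MIN_WAIT_INTERVAL_MS, computedMs);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long maxResultSize = 100 * mb;
        long blockSize = 64L * 1024; // assumed block size

        System.out.println(estimateScanBytes(maxResultSize, 0, 0, blockSize));       // first call
        System.out.println(estimateScanBytes(maxResultSize, 3, 4 * mb, blockSize));  // growing
        System.out.println(estimateScanBytes(maxResultSize, 50, 4 * mb, blockSize)); // capped
        System.out.println(clampWaitInterval(1));
    }
}
```

The estimate grows with each `Scanner#next` call for scanners that have proven to be expensive, and maxResultSize caps it so it never exceeds what a single call can return.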
   
   ## Open questions
   - [ ] Should the new minimum waitInterval be configurable?
   
   ## Testing
   
   I've deployed this to a test cluster and confirmed that large scans 
quickly produced useful estimates and, consequently, meaningful waitInterval 
backoffs. I can also try to write a unit test which confirms this behavior.
   
   cc @hgromer @eab148 @bozzkar


