Hi Duo, just getting back to this. Thanks for your response. Actually, I'm pretty sure there is a simple retry for all scanner next calls. In the master branch this happens in AsyncScanSingleRegionRpcRetryingCaller#call(), which is called from #next(). The stub.scan() call in call() passes a callback, onComplete, which handles errors by calling onError. At the end of onError, a retry is scheduled which calls call() again. See https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncScanSingleRegionRpcRetryingCaller.java#L584. Let me know if I'm missing something. There is similar logic in the branch-2 blocking client.
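
To make the shape of that retry loop concrete, here is a minimal, self-contained sketch of the call() -> onComplete -> onError -> call() cycle. It is not the HBase code: the class name RetryingNextCaller, the Supplier standing in for stub.scan(), and the maxAttempts/pauseMs parameters are all hypothetical, and the real caller does considerably more (pause backoff, scanner state checks, region relocation).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for the retry pattern; NOT the actual HBase class. */
final class RetryingNextCaller<T> {
  private final ScheduledExecutorService timer;
  // Stands in for stub.scan(...): each invocation issues one RPC attempt.
  private final Supplier<CompletableFuture<T>> rpc;
  private final int maxAttempts; // assumed knob, for illustration only
  private final long pauseMs;    // assumed fixed pause between attempts
  private final AtomicInteger attempts = new AtomicInteger();
  private final CompletableFuture<T> result = new CompletableFuture<>();

  RetryingNextCaller(ScheduledExecutorService timer,
      Supplier<CompletableFuture<T>> rpc, int maxAttempts, long pauseMs) {
    this.timer = timer;
    this.rpc = rpc;
    this.maxAttempts = maxAttempts;
    this.pauseMs = pauseMs;
  }

  /** Plays the role of next(): kicks off the first attempt. */
  CompletableFuture<T> start() {
    call();
    return result;
  }

  /** One RPC attempt; the completion callback routes failures to onError. */
  private void call() {
    attempts.incrementAndGet();
    rpc.get().whenComplete((resp, err) -> {
      if (err != null) {
        onError(err);
      } else {
        result.complete(resp);
      }
    });
  }

  /** Gives up once attempts are exhausted, otherwise schedules call() again. */
  private void onError(Throwable err) {
    if (attempts.get() >= maxAttempts) {
      result.completeExceptionally(err);
      return;
    }
    timer.schedule(this::call, pauseMs, TimeUnit.MILLISECONDS);
  }
}
```

With an RPC timeout that is too short, every attempt ends up in onError, so each client issues up to maxAttempts requests where one longer request would have done.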

But anyway, most meta calls are small scans which return their results in the openScanner call. So improperly tuned (too short) rpc timeouts can cause retries in openScanner, and probably in next() as well if applicable. I took another look and we do not have any special hbase.client.scanner.timeout.period or hbase.rpc.timeout for meta. Unless I'm missing something in the link above, I'm going to move forward with adding these in the jira.

On Tue, May 31, 2022 at 8:55 PM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

> Scan will not honor operation timeout configuration as its logic is a bit
> different compared to normal read/write operations.
>
> For scan, usually there is no simple 'retry' (except the open scanner call),
> if you hit an error, usually you need to restart the scan by making a new
> open scanner call, not retry on the scanner next call.
>
> IIRC we have a special hbase.client.scanner.timeout.period and also a
> special hbase.rpc.timeout for meta?
>
> Thanks.
>
> Bryan Beaudreault <bbeaudrea...@hubspot.com.invalid> wrote on Wed, Jun 1, 2022 at 00:47:
>
> > Hi all,
> >
> > We just had a production issue where a user-facing API service had a low
> > hbase.rpc.timeout, and this majorly contributed to a meta hotspotting
> > issue. The issue is, user requests can only be submitted once the necessary
> > RegionLocation is in the MetaCache. But in a meta hotspotting scenario it
> > may be impossible to return a RegionLocation for hbase:meta in a timely
> > manner. This will trigger the rpc timeout, which may result in a number of
> > retries. This retry storm (across many client instances) can further
> > exacerbate meta hotspotting issues.
> >
> > My thought is to decouple meta rpc timeout from user rpc timeouts, because
> > generally you would prefer to allow a longer meta request to succeed
> > because it may unblock many user requests.
> >
> > I think our current timeouts for meta scans are a bit confusing. There's
> > a hbase.client.meta.operation.timeout, but actually that does not apply to
> > meta scans. Instead they are configured via hbase.rpc.timeout
> > and hbase.client.scanner.timeout.period.
> >
> > I was considering special casing meta scans so that they are configured via
> > (new) hbase.client.meta.rpc.timeout and (existing)
> > hbase.client.meta.operation.timeout. This would be different from typical
> > scan requests, but may be more intuitive overall? Does anyone have any
> > opinions?
> >
> > See https://issues.apache.org/jira/browse/HBASE-27078
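
For reference, the decoupling proposed in the quoted message could look roughly like the sketch below. It uses a plain Map instead of the HBase Configuration class so that it stands alone. The key hbase.client.meta.rpc.timeout is the new key proposed in this thread, not something assumed to exist already. The class and method names are hypothetical, and falling back to hbase.rpc.timeout when the meta key is unset is an assumption about how the change might behave, not a description of any committed patch.

```java
import java.util.Map;

/** Hypothetical sketch of a meta-specific rpc timeout lookup; not HBase code. */
final class MetaTimeoutSketch {
  static final String RPC_TIMEOUT_KEY = "hbase.rpc.timeout";
  // Key proposed in this thread; assumed here for illustration.
  static final String META_RPC_TIMEOUT_KEY = "hbase.client.meta.rpc.timeout";
  // Assumed default (ms) used when hbase.rpc.timeout is not set.
  static final int DEFAULT_RPC_TIMEOUT_MS = 60_000;

  private MetaTimeoutSketch() {
  }

  /**
   * Returns the rpc timeout in milliseconds for a request. Requests against
   * hbase:meta use the meta-specific key when it is set, and otherwise fall
   * back to the general rpc timeout (assumed fallback behavior).
   */
  static int rpcTimeoutMs(Map<String, String> conf, boolean isMetaTable) {
    int general = parseOrDefault(conf.get(RPC_TIMEOUT_KEY), DEFAULT_RPC_TIMEOUT_MS);
    if (!isMetaTable) {
      return general;
    }
    return parseOrDefault(conf.get(META_RPC_TIMEOUT_KEY), general);
  }

  private static int parseOrDefault(String value, int fallback) {
    return value == null ? fallback : Integer.parseInt(value.trim());
  }
}
```

A user-facing service could then keep hbase.rpc.timeout low for its own requests while giving meta lookups a longer budget, which is the decoupling the original message argues for.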