Our default position should be to resist adding new configuration variables, but in this case, I think it makes sense. +1 for adding a distinct timeout setting for meta. Definitely a valid special case.
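
For illustration only, a minimal sketch of how a distinct meta timeout could be resolved on the client side. It uses a plain Map in place of the real HBase Configuration class. The key hbase.client.meta.rpc.timeout is the name proposed further down this thread and may not be what HBASE-27078 finally ships. The fallback to hbase.rpc.timeout, the 60000 ms default, and all class and method names here are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of the HBase client.
final class MetaTimeoutSketch {
    // Existing setting discussed in the thread.
    static final String RPC_TIMEOUT_KEY = "hbase.rpc.timeout";
    // Proposed in the thread; the final key name may differ.
    static final String META_RPC_TIMEOUT_KEY = "hbase.client.meta.rpc.timeout";
    // Assumed default, used only when nothing is configured.
    static final int DEFAULT_RPC_TIMEOUT_MS = 60_000;

    // Reads an integer setting, returning defaultValue when the key is absent.
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String raw = conf.get(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    // Timeout applied to ordinary user RPCs.
    static int userRpcTimeoutMs(Map<String, String> conf) {
        return getInt(conf, RPC_TIMEOUT_KEY, DEFAULT_RPC_TIMEOUT_MS);
    }

    // Timeout applied to meta RPCs: the dedicated key wins, otherwise
    // fall back to the user RPC timeout so existing deployments behave as before.
    static int metaRpcTimeoutMs(Map<String, String> conf) {
        return getInt(conf, META_RPC_TIMEOUT_KEY, userRpcTimeoutMs(conf));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(RPC_TIMEOUT_KEY, "500");               // aggressive user-facing timeout
        System.out.println(metaRpcTimeoutMs(conf));     // prints 500 (fallback)

        conf.put(META_RPC_TIMEOUT_KEY, "10000");        // allow meta lookups more time
        System.out.println(metaRpcTimeoutMs(conf));     // prints 10000
        System.out.println(userRpcTimeoutMs(conf));     // prints 500 (unchanged)
    }
}
```

With a fallback like this, a service could keep a short hbase.rpc.timeout for user requests while giving meta lookups more time, which is the decoupling the thread asks for.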

On Mon, Jun 20, 2022 at 9:09 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

> You can see the comments at the top of the method, on why we do not honor
> the rpc timeout, and also not the operation timeout.
>
> So here maybe we should introduce a special scan timeout for the meta
> table?
>
> Bryan Beaudreault <bbeaudrea...@hubspot.com.invalid> wrote on Mon, Jun 20, 2022 at 23:45:
>
> > Hi Duo, just getting back to this. Thanks for your response.
> >
> > Actually I'm pretty sure there is a simple retry for all scanner next
> > calls. In master branch this occurs
> > in AsyncScanSingleRegionRpcRetryingCaller#call(), which is called from
> > #next(). The stub.scan() call in call() passes a callback onComplete which
> > includes an error handling call of onError. In onError, a retry is
> > scheduled at the end of the method which calls call() again. See
> > https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncScanSingleRegionRpcRetryingCaller.java#L584.
> > Let me know if I'm missing something. Similar logic in branch-2 blocking
> > client.
> >
> > But anyway, most meta calls are small scans which return their results in
> > the openScanner call anyway. So improperly tuned rpc timeouts (too short)
> > can cause retries in openScanner, and probably next() as well if
> > applicable.
> >
> > I took another look and we do not have any special
> > hbase.client.scanner.timeout or hbase.rpc.timeout for meta. Unless I'm
> > missing something in the link above, I'm going to move forward adding these
> > in the jira.
> >
> > On Tue, May 31, 2022 at 8:55 PM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> >
> > > Scan will not honor operation timeout configuration as its logic is a bit
> > > different compared to normal read/write operations.
> > >
> > > For scan, usually there is no simple 'retry' (except the open scanner call),
> > > if you hit an error, usually you need to restart the scan by making a new
> > > open scanner call, not retry on the scanner next call.
> > >
> > > IIRC we have a special hbase.client.scanner.timeout.period and also a
> > > special hbase.rpc.timeout for meta?
> > >
> > > Thanks.
> > >
> > > Bryan Beaudreault <bbeaudrea...@hubspot.com.invalid> wrote on Wed, Jun 1, 2022 at 00:47:
> > >
> > > > Hi all,
> > > >
> > > > We just had a production issue where a user-facing API service had a low
> > > > hbase.rpc.timeout, and this majorly contributed to a meta hotspotting
> > > > issue. The issue is, user requests can only be submitted once the necessary
> > > > RegionLocation is in the MetaCache. But in a meta hotspotting scenario it
> > > > may be impossible to return a RegionLocation for hbase:meta in a timely
> > > > manner. This will trigger the rpc timeout, which may result in a number of
> > > > retries. This retry storm (across many client instances) can further
> > > > exacerbate meta hotspotting issues.
> > > >
> > > > My thought is to decouple meta rpc timeout from user rpc timeouts, because
> > > > generally you would prefer to allow a longer meta request to succeed
> > > > because it may unblock many user requests.
> > > >
> > > > I think our current timeouts for meta scans are a bit confusing. There's
> > > > a hbase.client.meta.operation.timeout, but actually that does not apply to
> > > > meta scans. Instead they are configured via hbase.rpc.timeout
> > > > and hbase.client.scanner.timeout.period.
> > > >
> > > > I was considering special casing meta scans so that they are configured via
> > > > (new) hbase.client.meta.rpc.timeout and (existing)
> > > > hbase.client.meta.operation.timeout. This would be different from typical
> > > > scan requests, but may be more intuitive overall? Does anyone have any
> > > > opinions?
> > > >
> > > > See https://issues.apache.org/jira/browse/HBASE-27078

--
Best regards,
Andrew

Unrest, ignorance distilled, nihilistic imbeciles -
It's what we’ve earned
Welcome, apocalypse, what’s taken you so long?
Bring us the fitting end that we’ve been counting on
   - A23, Welcome, Apocalypse