Thanks again for your input. I have a PR up for this: https://github.com/apache/hbase/pull/4557

On Mon, Jun 20, 2022 at 5:57 PM Bryan Beaudreault <[email protected]> wrote:

> Actually, it looks like hbase.rpc.timeout currently applies to the
> openScanner call (which is all that's necessary for most meta scans, since
> they are small). So I think we do also need an
> hbase.client.meta.rpc.timeout config after all.
>
> On Mon, Jun 20, 2022 at 4:17 PM Bryan Beaudreault <[email protected]>
> wrote:
>
>> Thank you both for the input. I will get a PR up for that shortly.
>>
>> Related, I also filed https://issues.apache.org/jira/browse/HBASE-27142
>> for branch-2 blocking client -- "Scanner timeout should take precedence
>> over rpc timeout". I noticed that you changed this behavior for the async
>> client a few years ago Duo, and I think it makes sense to do for the
>> blocking client. Otherwise setting a special meta scanner timeout won't
>> really take effect unless we also provide a special meta rpc timeout. Per
>> Andrew's comment (which I 100% agree with), it seems better to unify the
>> clients than to create another new config.
>>
>> On Mon, Jun 20, 2022 at 12:46 PM Andrew Purtell <[email protected]>
>> wrote:
>>
>>> Our default position should be to resist adding new configuration
>>> variables, but in this case, I think it makes sense.
>>> +1 for adding a distinct timeout setting for meta. Definitely a valid
>>> special case.
>>>
>>> On Mon, Jun 20, 2022 at 9:09 AM 张铎(Duo Zhang) <[email protected]>
>>> wrote:
>>>
>>> > You can see the comments at the top of the method, on why we do not
>>> > honor the rpc timeout, and also not the operation timeout.
>>> >
>>> > So here maybe we should introduce a special scan timeout for the meta
>>> > table?
>>> >
>>> > On Mon, Jun 20, 2022 at 11:45 PM Bryan Beaudreault
>>> > <[email protected]> wrote:
>>> >
>>> > > Hi Duo, just getting back to this. Thanks for your response.
>>> > >
>>> > > Actually I'm pretty sure there is a simple retry for all scanner
>>> > > next calls. In master branch this occurs in
>>> > > AsyncScanSingleRegionRpcRetryingCaller#call(), which is called from
>>> > > #next(). The stub.scan() call in call() passes a callback
>>> > > onComplete which includes an error handling call of onError. In
>>> > > onError, a retry is scheduled at the end of the method which calls
>>> > > call() again. See
>>> > > https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncScanSingleRegionRpcRetryingCaller.java#L584
>>> > > .
>>> > > Let me know if I'm missing something. Similar logic in branch-2
>>> > > blocking client.
>>> > >
>>> > > But anyway, most meta calls are small scans which return their
>>> > > results in the openScanner call anyway. So improperly tuned rpc
>>> > > timeouts (too short) can cause retries in openScanner, and probably
>>> > > next() as well if applicable.
>>> > >
>>> > > I took another look and we do not have any special
>>> > > hbase.client.scanner.timeout or hbase.rpc.timeout for meta. Unless
>>> > > I'm missing something in the link above, I'm going to move forward
>>> > > adding these in the jira.
>>> > >
>>> > > On Tue, May 31, 2022 at 8:55 PM 张铎(Duo Zhang)
>>> > > <[email protected]> wrote:
>>> > >
>>> > > > Scan will not honor operation timeout configuration as its logic
>>> > > > is a bit different compared to normal read/write operations.
>>> > > >
>>> > > > For scan, usually there is no simple 'retry' (except the open
>>> > > > scanner call), if you hit an error, usually you need to restart
>>> > > > the scan by making a new open scanner call, not retry on the
>>> > > > scanner next call.
>>> > > >
>>> > > > IIRC we have a special hbase.client.scanner.timeout.period and
>>> > > > also a special hbase.rpc.timeout for meta?
>>> > > >
>>> > > > Thanks.
>>> > > >
>>> > > > On Wed, Jun 1, 2022 at 12:47 AM Bryan Beaudreault
>>> > > > <[email protected]> wrote:
>>> > > >
>>> > > > > Hi all,
>>> > > > >
>>> > > > > We just had a production issue where a user-facing API service
>>> > > > > had a low hbase.rpc.timeout, and this majorly contributed to a
>>> > > > > meta hotspotting issue. The issue is, user requests can only be
>>> > > > > submitted once the necessary RegionLocation is in the
>>> > > > > MetaCache. But in a meta hotspotting scenario it may be
>>> > > > > impossible to return a RegionLocation for hbase:meta in a
>>> > > > > timely manner. This will trigger the rpc timeout, which may
>>> > > > > result in a number of retries. This retry storm (across many
>>> > > > > client instances) can further exacerbate meta hotspotting
>>> > > > > issues.
>>> > > > >
>>> > > > > My thought is to decouple meta rpc timeout from user rpc
>>> > > > > timeouts, because generally you would prefer to allow a longer
>>> > > > > meta request to succeed because it may unblock many user
>>> > > > > requests.
>>> > > > >
>>> > > > > I think our current timeouts for meta scans are a bit
>>> > > > > confusing. There's a hbase.client.meta.operation.timeout, but
>>> > > > > actually that does not apply to meta scans. Instead they are
>>> > > > > configured via hbase.rpc.timeout and
>>> > > > > hbase.client.scanner.timeout.period.
>>> > > > >
>>> > > > > I was considering special casing meta scans so that they are
>>> > > > > configured via (new) hbase.client.meta.rpc.timeout and
>>> > > > > (existing) hbase.client.meta.operation.timeout. This would be
>>> > > > > different from typical scan requests, but may be more intuitive
>>> > > > > overall? Does anyone have any opinions?
>>> > > > >
>>> > > > > See https://issues.apache.org/jira/browse/HBASE-27078
>>>
>>>
>>> --
>>> Best regards,
>>> Andrew
>>>
>>> Unrest, ignorance distilled, nihilistic imbeciles -
>>>     It's what we’ve earned
>>> Welcome, apocalypse, what’s taken you so long?
>>> Bring us the fitting end that we’ve been counting on
>>>    - A23, Welcome, Apocalypse
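
To make the decoupling discussed in this thread concrete, here is a minimal sketch of how a client could pick an rpc timeout depending on whether the request targets hbase:meta. This is not the HBase client code and not what the PR does. The class `MetaTimeoutSketch`, the helper names, the use of a plain `Map` in place of an HBase `Configuration`, and the fallback from the meta key to the user key are all assumptions made for illustration. The key `hbase.client.meta.rpc.timeout` is the one proposed above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: resolves an rpc timeout for meta vs. user requests.
public class MetaTimeoutSketch {
    static final String RPC_TIMEOUT = "hbase.rpc.timeout";
    // Proposed in this thread; treated here as an optional override.
    static final String META_RPC_TIMEOUT = "hbase.client.meta.rpc.timeout";
    // Default used by this sketch when hbase.rpc.timeout is unset.
    static final int DEFAULT_RPC_TIMEOUT_MS = 60_000;

    // Returns the rpc timeout in milliseconds for the given kind of request.
    // Assumption: when the meta key is unset, meta requests fall back to the
    // user rpc timeout, so existing deployments would see no change.
    static int rpcTimeoutMs(Map<String, String> conf, boolean isMetaTable) {
        int userTimeout = getInt(conf, RPC_TIMEOUT, DEFAULT_RPC_TIMEOUT_MS);
        if (!isMetaTable) {
            return userTimeout;
        }
        return getInt(conf, META_RPC_TIMEOUT, userTimeout);
    }

    // Reads an integer setting, returning the fallback when the key is absent.
    static int getInt(Map<String, String> conf, String key, int fallback) {
        String value = conf.get(key);
        return value == null ? fallback : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(RPC_TIMEOUT, "500");          // tight timeout for user requests
        conf.put(META_RPC_TIMEOUT, "10000");   // more generous timeout for meta
        System.out.println("user rpc timeout ms: " + rpcTimeoutMs(conf, false));
        System.out.println("meta rpc timeout ms: " + rpcTimeoutMs(conf, true));
    }
}
```

With a split like this, a latency-sensitive service could keep a tight hbase.rpc.timeout for user requests while letting a slow meta lookup finish instead of feeding a retry storm.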
