Actually, it looks like hbase.rpc.timeout currently applies to the openScanner call (which is all that's necessary for most meta scans, since they are small). So I think we do also need an hbase.client.meta.rpc.timeout config after all.
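
To make the idea concrete, here is a minimal sketch of how a separate meta rpc timeout could be resolved. It is an illustration under stated assumptions, not the HBase client implementation: it uses a plain Map in place of the real Configuration class, hbase.client.meta.rpc.timeout is only the key proposed in this thread, and both the fallback to hbase.rpc.timeout and the 60000 ms default are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

class MetaRpcTimeoutSketch {
    // Existing key discussed in this thread.
    static final String RPC_TIMEOUT_KEY = "hbase.rpc.timeout";
    // Key proposed in this thread; hypothetical until a patch actually adds it.
    static final String META_RPC_TIMEOUT_KEY = "hbase.client.meta.rpc.timeout";
    // Illustrative default in milliseconds (an assumption of this sketch).
    static final int ASSUMED_DEFAULT_RPC_TIMEOUT_MS = 60_000;

    // Reads an integer setting, returning defaultValue when the key is unset.
    static int getInt(Map<String, String> conf, String key, int defaultValue) {
        String raw = conf.get(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    // Meta scans prefer the meta-specific key and fall back to the user rpc timeout.
    // Every other call uses the user rpc timeout only.
    static int rpcTimeoutMs(Map<String, String> conf, boolean isMetaScan) {
        int userTimeout = getInt(conf, RPC_TIMEOUT_KEY, ASSUMED_DEFAULT_RPC_TIMEOUT_MS);
        return isMetaScan ? getInt(conf, META_RPC_TIMEOUT_KEY, userTimeout) : userTimeout;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(RPC_TIMEOUT_KEY, "500");        // aggressive timeout for user-facing requests
        conf.put(META_RPC_TIMEOUT_KEY, "10000"); // more patient timeout for meta lookups
        System.out.println("user rpc timeout ms: " + rpcTimeoutMs(conf, false));
        System.out.println("meta rpc timeout ms: " + rpcTimeoutMs(conf, true));
    }
}
```

With a fallback like this, clients that never set the meta key would keep their current behavior, while a service with a low hbase.rpc.timeout could still give meta scans more room.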

On Mon, Jun 20, 2022 at 4:17 PM Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

> Thank you both for the input. I will get a PR up for that shortly.
>
> Related, I also filed https://issues.apache.org/jira/browse/HBASE-27142
> for branch-2 blocking client -- "Scanner timeout should take precedence
> over rpc timeout". I noticed that you changed this behavior for the async
> client a few years ago, Duo, and I think it makes sense to do the same for
> the blocking client. Otherwise setting a special meta scanner timeout won't
> really take effect unless we also provide a special meta rpc timeout. Per
> Andrew's comment (with which I 100% agree), it seems better to unify the
> clients than to create another new config.
>
> On Mon, Jun 20, 2022 at 12:46 PM Andrew Purtell <apurt...@apache.org>
> wrote:
>
>> Our default position should be to resist adding new configuration
>> variables, but in this case, I think it makes sense.
>> +1 for adding a distinct timeout setting for meta. Definitely a valid
>> special case.
>>
>> On Mon, Jun 20, 2022 at 9:09 AM 张铎(Duo Zhang) <palomino...@gmail.com>
>> wrote:
>>
>> > You can see the comments at the top of the method, on why we do not
>> > honor the rpc timeout, and also not the operation timeout.
>> >
>> > So here maybe we should introduce a special scan timeout for the meta
>> > table?
>> >
>> > On Mon, Jun 20, 2022 at 11:45 PM Bryan Beaudreault
>> > <bbeaudrea...@hubspot.com.invalid> wrote:
>> >
>> > > Hi Duo, just getting back to this. Thanks for your response.
>> > >
>> > > Actually I'm pretty sure there is a simple retry for all scanner next
>> > > calls. In master branch this occurs in
>> > > AsyncScanSingleRegionRpcRetryingCaller#call(), which is called from
>> > > #next(). The stub.scan() call in call() passes a callback onComplete
>> > > which includes an error handling call of onError. In onError, a retry
>> > > is scheduled at the end of the method, which calls call() again. See
>> > >
>> > > https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncScanSingleRegionRpcRetryingCaller.java#L584
>> > >
>> > > Let me know if I'm missing something. Similar logic in branch-2
>> > > blocking client.
>> > >
>> > > But anyway, most meta calls are small scans which return their results
>> > > in the openScanner call anyway. So improperly tuned rpc timeouts (too
>> > > short) can cause retries in openScanner, and probably next() as well
>> > > if applicable.
>> > >
>> > > I took another look and we do not have any special
>> > > hbase.client.scanner.timeout or hbase.rpc.timeout for meta. Unless I'm
>> > > missing something in the link above, I'm going to move forward adding
>> > > these in the jira.
>> > >
>> > > On Tue, May 31, 2022 at 8:55 PM 张铎(Duo Zhang) <palomino...@gmail.com>
>> > > wrote:
>> > >
>> > > > Scan will not honor operation timeout configuration as its logic is
>> > > > a bit different compared to normal read/write operations.
>> > > >
>> > > > For scan, usually there is no simple 'retry' (except the open
>> > > > scanner call). If you hit an error, usually you need to restart the
>> > > > scan by making a new open scanner call, not retry on the scanner
>> > > > next call.
>> > > >
>> > > > IIRC we have a special hbase.client.scanner.timeout.period and also
>> > > > a special hbase.rpc.timeout for meta?
>> > > >
>> > > > Thanks.
>> > > >
>> > > > On Wed, Jun 1, 2022 at 12:47 AM Bryan Beaudreault
>> > > > <bbeaudrea...@hubspot.com.invalid> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > We just had a production issue where a user-facing API service had
>> > > > > a low hbase.rpc.timeout, and this majorly contributed to a meta
>> > > > > hotspotting issue. The issue is, user requests can only be
>> > > > > submitted once the necessary RegionLocation is in the MetaCache.
>> > > > > But in a meta hotspotting scenario it may be impossible to return
>> > > > > a RegionLocation for hbase:meta in a timely manner. This will
>> > > > > trigger the rpc timeout, which may result in a number of retries.
>> > > > > This retry storm (across many client instances) can further
>> > > > > exacerbate meta hotspotting issues.
>> > > > >
>> > > > > My thought is to decouple meta rpc timeout from user rpc timeouts,
>> > > > > because generally you would prefer to allow a longer meta request
>> > > > > to succeed because it may unblock many user requests.
>> > > > >
>> > > > > I think our current timeouts for meta scans are a bit confusing.
>> > > > > There's an hbase.client.meta.operation.timeout, but actually that
>> > > > > does not apply to meta scans. Instead they are configured via
>> > > > > hbase.rpc.timeout and hbase.client.scanner.timeout.period.
>> > > > >
>> > > > > I was considering special casing meta scans so that they are
>> > > > > configured via (new) hbase.client.meta.rpc.timeout and (existing)
>> > > > > hbase.client.meta.operation.timeout. This would be different from
>> > > > > typical scan requests, but may be more intuitive overall? Does
>> > > > > anyone have any opinions?
>> > > > >
>> > > > > See https://issues.apache.org/jira/browse/HBASE-27078
>> > > > >
>>
>> --
>> Best regards,
>> Andrew
>>
>> Unrest, ignorance distilled, nihilistic imbeciles -
>> It's what we’ve earned
>> Welcome, apocalypse, what’s taken you so long?
>> Bring us the fitting end that we’ve been counting on
>> - A23, Welcome, Apocalypse
>