You can see the comments at the top of the method, which explain why we
honor neither the rpc timeout nor the operation timeout.

So here maybe we should introduce a special scan timeout for the meta table?
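
For illustration, here is a minimal sketch of how such a setting could be
resolved. It assumes a hypothetical key named
hbase.client.meta.scanner.timeout.period that falls back to the existing
hbase.client.scanner.timeout.period when unset. The configuration is modeled
as a plain Map rather than the real HBase Configuration class, and the
class name, method names and the 60000 ms fallback are illustrative only,
not the actual client code.

```java
import java.util.HashMap;
import java.util.Map;

final class MetaTimeoutSketch {
    // Existing key mentioned in this thread.
    static final String SCANNER_TIMEOUT = "hbase.client.scanner.timeout.period";
    // Hypothetical key: the name is an assumption made for this sketch.
    static final String META_SCANNER_TIMEOUT = "hbase.client.meta.scanner.timeout.period";
    // Illustrative fallback used when neither key is set.
    static final long FALLBACK_MS = 60_000L;

    // Returns the scan timeout in milliseconds. Meta scans prefer the
    // meta-specific key and fall back to the general scanner timeout.
    static long scanTimeoutMs(Map<String, String> conf, boolean isMetaTable) {
        long general = getLong(conf, SCANNER_TIMEOUT, FALLBACK_MS);
        if (!isMetaTable) {
            return general;
        }
        return getLong(conf, META_SCANNER_TIMEOUT, general);
    }

    // Reads a long value from the map, or returns the fallback if the key is absent.
    static long getLong(Map<String, String> conf, String key, long fallback) {
        String raw = conf.get(key);
        return raw == null ? fallback : Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SCANNER_TIMEOUT, "5000");       // short timeout for user scans
        conf.put(META_SCANNER_TIMEOUT, "30000"); // longer timeout for meta scans
        System.out.println("user scan timeout ms: " + scanTimeoutMs(conf, false));
        System.out.println("meta scan timeout ms: " + scanTimeoutMs(conf, true));
    }
}
```

With a fallback like this, existing deployments would keep their current
behavior until the meta-specific key is set explicitly.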

Bryan Beaudreault <bbeaudrea...@hubspot.com.invalid> wrote on Mon, Jun 20, 2022 at 23:45:

> Hi Duo, just getting back to this. Thanks for your response.
>
> Actually, I'm pretty sure there is a simple retry for all scanner next
> calls. In the master branch this occurs in
> AsyncScanSingleRegionRpcRetryingCaller#call(), which is called from
> #next(). The stub.scan() call in call() passes an onComplete callback,
> which includes an error-handling call to onError. At the end of onError,
> a retry is scheduled, which calls call() again. See
>
> https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncScanSingleRegionRpcRetryingCaller.java#L584
>
> Let me know if I'm missing something. The branch-2 blocking client has
> similar logic.
>
> But anyway, most meta calls are small scans that return their results in
> the openScanner call itself. So improperly tuned (too short) rpc timeouts
> can cause retries in openScanner, and probably in next() as well, if
> applicable.
>
> I took another look and we do not have any special
> hbase.client.scanner.timeout or hbase.rpc.timeout for meta. Unless I'm
> missing something in the link above, I'm going to move forward adding these
> in the jira.
>
> On Tue, May 31, 2022 at 8:55 PM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>
> > Scan will not honor operation timeout configuration as its logic is a bit
> > different compared to normal read/write operations.
> >
> > For scan, there is usually no simple 'retry' (except for the open scanner
> > call). If you hit an error, you usually need to restart the scan by making
> > a new open scanner call, rather than retrying the scanner next call.
> >
> > IIRC we have a special hbase.client.scanner.timeout.period and also a
> > special hbase.rpc.timeout for meta?
> >
> > Thanks.
> >
> > Bryan Beaudreault <bbeaudrea...@hubspot.com.invalid> wrote on Wed, Jun 1, 2022 at 00:47:
> >
> > > Hi all,
> > >
> > > We just had a production issue where a user-facing API service had a
> > > low hbase.rpc.timeout, and this contributed significantly to a meta
> > > hotspotting issue. The issue is that user requests can only be submitted
> > > once the necessary RegionLocation is in the MetaCache. But in a meta
> > > hotspotting scenario it may be impossible to return a RegionLocation for
> > > hbase:meta in a timely manner. This will trigger the rpc timeout, which
> > > may result in a number of retries. This retry storm (across many client
> > > instances) can further exacerbate meta hotspotting issues.
> > >
> > > My thought is to decouple meta rpc timeout from user rpc timeouts,
> > > because generally you would prefer to allow a longer meta request to
> > > succeed because it may unblock many user requests.
> > >
> > > I think our current timeouts for meta scans are a bit confusing. There's
> > > a hbase.client.meta.operation.timeout, but actually that does not apply
> > > to meta scans. Instead they are configured via hbase.rpc.timeout
> > > and hbase.client.scanner.timeout.period.
> > >
> > > I was considering special casing meta scans so that they are configured
> > > via (new) hbase.client.meta.rpc.timeout and (existing)
> > > hbase.client.meta.operation.timeout. This would be different from
> > > typical scan requests, but may be more intuitive overall? Does anyone
> > > have any opinions?
> > >
> > > See https://issues.apache.org/jira/browse/HBASE-27078
> > >
> >
>
