Re: Separately configurable client meta rpc timeout

Bryan Beaudreault Mon, 20 Jun 2022 13:17:43 -0700

Thank you both for the input. I will get a PR up for that shortly.

Related, I also filed https://issues.apache.org/jira/browse/HBASE-27142 for
the branch-2 blocking client -- "Scanner timeout should take precedence over
rpc timeout". I noticed that you changed this behavior for the async client
a few years ago, Duo, and I think it makes sense to do the same for the
blocking client. Otherwise setting a special meta scanner timeout won't
really take effect unless we also provide a special meta rpc timeout. Per
Andrew's comment (with which I 100% agree), it seems better to unify the
clients than to create another new config.
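
To make the precedence question concrete, here is a minimal sketch in plain
Java. It is a simplified model, not the actual client code: the class and
helper names are hypothetical, the 60000 ms fallbacks are assumed defaults,
and modelling today's blocking behavior as min(rpc, scanner) is an
assumption based on the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of how long a single scan RPC may run.
// Not HBase code; it only illustrates the two precedence rules.
final class MetaScanTimeoutSketch {
    static final String RPC_TIMEOUT = "hbase.rpc.timeout";
    static final String SCANNER_TIMEOUT = "hbase.client.scanner.timeout.period";
    static final int ASSUMED_DEFAULT_MS = 60000; // assumption, see lead-in

    // scannerTimeoutTakesPrecedence = false: each call is capped by the
    // smaller of the two timeouts (assumed model of the blocking client).
    // scannerTimeoutTakesPrecedence = true: the scanner timeout alone
    // bounds the call (the behavior proposed in HBASE-27142).
    static int perCallTimeoutMs(Map<String, Integer> conf,
                                boolean scannerTimeoutTakesPrecedence) {
        int rpc = conf.getOrDefault(RPC_TIMEOUT, ASSUMED_DEFAULT_MS);
        int scanner = conf.getOrDefault(SCANNER_TIMEOUT, ASSUMED_DEFAULT_MS);
        return scannerTimeoutTakesPrecedence ? scanner : Math.min(rpc, scanner);
    }

    // Example configuration: a user-facing service with a low rpc timeout
    // and a longer scanner timeout.
    static Map<String, Integer> exampleConf() {
        Map<String, Integer> conf = new HashMap<>();
        conf.put(RPC_TIMEOUT, 2000);
        conf.put(SCANNER_TIMEOUT, 30000);
        return conf;
    }
}
```

With the example configuration, the first rule leaves a meta scan call capped
by the 2000 ms rpc timeout no matter how large the scanner timeout is, which
is why a meta-specific scanner timeout alone would not help there.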


On Mon, Jun 20, 2022 at 12:46 PM Andrew Purtell <[email protected]> wrote:

> Our default position should be to resist adding new configuration
> variables, but in this case, I think it makes sense.
> +1 for adding a distinct timeout setting for meta. Definitely a valid
> special case.
>
> On Mon, Jun 20, 2022 at 9:09 AM 张铎(Duo Zhang) <[email protected]>
> wrote:
>
> > You can see the comments at the top of the method, which explain why we
> > honor neither the rpc timeout nor the operation timeout.
> >
> > So here maybe we should introduce a special scan timeout for the meta
> > table?
> >
> > On Mon, Jun 20, 2022 at 11:45 PM Bryan Beaudreault <[email protected]>
> > wrote:
> >
> > > Hi Duo, just getting back to this. Thanks for your response.
> > >
> > > Actually I'm pretty sure there is a simple retry for all scanner next
> > > calls. In master branch this occurs in
> > > AsyncScanSingleRegionRpcRetryingCaller#call(), which is called from
> > > #next(). The stub.scan() call in call() passes a callback onComplete
> > > which includes an error handling call of onError. In onError, a retry is
> > > scheduled at the end of the method which calls call() again. See
> > > https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncScanSingleRegionRpcRetryingCaller.java#L584
> > >
> > > Let me know if I'm missing something. Similar logic in branch-2 blocking
> > > client.
> > >
> > > But anyway, most meta calls are small scans which return their results
> > > in the openScanner call. So improperly tuned rpc timeouts (too short)
> > > can cause retries in openScanner, and probably next() as well if
> > > applicable.
> > >
> > > I took another look and we do not have any special
> > > hbase.client.scanner.timeout or hbase.rpc.timeout for meta. Unless I'm
> > > missing something in the link above, I'm going to move forward adding
> > > these in the jira.
> > >
> > > On Tue, May 31, 2022 at 8:55 PM 张铎(Duo Zhang) <[email protected]>
> > > wrote:
> > >
> > > > Scan will not honor operation timeout configuration as its logic is a
> > > > bit different compared to normal read/write operations.
> > > >
> > > > For scan, usually there is no simple 'retry' (except the open scanner
> > > > call); if you hit an error, you usually need to restart the scan by
> > > > making a new open scanner call, not retry on the scanner next call.
> > > >
> > > > IIRC we have a special hbase.client.scanner.timeout.period and also a
> > > > special hbase.rpc.timeout for meta?
> > > >
> > > > Thanks.
> > > >
> > > > On Wed, Jun 1, 2022 at 12:47 AM Bryan Beaudreault <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > > We just had a production issue where a user-facing API service had
> > > > > > a low hbase.rpc.timeout, and this majorly contributed to a meta
> > > > > > hotspotting issue. The issue is, user requests can only be submitted
> > > > > > once the necessary RegionLocation is in the MetaCache. But in a meta
> > > > > > hotspotting scenario it may be impossible to return a RegionLocation
> > > > > > for hbase:meta in a timely manner. This will trigger the rpc timeout,
> > > > > > which may result in a number of retries. This retry storm (across
> > > > > > many client instances) can further exacerbate meta hotspotting issues.
> > > > >
> > > > > > My thought is to decouple meta rpc timeout from user rpc timeouts:
> > > > > > generally you would prefer to allow a longer meta request to succeed,
> > > > > > because it may unblock many user requests.
> > > > >
> > > > > > I think our current timeouts for meta scans are a bit confusing.
> > > > > > There's a hbase.client.meta.operation.timeout, but actually that does
> > > > > > not apply to meta scans. Instead they are configured via
> > > > > > hbase.rpc.timeout and hbase.client.scanner.timeout.period.
> > > > >
> > > > > > I was considering special casing meta scans so that they are
> > > > > > configured via (new) hbase.client.meta.rpc.timeout and (existing)
> > > > > > hbase.client.meta.operation.timeout. This would be different from
> > > > > > typical scan requests, but may be more intuitive overall? Does anyone
> > > > > > have any opinions?
> > > > >
> > > > > See https://issues.apache.org/jira/browse/HBASE-27078
> > > > >
> > > >
> > >
> >
>
>
> --
> Best regards,
> Andrew
>
> Unrest, ignorance distilled, nihilistic imbeciles -
> It's what we’ve earned
> Welcome, apocalypse, what’s taken you so long?
> Bring us the fitting end that we’ve been counting on
> - A23, Welcome, Apocalypse
>
