Re: Separately configurable client meta rpc timeout

Bryan Beaudreault Fri, 24 Jun 2022 08:50:03 -0700

Thanks again for your input on this. I have a PR for this here:
https://github.com/apache/hbase/pull/4557
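
For readers skimming the thread, the idea can be sketched roughly as follows. This is only an illustration of the proposal, not the code in the PR: it uses plain java.util.Properties instead of the HBase Configuration class, the class and method names are hypothetical, and both the fallback from hbase.client.meta.rpc.timeout to hbase.rpc.timeout and the 60000 ms default are assumptions made for the example.

```java
import java.util.Properties;

// Hypothetical sketch: pick a separate rpc timeout for hbase:meta requests,
// falling back to the general rpc timeout when no meta-specific value is set.
public class MetaTimeoutSketch {
    // Existing key discussed in the thread.
    public static final String RPC_TIMEOUT_KEY = "hbase.rpc.timeout";
    // New key proposed in the thread.
    public static final String META_RPC_TIMEOUT_KEY = "hbase.client.meta.rpc.timeout";
    // Assumed default for this sketch, in milliseconds.
    public static final int DEFAULT_RPC_TIMEOUT_MS = 60_000;

    // General rpc timeout used for user-table requests.
    public static int rpcTimeoutMs(Properties conf) {
        String value = conf.getProperty(RPC_TIMEOUT_KEY);
        return value != null ? Integer.parseInt(value) : DEFAULT_RPC_TIMEOUT_MS;
    }

    // Meta rpc timeout; the fallback to the general timeout is an assumption.
    public static int metaRpcTimeoutMs(Properties conf) {
        String value = conf.getProperty(META_RPC_TIMEOUT_KEY);
        return value != null ? Integer.parseInt(value) : rpcTimeoutMs(conf);
    }

    // Choose the timeout based on which table the request targets.
    public static int timeoutForTable(Properties conf, String tableName) {
        return "hbase:meta".equals(tableName) ? metaRpcTimeoutMs(conf) : rpcTimeoutMs(conf);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(RPC_TIMEOUT_KEY, "500");        // tight timeout for user requests
        conf.setProperty(META_RPC_TIMEOUT_KEY, "10000"); // more patient with meta lookups
        System.out.println("user table: " + timeoutForTable(conf, "mytable") + " ms");
        System.out.println("hbase:meta: " + timeoutForTable(conf, "hbase:meta") + " ms");
    }
}
```

With a split like this, a user-facing service could keep a short timeout for its own requests while giving meta lookups more time, which is the decoupling described in the original message below.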


On Mon, Jun 20, 2022 at 5:57 PM Bryan Beaudreault <[email protected]>
wrote:

> Actually, it looks like hbase.rpc.timeout currently applies to the
> openScanner call (which is all that's necessary for most meta scans, since
> they are small). So I think we do also need an
> hbase.client.meta.rpc.timeout config after all.
>
> On Mon, Jun 20, 2022 at 4:17 PM Bryan Beaudreault <
> [email protected]> wrote:
>
>> Thank you both for the input. I will get a PR up for that shortly.
>>
>> Related, I also filed https://issues.apache.org/jira/browse/HBASE-27142
>> for branch-2 blocking client -- "Scanner timeout should take precedence
>> over rpc timeout". I noticed that you changed this behavior for the async
>> client a few years ago, Duo, and I think it makes sense to do the same for
>> the blocking client. Otherwise, setting a special meta scanner timeout won't
>> really take effect unless we also provide a special meta rpc timeout. Per
>> Andrew's comment (with which I 100% agree), it seems better to unify the
>> clients than to create another new config.
>>
>> On Mon, Jun 20, 2022 at 12:46 PM Andrew Purtell <[email protected]>
>> wrote:
>>
>>> Our default position should be to resist adding new configuration
>>> variables, but in this case, I think it makes sense.
>>> +1 for adding a distinct timeout setting for meta. Definitely a valid
>>> special case.
>>>
>>> On Mon, Jun 20, 2022 at 9:09 AM 张铎(Duo Zhang) <[email protected]>
>>> wrote:
>>>
>>> > You can see the comments at the top of the method, on why we do not
>>> > honor the rpc timeout, and also not the operation timeout.
>>> >
>>> > So here maybe we should introduce a special scan timeout for the meta
>>> > table?
>>> >
>>> > Bryan Beaudreault <[email protected]> wrote on Mon, Jun 20, 2022 at 23:45:
>>> >
>>> > > Hi Duo, just getting back to this. Thanks for your response.
>>> > >
>>> > > Actually I'm pretty sure there is a simple retry for all scanner next
>>> > > calls. In master branch this occurs
>>> > > in AsyncScanSingleRegionRpcRetryingCaller#call(), which is called from
>>> > > #next(). The stub.scan() call in call() passes a callback onComplete
>>> > > which includes an error handling call of onError. In onError, a retry is
>>> > > scheduled at the end of the method which calls call() again. See
>>> > > https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/AsyncScanSingleRegionRpcRetryingCaller.java#L584
>>> > > Let me know if I'm missing something. Similar logic in branch-2 blocking
>>> > > client.
>>> > >
>>> > > But anyway, most meta calls are small scans which return their results
>>> > > in the openScanner call anyway. So improperly tuned rpc timeouts (too
>>> > > short) can cause retries in openScanner, and probably next() as well if
>>> > > applicable.
>>> > >
>>> > > I took another look and we do not have any special
>>> > > hbase.client.scanner.timeout.period or hbase.rpc.timeout for meta.
>>> > > Unless I'm missing something in the link above, I'm going to move
>>> > > forward adding these in the jira.
>>> > >
>>> > > On Tue, May 31, 2022 at 8:55 PM 张铎(Duo Zhang) <[email protected]>
>>> > > wrote:
>>> > >
>>> > > > Scan will not honor operation timeout configuration as its logic is a
>>> > > > bit different compared to normal read/write operations.
>>> > > >
>>> > > > For scan, usually there is no simple 'retry' (except the open scanner
>>> > > > call), if you hit an error, usually you need to restart the scan by
>>> > > > making a new open scanner call, not retry on the scanner next call.
>>> > > >
>>> > > > IIRC we have a special hbase.client.scanner.timeout.period and also a
>>> > > > special hbase.rpc.timeout for meta?
>>> > > >
>>> > > > Thanks.
>>> > > >
>>> > > > Bryan Beaudreault <[email protected]> wrote on Wed, Jun 1, 2022 at 00:47:
>>> > > >
>>> > > > > Hi all,
>>> > > > >
>>> > > > > We just had a production issue where a user-facing API service had a
>>> > > > > low hbase.rpc.timeout, and this majorly contributed to a meta
>>> > > > > hotspotting issue. The issue is, user requests can only be submitted
>>> > > > > once the necessary RegionLocation is in the MetaCache. But in a meta
>>> > > > > hotspotting scenario it may be impossible to return a RegionLocation
>>> > > > > for hbase:meta in a timely manner. This will trigger the rpc timeout,
>>> > > > > which may result in a number of retries. This retry storm (across
>>> > > > > many client instances) can further exacerbate meta hotspotting issues.
>>> > > > >
>>> > > > > My thought is to decouple meta rpc timeout from user rpc timeouts,
>>> > > > > because generally you would prefer to allow a longer meta request to
>>> > > > > succeed because it may unblock many user requests.
>>> > > > >
>>> > > > > I think our current timeouts for meta scans are a bit confusing.
>>> > > > > There's a hbase.client.meta.operation.timeout, but actually that does
>>> > > > > not apply to meta scans. Instead they are configured via
>>> > > > > hbase.rpc.timeout and hbase.client.scanner.timeout.period.
>>> > > > >
>>> > > > > I was considering special casing meta scans so that they are
>>> > > > > configured via (new) hbase.client.meta.rpc.timeout and (existing)
>>> > > > > hbase.client.meta.operation.timeout. This would be different from
>>> > > > > typical scan requests, but may be more intuitive overall? Does anyone
>>> > > > > have any opinions?
>>> > > > >
>>> > > > > See https://issues.apache.org/jira/browse/HBASE-27078
>>> >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>>
>>> --
>>> Best regards,
>>> Andrew
>>>
>>> Unrest, ignorance distilled, nihilistic imbeciles -
>>> It's what we’ve earned
>>> Welcome, apocalypse, what’s taken you so long?
>>> Bring us the fitting end that we’ve been counting on
>>> - A23, Welcome, Apocalypse
>>>
>>
