Hi all,

We just had a production issue where a user-facing API service had a low
hbase.rpc.timeout, and this contributed significantly to a meta hotspotting
issue. The problem is that user requests can only be submitted once the
necessary RegionLocation is in the MetaCache, but in a meta hotspotting
scenario it may be impossible to fetch a RegionLocation from hbase:meta in
a timely manner. The lookup then hits the rpc timeout, which may result in
a number of retries. This retry storm (across many client instances) can
further exacerbate the meta hotspotting.

My thought is to decouple the meta rpc timeout from user rpc timeouts:
generally you would prefer to let a longer meta request succeed, because
it may unblock many user requests.

I think our current timeouts for meta scans are a bit confusing. There's
a hbase.client.meta.operation.timeout, but actually that does not apply to
meta scans. Instead they are configured via hbase.rpc.timeout
and hbase.client.scanner.timeout.period.
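
To make the current behaviour concrete, here is a minimal sketch of the
lookup as I understand it. It is not HBase code: the class and method names
are hypothetical, and a plain Map stands in for the client configuration.
Only the three property names come from the real client.

```java
import java.util.Map;

// Illustration only: hypothetical class, not part of HBase.
class CurrentMetaScanTimeouts {
    static final String RPC_TIMEOUT = "hbase.rpc.timeout";
    static final String SCANNER_TIMEOUT = "hbase.client.scanner.timeout.period";
    // Listed for contrast: per the description above, meta scans never read it.
    static final String META_OPERATION_TIMEOUT = "hbase.client.meta.operation.timeout";

    // Per-RPC limit applied to a meta scan today: the user-facing rpc timeout.
    static int metaScanRpcTimeoutMs(Map<String, String> conf, int fallbackMs) {
        return getInt(conf, RPC_TIMEOUT, fallbackMs);
    }

    // Scanner timeout applied to a meta scan today: the generic scanner setting.
    static int metaScanScannerTimeoutMs(Map<String, String> conf, int fallbackMs) {
        return getInt(conf, SCANNER_TIMEOUT, fallbackMs);
    }

    // Reads an integer setting, returning fallbackMs when the key is absent.
    static int getInt(Map<String, String> conf, String key, int fallbackMs) {
        String value = conf.get(key);
        return value == null ? fallbackMs : Integer.parseInt(value.trim());
    }
}
```

With hbase.rpc.timeout=500 and hbase.client.meta.operation.timeout=60000,
metaScanRpcTimeoutMs returns 500: raising the meta operation timeout does
nothing for the scan, which is the confusing part.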

I was considering special casing meta scans so that they are configured via
(new) hbase.client.meta.rpc.timeout and (existing)
hbase.client.meta.operation.timeout. This would be different from typical
scan requests, but may be more intuitive overall? Does anyone have any
opinions?
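
A rough sketch of how the proposed lookup could work, to give people
something concrete to react to. Everything here is hypothetical: the class
and method names are made up, a plain Map stands in for the client
configuration, and falling back to hbase.rpc.timeout when the new key is
unset is my assumption (so existing deployments keep their behaviour), not
a settled design.

```java
import java.util.Map;

// Illustration only: hypothetical class, not part of HBase.
class ProposedMetaScanTimeouts {
    // Proposed new key; it does not exist in the client today.
    static final String META_RPC_TIMEOUT = "hbase.client.meta.rpc.timeout";
    // Existing key that meta scans would start to honour.
    static final String META_OPERATION_TIMEOUT = "hbase.client.meta.operation.timeout";
    static final String RPC_TIMEOUT = "hbase.rpc.timeout";

    // Per-RPC limit for meta scans: the new key if set, otherwise
    // hbase.rpc.timeout (assumed fallback), otherwise the supplied default.
    static int metaRpcTimeoutMs(Map<String, String> conf, int defaultMs) {
        String value = conf.get(META_RPC_TIMEOUT);
        if (value == null) {
            value = conf.get(RPC_TIMEOUT);
        }
        return value == null ? defaultMs : Integer.parseInt(value.trim());
    }

    // Overall limit for a meta scan, including retries.
    static int metaOperationTimeoutMs(Map<String, String> conf, int defaultMs) {
        String value = conf.get(META_OPERATION_TIMEOUT);
        return value == null ? defaultMs : Integer.parseInt(value.trim());
    }
}
```

A latency-sensitive service could then keep hbase.rpc.timeout=500 for user
requests while setting hbase.client.meta.rpc.timeout=5000, so a slow meta
lookup is allowed to finish instead of being retried.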

See https://issues.apache.org/jira/browse/HBASE-27078
