Hi all,

We just had a production issue where a user-facing API service was configured with a low hbase.rpc.timeout, and this contributed significantly to a meta hotspotting issue. The problem is that a user request can only be submitted once the necessary RegionLocation is in the MetaCache. In a meta hotspotting scenario, however, it may be impossible to fetch a RegionLocation from hbase:meta in a timely manner. The lookup then hits the rpc timeout, which may result in a number of retries. This retry storm (across many client instances) can further exacerbate the meta hotspotting.
My thought is to decouple the meta rpc timeout from user rpc timeouts, because you would generally prefer to let a longer meta request succeed, since it may unblock many user requests. I think our current timeouts for meta scans are a bit confusing. There is a hbase.client.meta.operation.timeout, but it does not actually apply to meta scans. Instead, meta scans are configured via hbase.rpc.timeout and hbase.client.scanner.timeout.period. I am considering special-casing meta scans so that they are configured via a (new) hbase.client.meta.rpc.timeout and the (existing) hbase.client.meta.operation.timeout. This would differ from typical scan requests, but may be more intuitive overall. Does anyone have any opinions? See https://issues.apache.org/jira/browse/HBASE-27078
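
To make the proposal a bit more concrete, here is a rough, self-contained sketch of how the per-call rpc timeout might be resolved if meta scans had their own key. It does not use the HBase client classes. A plain Map stands in for the configuration, and MetaTimeoutSketch, rpcTimeoutMs and parseOrDefault are hypothetical names made up for illustration. Falling back to hbase.rpc.timeout when the new key is unset is my assumption, not settled behavior, and the values 500, 10000 and 60000 are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

public class MetaTimeoutSketch {
    // Existing client key mentioned in this thread.
    static final String RPC_TIMEOUT = "hbase.rpc.timeout";
    // Key proposed in this thread for meta scans.
    static final String META_RPC_TIMEOUT = "hbase.client.meta.rpc.timeout";

    // Hypothetical helper: picks the rpc timeout for a single call.
    // User calls read hbase.rpc.timeout. Meta scans read the proposed
    // meta key and (by assumption) fall back to the user timeout if unset.
    static long rpcTimeoutMs(Map<String, String> conf, boolean metaScan, long defaultMs) {
        long userTimeout = parseOrDefault(conf.get(RPC_TIMEOUT), defaultMs);
        if (!metaScan) {
            return userTimeout;
        }
        return parseOrDefault(conf.get(META_RPC_TIMEOUT), userTimeout);
    }

    // Parses a millisecond value, or returns the fallback when the key is absent.
    static long parseOrDefault(String value, long fallback) {
        return value == null ? fallback : Long.parseLong(value.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(RPC_TIMEOUT, "500");        // aggressive timeout for user requests
        conf.put(META_RPC_TIMEOUT, "10000"); // more patient timeout for meta scans

        System.out.println("user rpc timeout ms: " + rpcTimeoutMs(conf, false, 60000L));
        System.out.println("meta rpc timeout ms: " + rpcTimeoutMs(conf, true, 60000L));
    }
}
```

The point of the fallback is that existing deployments would see no change in behavior unless they explicitly set the new key, while a service with a low user-facing timeout could still give meta lookups more room to complete.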
