Ray Mattingly created HBASE-27975:
-
Summary: Region (un)assignment should have a more direct timeout
Key: HBASE-27975
URL: https://issues.apache.org/jira/browse/HBASE-27975
Project: HBase
Issue Type: Improvement
Reporter: Ray Mattingly
h3. Problem
We've observed a few cases in which region (un)assignment can hang for
significant, and sometimes seemingly indefinite, periods of time. This results
in unpredictably long downtime which must be remediated via manually initiated
ServerCrashProcedures.
h3. Example 1
If a RS is unable to communicate with the NameNode and it is asked to close a
region then its RS_CLOSE_REGION thread will get stuck awaiting a NN failover.
With the default values of several options, such as:
* hbase.hstore.flush.retries.number
* hbase.server.pause
* dfs.client.failover.max.attempts
* dfs.client.failover.sleep.base.millis
this region unassignment attempt will hang for approximately 30 minutes before
it allows the failure to bubble up and automatically trigger a
ServerCrashProcedure.
One can tune the aforementioned options to reduce the time to recovery (TTR)
here, but that is neither an obvious nor a direct solution.
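To illustrate why tuning these options is an indirect fix, here is a rough sketch of how nested retries compound. It is a simplified model, not the actual HBase or HDFS retry logic: it assumes an outer retry loop with a fixed pause wrapping an inner loop with capped exponential backoff, and it ignores jitter. The class name, method names, and the {{capMillis}} parameter are hypothetical, and the values in {{main}} are placeholders rather than real defaults.

```java
class RetryBudgetSketch {
    /** Sum of capped, doubling backoff sleeps across the given number of attempts. */
    static long backoffTotalMillis(int attempts, long baseMillis, long capMillis) {
        long total = 0;
        long sleep = baseMillis;
        for (int i = 0; i < attempts; i++) {
            total += Math.min(sleep, capMillis);
            sleep = Math.min(sleep * 2, capMillis);
        }
        return total;
    }

    /**
     * Rough upper bound on time spent before the failure bubbles up: each outer
     * retry pays for a full inner backoff sequence plus one fixed pause.
     */
    static long worstCaseMillis(int outerRetries, long outerPauseMillis,
                                int innerAttempts, long baseMillis, long capMillis) {
        return outerRetries
            * (backoffTotalMillis(innerAttempts, baseMillis, capMillis) + outerPauseMillis);
    }

    public static void main(String[] args) {
        // Placeholder inputs only; substitute the values from your own configuration.
        long millis = worstCaseMillis(2, 1000L, 3, 500L, 15000L);
        System.out.println("modelled worst case (ms): " + millis);
    }
}
```

Because the outer and inner settings multiply, shortening the hang means changing several unrelated options at once, whereas a single timeout on the handler would bound it directly.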
h3. Example 2
In rare cases our public cloud provider may supply us with machines that have
degraded hardware. If we're unable to catch this degradation prior to startup,
then we've observed that the degraded RegionServer process may come online; as
a result it will be assigned regions which often can never actually be
opened successfully. If the RegionServer's assignment handling never fails
explicitly, then nothing prompts outside intervention, and the assignment
hangs indefinitely. I've written [a unit
test|https://github.com/apache/hbase/compare/master...HubSpot:hbase:rsit-opening-repro]
which reproduces this behavior. On this same branch is a unit test
demonstrating that a timeout placed on the AssignRegionHandler helps to fail
fast and reliably trigger the necessary ServerCrashProcedure.
h3. Proposal
I propose that we add optional, configurable timeouts to the AssignRegion and
UnassignRegion event handlers.
This would allow us to prevent long-running retries for these
downtime-inducing procedures much more intentionally and clearly, and could
consequently improve our reliability in both examples.
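As a minimal sketch of the idea, and not the actual patch, the handler's work could be run under a bounded wait using plain {{java.util.concurrent}}. The class and method names below are hypothetical, and treating a non-positive timeout as "disabled" is an assumption made here to keep the timeout optional.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class HandlerTimeoutSketch {
    /**
     * Runs the given work, waiting at most timeoutMillis for it to finish.
     * A timeout <= 0 disables the limit (assumed convention for "optional").
     * Throws TimeoutException if the limit is exceeded.
     */
    static <T> T runWithTimeout(Callable<T> work, long timeoutMillis) throws Exception {
        if (timeoutMillis <= 0) {
            return work.call();
        }
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(work);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupt the stuck work and let the caller decide how to fail.
                future.cancel(true);
                throw e;
            } catch (ExecutionException e) {
                // Surface the original failure rather than the wrapper.
                Throwable cause = e.getCause();
                if (cause instanceof Exception) {
                    throw (Exception) cause;
                }
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        String result = runWithTimeout(() -> "opened", 1000L);
        System.out.println(result);
        try {
            runWithTimeout(() -> {
                Thread.sleep(5000L); // stands in for a hung open/close
                return "never";
            }, 100L);
        } catch (TimeoutException e) {
            System.out.println("timed out; a real handler could fail fast here");
        }
    }
}
```

Note that interruption only helps if the blocked work responds to it; in a real handler, the timeout would more likely lead to aborting the RegionServer so that a ServerCrashProcedure takes over.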