[jira] [Created] (HBASE-27975) Region (un)assignment should have a more direct timeout

2023-07-14 Thread Ray Mattingly (Jira)
Ray Mattingly created HBASE-27975:
-

 Summary: Region (un)assignment should have a more direct timeout
 Key: HBASE-27975
 URL: https://issues.apache.org/jira/browse/HBASE-27975
 Project: HBase
  Issue Type: Improvement
Reporter: Ray Mattingly


h3. Problem

We've observed a few cases in which region (un)assignment can hang for 
significant, and sometimes seemingly indefinite, periods of time. This results 
in unpredictably long downtime which must be remediated via manually initiated 
ServerCrashProcedures.
h3. Example 1

If a RS is unable to communicate with the NameNode and it is asked to close a 
region then its RS_CLOSE_REGION thread will get stuck awaiting a NN failover. 
Because of the default values of several options, such as:
 * hbase.hstore.flush.retries.number
 * hbase.server.pause
 * dfs.client.failover.max.attempts
 * dfs.client.failover.sleep.base.millis

this region unassignment attempt will hang for approximately 30 minutes before 
it allows the failure to bubble up and automatically trigger a 
ServerCrashProcedure.

One can tune the aforementioned options to reduce the TTR here, but doing so 
is neither an obvious nor a direct solution.
h3. Example 2

In rare cases our public cloud provider may supply us with machines that have 
degraded hardware. If we're unable to catch this degradation prior to startup, 
then we've observed that the degraded RegionServer process may still come 
online. It will then be assigned regions that it often can never successfully 
open. If the RegionServer's assignment handling hangs instead of failing 
explicitly, nothing intervenes from outside and the assignment stays stuck 
indefinitely. I've written [a unit 
test|https://github.com/apache/hbase/compare/master...HubSpot:hbase:rsit-opening-repro]
 which reproduces this behavior. The same branch also has a unit test 
demonstrating that a timeout placed on the AssignRegionHandler helps it fail 
fast and reliably trigger the necessary ServerCrashProcedure.
h3. Proposal

I want to propose that we add optional and configurable timeouts to the 
AssignRegion and UnassignRegion event handlers.

This would let us deliberately and explicitly bound long-running retries in 
these downtime-inducing procedures, which could improve our reliability in 
both examples.
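
As a rough illustration of the idea, here is a minimal sketch of bounding a unit of work with an optional timeout. It is not the AssignRegionHandler or UnassignRegionHandler code; the class `TimedHandlerSketch`, the method `runWithTimeout`, and the `Outcome` enum are hypothetical names, and a timeout of zero or less is assumed to mean "disabled".

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names are HBase APIs.
class TimedHandlerSketch {
  enum Outcome { COMPLETED, FAILED, TIMED_OUT }

  // Runs the work and reports how it ended. A timeout <= 0 is treated as
  // "timeout disabled" (an assumption of this sketch), so the work runs inline.
  static Outcome runWithTimeout(Runnable work, long timeoutMs) {
    if (timeoutMs <= 0) {
      try {
        work.run();
        return Outcome.COMPLETED;
      } catch (RuntimeException e) {
        return Outcome.FAILED;
      }
    }
    // Daemon thread so that work which ignores interruption cannot block JVM exit.
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "timed-handler-sketch");
      t.setDaemon(true);
      return t;
    });
    Future<?> future = executor.submit(work);
    try {
      future.get(timeoutMs, TimeUnit.MILLISECONDS);
      return Outcome.COMPLETED;
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the worker; the caller can now fail fast
      return Outcome.TIMED_OUT;
    } catch (ExecutionException e) {
      return Outcome.FAILED; // the work itself threw
    } catch (InterruptedException e) {
      future.cancel(true);
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      return Outcome.FAILED;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // Simulates a handler that hangs far longer than the configured timeout.
    Runnable hung = () -> {
      try {
        Thread.sleep(60_000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    System.out.println(runWithTimeout(() -> { }, 1_000));
    System.out.println(runWithTimeout(hung, 200));
  }
}
```

In the scenarios above, a `TIMED_OUT` result would be the point at which the failure is surfaced instead of retried further. Note that interruption only stops work that responds to it, so a real handler may need additional handling for threads blocked in non-interruptible calls.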



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27974) CompactionServer cause the loss of HFile references in snapshot

2023-07-14 Thread Zhuoyue Huang (Jira)
Zhuoyue Huang created HBASE-27974:
-

 Summary: CompactionServer cause the loss of HFile references in 
snapshot
 Key: HBASE-27974
 URL: https://issues.apache.org/jira/browse/HBASE-27974
 Project: HBase
  Issue Type: Sub-task
Reporter: Zhuoyue Huang
Assignee: Zhuoyue Huang





