Rushabh Shah created HBASE-27383:
------------------------------------
Summary: Add dead region server to SplitLogManager#deadWorkers set
as the first step.
Key: HBASE-27383
URL: https://issues.apache.org/jira/browse/HBASE-27383
Project: HBase
Issue Type: Bug
Affects Versions: 1.7.2, 1.6.0
Reporter: Rushabh Shah
Assignee: Rushabh Shah
Currently we add a dead region server to the +SplitLogManager#deadWorkers+ set
in the SERVER_CRASH_SPLIT_LOGS state.
Consider a case where a region server is handling the split log task for the
hbase:meta table, and SplitLogManager has exhausted all its retries and won't
resubmit the task to any other region server.
The region server which is handling the split log task has died.
We have a check in SplitLogManager: if a region server is declared dead and
that region server is responsible for a split log task, then we forcefully
resubmit the split log task. See the code
[here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java#L721-L726].
But we add a region server to the SplitLogManager#deadWorkers set in the
[SERVER_CRASH_SPLIT_LOGS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L252]
state.
Before that, it runs the
[SERVER_CRASH_GET_REGIONS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L214]
state and checks whether the hbase:meta table is up. In this case, the
hbase:meta table was not online, and that prevented SplitLogManager from adding
this RS to the deadWorkers set. This created a deadlock, and the HBase cluster
was completely down for an extended period of time until we failed over the
active HMaster. See HBASE-27382 for more details.
Improvements:
1. We should add a dead region server to the +SplitLogManager#deadWorkers+ set
as the first step.
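
The ordering problem described in this issue can be illustrated with a small toy model. This is only a sketch and not HBase code: the class, the state enum, the method names, and the tick-based loop are all hypothetical, and it assumes (as a simplification) that hbase:meta comes online as soon as the stuck split log task is resubmitted. It compares registering the dead server only in the split-logs step with registering it as the first step.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: every class, field, and method name here is hypothetical
// and none of this is HBase code.
final class DeadWorkerOrderingSketch {

    // Simplified stand-ins for the crash-procedure states named in the issue.
    enum State { START, GET_REGIONS, SPLIT_LOGS, DONE }

    // Stand-in for the manager that owns the deadWorkers set.
    static final class ManagerModel {
        final Set<String> deadWorkers = new HashSet<>();
        String metaTaskOwner; // server holding the hbase:meta split log task

        // Models the periodic check: force-resubmit only if the owner is known dead.
        boolean resubmitIfOwnerDead() {
            if (metaTaskOwner != null && deadWorkers.contains(metaTaskOwner)) {
                metaTaskOwner = null; // task released so another worker can take it
                return true;
            }
            return false;
        }
    }

    // Returns true if the modeled crash procedure finishes within maxTicks.
    static boolean simulate(boolean registerDeadWorkerFirst, int maxTicks) {
        String crashed = "rs1";
        ManagerModel manager = new ManagerModel();
        manager.metaTaskOwner = crashed;
        boolean metaOnline = false;
        State state = State.START;

        for (int tick = 0; tick < maxTicks && state != State.DONE; tick++) {
            switch (state) {
                case START:
                    if (registerDeadWorkerFirst) {
                        manager.deadWorkers.add(crashed); // proposed ordering
                    }
                    state = State.GET_REGIONS;
                    break;
                case GET_REGIONS:
                    if (metaOnline) {
                        state = State.SPLIT_LOGS;
                    } // otherwise stay here and retry on the next tick
                    break;
                case SPLIT_LOGS:
                    manager.deadWorkers.add(crashed); // current ordering
                    state = State.DONE;
                    break;
                default:
                    break;
            }
            // Assumption: once the task is resubmitted, hbase:meta recovers at once.
            if (manager.resubmitIfOwnerDead()) {
                metaOnline = true;
            }
        }
        return state == State.DONE;
    }

    public static void main(String[] args) {
        System.out.println("register in SPLIT_LOGS: completes = " + simulate(false, 20));
        System.out.println("register first:         completes = " + simulate(true, 20));
    }
}
```

In this model, late registration never leaves the GET_REGIONS step because the resubmit check never sees the dead server, while early registration lets the check fire before the procedure waits on hbase:meta.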