[ 
https://issues.apache.org/jira/browse/HBASE-18261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Agashe updated HBASE-18261:
---------------------------------
    Comment: was deleted

(was: Hi [~stack], [~yangzhe1991]:

FWICS here is the root cause:

The UT tests ServerCrashProcedure when the RS carrying the meta region crashes. 
It also simulates a master crash after executing each step in the procedure.

Initially all RS are at the same version, i.e. 3.0.0-SNAPSHOT. 
HMaster.getRegionServerVersion() returns version 0.0.0 for the dead RS (the one 
carrying meta). This makes AssignmentManager.getExcludedServersForSystemTable() 
return a non-empty list, so the logic in 
AssignmentManager.checkIfShouldMoveSystemRegionAsync() is triggered, which in 
turn submits a MoveRegionProcedure to move the meta region from the RS with 
version 0.0.0 to one of the other RS with the latest version.

As commented before, this causes a race condition between the scan and 
MoveRegionProcedure.

AssignmentManager.getExcludedServersForSystemTable() uses 
master.getServerManager().getOnlineServersList() so that it only considers 
online servers. But on further scrutiny of the code and logs I found that a 
server can be online and dead at the same time!
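
To make the failure mode concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not HBase code: the class ExcludedServersSketch, its methods, and the string server names are hypothetical, and the exclusion rule (exclude every server whose version is below the highest version seen) is an assumption about what getExcludedServersForSystemTable() does. The only facts taken from this comment are that a dead RS reports version 0.0.0 and that a server can appear in the online list while also being dead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, simplified model of the scenario; not the HBase implementation.
class ExcludedServersSketch {

  // Assumption: mirrors the observation that a dead RS reports 0.0.0,
  // while every live RS in the test runs 3.0.0-SNAPSHOT.
  static String versionOf(String server, Set<String> dead) {
    return dead.contains(server) ? "0.0.0" : "3.0.0-SNAPSHOT";
  }

  // Numeric comparison of dotted versions; any "-SNAPSHOT" style suffix is ignored.
  static int compareVersions(String a, String b) {
    String[] pa = a.split("-")[0].split("\\.");
    String[] pb = b.split("-")[0].split("\\.");
    for (int i = 0; i < Math.max(pa.length, pb.length); i++) {
      int x = i < pa.length ? Integer.parseInt(pa[i]) : 0;
      int y = i < pb.length ? Integer.parseInt(pb[i]) : 0;
      if (x != y) {
        return Integer.compare(x, y);
      }
    }
    return 0;
  }

  // Returns the servers whose version is lower than the highest version seen.
  // With skipDead == true, servers that are also marked dead are dropped first.
  static List<String> excluded(List<String> online, Set<String> dead, boolean skipDead) {
    List<String> candidates = online.stream()
        .filter(s -> !skipDead || !dead.contains(s))
        .collect(Collectors.toList());
    if (candidates.isEmpty()) {
      return new ArrayList<>();
    }
    String highest = candidates.stream()
        .map(s -> versionOf(s, dead))
        .max(ExcludedServersSketch::compareVersions)
        .get();
    return candidates.stream()
        .filter(s -> compareVersions(versionOf(s, dead), highest) < 0)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // rs1 carried meta and crashed, but is still present in the online list.
    List<String> online = Arrays.asList("rs1", "rs2", "rs3");
    Set<String> dead = new HashSet<>(Arrays.asList("rs1"));

    System.out.println(excluded(online, dead, false)); // prints [rs1]
    System.out.println(excluded(online, dead, true));  // prints []
  }
}
```

Without the dead-server filter the excluded list is non-empty even though every live RS runs the same version, which is the condition that would trigger the move of the meta region.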

IMO, 
* Currently meta is re/assigned from ServerCrashProcedure, from 
MasterMetaBootstrap during master initialization, and then again in 
checkIfShouldMoveSystemRegionAsync().
* That means meta re/assignment may be attempted up to 3 times under certain 
conditions.
* I am working on HBASE-18261 to have the meta recovery/assignment logic in one 
place.
* I think we can pull these changes for assigning meta to the RS with the 
highest version number into that work.
* As a result, the RS with the highest version number will be considered for 
meta region assignment:
# When the RS carrying the meta region crashes
# During master startup

Along with the above changes, we obviously need to fix 
ServerManager.isServerOnline() and ServerManager.isServerDead() returning true 
at the same time. This could be a result of the test code simulating a crash, 
but the class itself should not allow this case (IMHO).

I have the following fix ready (and tested). It fixes the test, but I don't 
consider it a long-term fix.
{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
index 046612a..1a2d53b 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
@@ -1760,6 +1760,7 @@ public class AssignmentManager implements ServerListener {
   public List<ServerName> getExcludedServersForSystemTable() {
     List<Pair<ServerName, String>> serverList = master.getServerManager().getOnlineServersList()
         .stream()
+        .filter((s)->!master.getServerManager().isServerDead(s))
         .map((s)->new Pair<>(s, master.getRegionServerVersion(s)))
         .collect(Collectors.toList());
     if (serverList.isEmpty()) {
{code}

[~stack], as you suggested, we can disable the test for now and re-enable it 
once we agree on a fix. Let me know your thoughts. Thanks!)

> [AMv2] Create new RecoverMetaProcedure and use it from ServerCrashProcedure 
> and HMaster.finishActiveMasterInitialization()
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-18261
>                 URL: https://issues.apache.org/jira/browse/HBASE-18261
>             Project: HBase
>          Issue Type: Improvement
>          Components: amv2
>    Affects Versions: 2.0.0-alpha-1
>            Reporter: Umesh Agashe
>            Assignee: Umesh Agashe
>             Fix For: 2.0.0-alpha-2
>
>         Attachments: HBASE-18261.master.001.patch
>
>
> When unit test 
> hbase.master.procedure.TestServerCrashProcedure#testRecoveryAndDoubleExecutionOnRsWithMeta()
> is enabled and run several times, it fails intermittently. The cause is that 
> meta recovery is done in two different places:
> * ServerCrashProcedure.processMeta()
> * HMaster.finishActiveMasterInitialization()
> and the two are not coordinated.
> When HMaster.finishActiveMasterInitialization() submits splitMetaLog() 
> first, a call from ServerCrashProcedure.processMeta() made while it is still 
> running fails, causing the step to be retried in a loop.
> When ServerCrashProcedure.processMeta() submits splitMetaLog after the 
> splitMetaLog from HMaster.finishActiveMasterInitialization() has finished, 
> success is returned without doing any work.
> But if ServerCrashProcedure.processMeta() submits a splitMetaLog request and 
> HMaster.finishActiveMasterInitialization() submits its own while that request 
> is still in progress, the test fails with an exception.
> [~stack] and I discussed the possible solution:
> Create RecoverMetaProcedure and call it where required. Procedure framework 
> provides mutual exclusion and requires idempotence, which should fix the 
> problem.
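
The two properties named in the description, mutual exclusion and idempotence, can be illustrated with a minimal sketch. This is not the HBase Procedure framework API: the class RecoverMetaSketch and its methods are hypothetical, a plain ReentrantLock stands in for whatever locking the framework provides, and a counter stands in for the real work of splitting the meta log. It only shows why two uncoordinated callers become safe once the recovery step is serialized and repeatable.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; not the HBase Procedure framework.
class RecoverMetaSketch {
  // Mutual exclusion: only one caller runs the recovery step at a time.
  private final ReentrantLock lock = new ReentrantLock();
  // Guarded by lock: records that the meta log has already been split.
  private boolean metaLogSplit = false;
  // Counts how often the (simulated) split work actually ran.
  private final AtomicInteger splitCount = new AtomicInteger();

  // Idempotent: returns true if this call did the work, false if it was already done.
  boolean recoverMeta() {
    lock.lock();
    try {
      if (metaLogSplit) {
        return false; // a repeated call is a harmless no-op
      }
      splitCount.incrementAndGet(); // stands in for splitting the meta log
      metaLogSplit = true;
      return true;
    } finally {
      lock.unlock();
    }
  }

  int splitCount() {
    return splitCount.get();
  }

  // Simulates the crash-handling path and the master-initialization path
  // requesting recovery concurrently; returns how often the work ran.
  static int runTwoCallers() {
    RecoverMetaSketch procedure = new RecoverMetaSketch();
    Thread crashPath = new Thread(() -> procedure.recoverMeta());
    Thread initPath = new Thread(() -> procedure.recoverMeta());
    crashPath.start();
    initPath.start();
    try {
      crashPath.join();
      initPath.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for callers", e);
    }
    return procedure.splitCount();
  }

  public static void main(String[] args) {
    System.out.println("split ran " + runTwoCallers() + " time(s)"); // prints split ran 1 time(s)
  }
}
```

Whichever caller arrives second finds the work already done and returns without failing, so neither the retry loop nor the exception described above can occur in this simplified model.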



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
