busbey commented on a change in pull request #2084:
URL: https://github.com/apache/hbase/pull/2084#discussion_r456858152



##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
##########
@@ -846,10 +849,37 @@ private void 
finishActiveMasterInitialization(MonitoredTask status)
     if (isStopped()) return;
 
     status.setStatus("Submitting log splitting work for previously failed 
region servers");
+
+    // grab the list of procedures once. SCP fom pre-crash should all be 
loaded, and can't progress
+    // until AM joins the cluster any SCPs that got added after we get the log 
folder list should be
+    // for a different start code.
+    final Set<ServerName> alreadyHasSCP = new HashSet<>();
+    long scpCount = 0;
+    for (ProcedureInfo procInfo : this.procedureExecutor.listProcedures() ) {
+      final Procedure proc = 
this.procedureExecutor.getProcedure(procInfo.getProcId());
+      if (proc != null) {
+        if (proc instanceof ServerCrashProcedure && !(proc.isFinished() || 
proc.isSuccess())) {
+          scpCount++;
+          alreadyHasSCP.add(((ServerCrashProcedure)proc).getServerName());
+        }
+      }
+    }
+    LOG.info("Restored proceduces include " + scpCount + " SCP covering " + 
alreadyHasSCP.size() +
+        " ServerName.");
+    
+ 
+    LOG.info("Checking " + previouslyFailedServers.size() + " previously 
failed servers (seen via wals) for existing SCP.");
+    // AM should be in "not yet init" and these should all be queued
     // Master has recovered hbase:meta region server and we put
     // other failed region servers in a queue to be handled later by SSH
     for (ServerName tmpServer : previouslyFailedServers) {
-      this.serverManager.processDeadServer(tmpServer, true);
+      if (alreadyHasSCP.contains(tmpServer)) {
+        LOG.info("Skipping failed server in FS because it already has a queued 
SCP: " + tmpServer);
+        this.serverManager.getDeadServers().add(tmpServer);

Review comment:
       does a queued SCP imply that the server should already be in the dead 
servers list? Or do we only add servers to that when we create an SCP and not 
when we recover them?

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
##########
@@ -846,10 +849,37 @@ private void 
finishActiveMasterInitialization(MonitoredTask status)
     if (isStopped()) return;
 
     status.setStatus("Submitting log splitting work for previously failed 
region servers");
+
+    // grab the list of procedures once. SCP fom pre-crash should all be 
loaded, and can't progress
+    // until AM joins the cluster any SCPs that got added after we get the log 
folder list should be
+    // for a different start code.
+    final Set<ServerName> alreadyHasSCP = new HashSet<>();
+    long scpCount = 0;
+    for (ProcedureInfo procInfo : this.procedureExecutor.listProcedures() ) {
+      final Procedure proc = 
this.procedureExecutor.getProcedure(procInfo.getProcId());
+      if (proc != null) {
+        if (proc instanceof ServerCrashProcedure && !(proc.isFinished() || 
proc.isSuccess())) {
+          scpCount++;
+          alreadyHasSCP.add(((ServerCrashProcedure)proc).getServerName());
+        }
+      }
+    }
+    LOG.info("Restored proceduces include " + scpCount + " SCP covering " + 
alreadyHasSCP.size() +
+        " ServerName.");
+    
+ 
+    LOG.info("Checking " + previouslyFailedServers.size() + " previously 
failed servers (seen via wals) for existing SCP.");
+    // AM should be in "not yet init" and these should all be queued
     // Master has recovered hbase:meta region server and we put
     // other failed region servers in a queue to be handled later by SSH
     for (ServerName tmpServer : previouslyFailedServers) {
-      this.serverManager.processDeadServer(tmpServer, true);
+      if (alreadyHasSCP.contains(tmpServer)) {
+        LOG.info("Skipping failed server in FS because it already has a queued 
SCP: " + tmpServer);
+        this.serverManager.getDeadServers().add(tmpServer);

Review comment:
      this looks like what's different from my old patch — is that right? Have 
I missed anything else?

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
##########
@@ -681,11 +681,15 @@ public synchronized void processDeadServer(final 
ServerName serverName, boolean
     // the handler threads and meta table could not be re-assigned in case
     // the corresponding server is down. So we queue them up here instead.
     if (!services.getAssignmentManager().isFailoverCleanupDone()) {
+      LOG.debug("AssignmentManager isn't done cleaning up from failover. 
Requeue server " + serverName);
       requeuedDeadServers.put(serverName, shouldSplitWal);
       return;
     }
 
+    // we don't chck if deadservers already included?
+    // when a server is already in the dead server list (including start code) 
do we need to schedule an SCP?

Review comment:
       these comments should be removed.

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
##########
@@ -681,11 +681,15 @@ public synchronized void processDeadServer(final 
ServerName serverName, boolean
     // the handler threads and meta table could not be re-assigned in case
     // the corresponding server is down. So we queue them up here instead.
     if (!services.getAssignmentManager().isFailoverCleanupDone()) {
+      LOG.debug("AssignmentManager isn't done cleaning up from failover. 
Requeue server " + serverName);
       requeuedDeadServers.put(serverName, shouldSplitWal);
       return;
     }
 
+    // we don't chck if deadservers already included?
+    // when a server is already in the dead server list (including start code) 
do we need to schedule an SCP?
     this.deadservers.add(serverName);
+    // scheduled an SCP means AM must be going?

Review comment:
       this one should be removed too

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
##########
@@ -860,6 +890,8 @@ private void finishActiveMasterInitialization(MonitoredTask 
status)
 
     // Fix up assignment manager status
     status.setStatus("Starting assignment manager");
+    // die somewhere in here for SCP flood I think.

Review comment:
       we can leave out the speculative comments now?

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
##########
@@ -2733,6 +2766,7 @@ public boolean isServerCrashProcessingEnabled() {
     return serverCrashProcessingEnabled.isReady();
   }
 
+  // XXX what

Review comment:
       lol. I still find this curious, but I do not think we need the comment 
now.

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java
##########
@@ -426,6 +439,9 @@ private void prepareLogReplay(final MasterProcedureEnv env, 
final Set<HRegionInf
     MasterFileSystem mfs = env.getMasterServices().getMasterFileSystem();
     AssignmentManager am = env.getMasterServices().getAssignmentManager();
     mfs.prepareLogReplay(this.serverName, regions);
+    // If the master doesn't fail, we'll set this again in SERVER_CRASH_ASSIGN
+    // can we skip doing it here? depends on how fast another observer
+    // needs to see that things were processed since we yield between now and 
then.

Review comment:
       what's the word on these two cases? could we just set the log split 
state in SERVER_CRASH_ASSIGN? who else looks at whether or not the logs were 
processed?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to