[ https://issues.apache.org/jira/browse/HBASE-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997200#comment-15997200 ]

Josh Elser commented on HBASE-16488:
------------------------------------

{noformat}
@@ -2599,11 +2625,26 @@ public class HMaster extends HRegionServer implements MasterServices, Server {
 
   void checkNamespaceManagerReady() throws IOException {
     checkInitialized();
-    if (tableNamespaceManager == null ||
-        !tableNamespaceManager.isTableAvailableAndInitialized(true)) {
+    if (tableNamespaceManager == null) {
       throw new IOException("Table Namespace Manager not ready yet, try again later");
+    } else if (!tableNamespaceManager.isTableAvailableAndInitialized(true)) {
+      try {
+        // Wait some time.
+        long startTime = EnvironmentEdgeManager.currentTime();
+        int timeout = conf.getInt("hbase.master.namespace.waitforready", 30000);
+        while (!tableNamespaceManager.isTableNamespaceManagerStarted() &&
+            EnvironmentEdgeManager.currentTime() - startTime < timeout) {
+          Thread.sleep(100);
+        }
+      } catch (InterruptedException e) {
+        throw (InterruptedIOException) new InterruptedIOException().initCause(e);
+      }
+      if (!tableNamespaceManager.isTableNamespaceManagerStarted()) {
+        throw new IOException("Table Namespace Manager not fully initialized, try again later");
+      }
     }
   }
{noformat}

This sits a little funny with me. Ideally, we'd have the caller do the sleeping
so that we're not blocking a thread inside the Master (or, worse, an RPC
handler). Your change here is definitely easier to implement, but I wonder how
hard it would be to keep throwing the exception here and implement the retry
logic in the callers (other methods in HMaster, or the HBase client).
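A minimal sketch of the caller-side alternative, assuming the master keeps
throwing the "not ready yet, try again later" IOException immediately instead
of sleeping. {{NamespaceReadyRetry}}, {{MasterCall}} and {{callWithRetries}}
are hypothetical names, not HBase client APIs; the point is only that the sleep
happens on the caller's thread.

```java
import java.io.IOException;

/**
 * Hypothetical sketch (not an HBase API): the master fails fast with an
 * IOException while the namespace manager is not ready, and the caller owns
 * the sleep-and-retry loop, so no master or RPC handler thread is parked.
 */
final class NamespaceReadyRetry {

    /** A master call that may fail with "not ready yet, try again later". */
    @FunctionalInterface
    interface MasterCall<T> {
        T call() throws IOException;
    }

    private static final long MAX_PAUSE_MS = 10_000L;

    private NamespaceReadyRetry() {}

    /**
     * Runs {@code op} up to {@code maxAttempts} times, sleeping with capped
     * exponential backoff between failures, and rethrows the last IOException
     * if every attempt fails. For simplicity every IOException is treated as
     * retriable; a real client would retry only the "not ready" error.
     */
    static <T> T callWithRetries(MasterCall<T> op, int maxAttempts, long initialPauseMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long pauseMs = initialPauseMs;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMs);                          // caller's thread waits
                    pauseMs = Math.min(pauseMs * 2, MAX_PAUSE_MS);  // back off, capped
                }
            }
        }
        throw last;
    }
}
```

Capping the backoff keeps a slow namespace assignment from pushing individual
pauses arbitrarily high, while the attempt limit bounds the total wait.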

Unrelated: shouldn't {{tableNamespaceManager}} be volatile if we're checking it 
across different threads? Or, make it final and use an {{AtomicReference}}?
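A sketch of the two options, with {{Manager}} as a hypothetical stand-in for
{{TableNamespaceManager}}. Either way, reading the field once into a local
means the null check and the later calls are guaranteed to see the same object.

```java
import java.util.concurrent.atomic.AtomicReference;

/**
 * Two ways to safely publish a manager that is written by an initialization
 * thread and read by RPC handler threads. Manager is a hypothetical stand-in
 * for TableNamespaceManager.
 */
final class ManagerHolder {
    static final class Manager {
        final String name;
        Manager(String name) { this.name = name; }
    }

    // Option 1: volatile field. The write in the init thread happens-before
    // any later read that observes the written value.
    private volatile Manager volatileManager;

    // Option 2: final AtomicReference. Same visibility guarantee, plus CAS.
    private final AtomicReference<Manager> managerRef = new AtomicReference<>();

    void publishVolatile(Manager m) { volatileManager = m; }

    /** Reads the field once so the null check and the use see the same value. */
    String volatileName() {
        Manager m = volatileManager;
        return m == null ? null : m.name;
    }

    /** Publishes only if nothing was published yet; returns whether this call won. */
    boolean publishOnce(Manager m) { return managerRef.compareAndSet(null, m); }

    String atomicName() {
        Manager m = managerRef.get();
        return m == null ? null : m.name;
    }
}
```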

{noformat}

@@ -93,7 +94,7 @@ public class TableNamespaceManager {
       long startTime = EnvironmentEdgeManager.currentTime();
       int timeout = conf.getInt(NS_INIT_TIMEOUT, DEFAULT_NS_INIT_TIMEOUT);
       while (!isTableAvailableAndInitialized(false)) {
-        if (EnvironmentEdgeManager.currentTime() - startTime + 100 > timeout) {
+        if (EnvironmentEdgeManager.currentTime() - startTime > timeout) {
           // We can't do anything if ns is not online.
           throw new IOException("Timedout " + timeout + "ms waiting for 
namespace table to " +
             "be assigned");
{noformat}

Do you know why we were previously adding 100ms to this "runtime"?
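One possible reason, assuming the loop body (not shown in the hunk) sleeps
100ms between checks as the similar loop in the HMaster hunk does: a small
simulation suggests the old check gives up at or before the deadline, while the
new check overshoots it by up to one poll interval. {{TimeoutCheck}} and its
methods are hypothetical; the two conditions are copied from the diff.

```java
/**
 * Compares the old and new give-up conditions from the TableNamespaceManager
 * hunk. The simulation assumes a loop of the form
 *   while (!ready) { if (giveUp) throw ...; sleep(pollMs); }
 * with zero-cost checks and exact sleeps.
 */
final class TimeoutCheck {
    private TimeoutCheck() {}

    /** Old check: currentTime - startTime + 100 > timeout. */
    static boolean oldGivesUp(long elapsedMs, long timeoutMs) {
        return elapsedMs + 100 > timeoutMs;
    }

    /** New check: currentTime - startTime > timeout. */
    static boolean newGivesUp(long elapsedMs, long timeoutMs) {
        return elapsedMs > timeoutMs;
    }

    /** Elapsed time at which the loop throws, if the table never comes online. */
    static long simulatedGiveUpMs(long timeoutMs, long pollMs, boolean useOldCheck) {
        if (pollMs <= 0) {
            throw new IllegalArgumentException("pollMs must be positive");
        }
        long elapsed = 0;
        while (true) {
            boolean giveUp = useOldCheck
                    ? oldGivesUp(elapsed, timeoutMs)
                    : newGivesUp(elapsed, timeoutMs);
            if (giveUp) {
                return elapsed;
            }
            elapsed += pollMs; // stand-in for Thread.sleep(pollMs)
        }
    }
}
```

With a 1000ms timeout and 100ms polls, the old check throws at 1000ms and the
new one at 1100ms.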

{noformat}
diff --git hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
index f60be66..c75d4bc 100644
--- hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
+++ hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
@@ -105,6 +105,7 @@ import org.apache.hadoop.hbase.security.HBaseKerberosUtils;
 import org.apache.hadoop.hbase.security.User;
 import org.apache.hadoop.hbase.security.visibility.VisibilityLabelsCache;
 import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;
 import org.apache.hadoop.hbase.util.FSTableDescriptors;
 import org.apache.hadoop.hbase.util.FSUtils;
 import org.apache.hadoop.hbase.util.JVMClusterUtil;
@@ -1459,6 +1460,7 @@ public class HBaseTestingUtility extends HBaseCommonTestingUtility {
           .setMaxVersions(numVersions);
       desc.addFamily(hcd);
     }
+    waitUntilTableNamespaceManagerStarted();
     getHBaseAdmin().createTable(desc, startKey, endKey, numRegions);
     // HBaseAdmin only waits for regions to appear in hbase:meta we should wait until they are assigned
     waitUntilAllRegionsAssigned(tableName);
@@ -1497,6 +1499,7 @@ public class HBaseTestingUtility extends HBaseCommonTestingUtility {
       hcd.setBloomFilterType(BloomType.NONE);
       htd.addFamily(hcd);
     }
+    waitUntilTableNamespaceManagerStarted();
     getHBaseAdmin().createTable(htd, splitKeys);
     // HBaseAdmin only waits for regions to appear in hbase:meta we should wait until they are
     // assigned
{noformat}

Could we do this once in {{MiniHBaseCluster startMiniHBaseCluster(int, int, Class, Class, boolean, boolean)}}
instead of scattering it across HBaseTestingUtility?
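A heavily simplified sketch of that suggestion. {{MiniClusterSketch}} and its
methods are hypothetical and only mirror the shape of
{{startMiniHBaseCluster(...)}} and the {{createTable}} helpers; the readiness
wait lives in {{start}}, so the table-creation helpers need no wait of their own.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical test harness (not the real HBaseTestingUtility or
 * MiniHBaseCluster API): wait for the namespace manager once at startup
 * instead of before every createTable call.
 */
final class MiniClusterSketch {
    private final BooleanSupplier namespaceManagerStarted;
    private int readinessWaits;

    MiniClusterSketch(BooleanSupplier namespaceManagerStarted) {
        this.namespaceManagerStarted = namespaceManagerStarted;
    }

    /** Stand-in for startMiniHBaseCluster(...): start processes, then wait once. */
    void start(long timeoutMs) {
        // ... start master and region servers here ...
        waitFor(namespaceManagerStarted, timeoutMs);
        readinessWaits++;
    }

    /** Table helpers no longer repeat the wait; they rely on start() having done it. */
    void createTable(String tableName) {
        // ... admin.createTable(...) and wait for region assignment here ...
    }

    int readinessWaits() { return readinessWaits; }

    /** Polls the condition every 10ms until it holds or the timeout elapses. */
    static void waitFor(BooleanSupplier condition, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException("Timed out waiting for namespace manager");
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }
}
```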

Nice test additions!

> Starting namespace and quota services in master startup asynchronizely
> ----------------------------------------------------------------------
>
>                 Key: HBASE-16488
>                 URL: https://issues.apache.org/jira/browse/HBASE-16488
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 2.0.0, 1.3.0, 1.0.3, 1.4.0, 1.1.5, 1.2.2
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>         Attachments: HBASE-16488.v1-branch-1.patch, 
> HBASE-16488.v1-master.patch, HBASE-16488.v2-branch-1.patch, 
> HBASE-16488.v2-branch-1.patch, HBASE-16488.v3-branch-1.patch, 
> HBASE-16488.v3-branch-1.patch, HBASE-16488.v4-branch-1.patch, 
> HBASE-16488.v5-branch-1.patch, HBASE-16488.v6-branch-1.patch
>
>
> From time to time, in internal IT tests and at customer sites, we see master
> initialization fail because the namespace table region takes a long time to
> assign (e.g. log splitting is slow or hangs, an RS is temporarily
> unavailable, or some unknown assignment issue occurs).  In the past, there
> were proposals to improve this situation, e.g. HBASE-13556 / HBASE-14190
> (Assign system tables ahead of user region assignment), HBASE-13557 (Special
> WAL handling for system tables), or HBASE-14623 (Implement dedicated WAL for
> system tables).
> This JIRA proposes another way to solve this master initialization failure:
> the namespace service is only used by a handful of operations (e.g. create
> table / namespace DDL / get namespace API / some RS group DDL).  Only the
> quota manager depends on it, and quota management is off by default.
> Therefore, the namespace service is not really needed for the master to be
> functional, so we could start it asynchronously without blocking master
> startup.
>  
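The approach in the description above can be sketched with a
{{CompletableFuture}}: master initialization kicks off the namespace service
and moves on, and only namespace-dependent operations check readiness.
{{AsyncNamespaceStartup}} and its methods are hypothetical, not HBase code.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Hypothetical sketch (not HBase code): master initialization starts the
 * namespace service in the background and returns immediately; only the
 * operations that need namespaces check readiness.
 */
final class AsyncNamespaceStartup {
    private final CompletableFuture<Void> namespaceReady;

    AsyncNamespaceStartup(Runnable startNamespaceService, Executor executor) {
        // Does not block: the future completes when startup finishes or fails.
        this.namespaceReady = CompletableFuture.runAsync(startNamespaceService, executor);
    }

    /** Fail-fast check for namespace-dependent operations (create table, namespace DDL, ...). */
    void checkNamespaceReady() throws IOException {
        if (!namespaceReady.isDone()) {
            throw new IOException("Namespace service not ready yet, try again later");
        }
        try {
            namespaceReady.join(); // already done, so this returns or throws immediately
        } catch (CompletionException e) {
            throw new IOException("Namespace service failed to start", e.getCause());
        }
    }

    /** Bounded wait, e.g. for test utilities; true only if startup succeeded in time. */
    boolean awaitReady(long timeoutMs) {
        try {
            namespaceReady.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException | TimeoutException e) {
            return false;
        }
    }
}
```

The fail-fast {{checkNamespaceReady}} fits the caller-side retry idea raised in
the review, and {{awaitReady}} is the kind of bounded wait a test utility could
call once at cluster startup.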



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
