[ https://issues.apache.org/jira/browse/HBASE-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997200#comment-15997200 ]

Josh Elser commented on HBASE-16488:
------------------------------------

{noformat}
@@ -2599,11 +2625,26 @@ public class HMaster extends HRegionServer implements MasterServices, Server {
   void checkNamespaceManagerReady() throws IOException {
     checkInitialized();
-    if (tableNamespaceManager == null ||
-        !tableNamespaceManager.isTableAvailableAndInitialized(true)) {
+    if (tableNamespaceManager == null) {
       throw new IOException("Table Namespace Manager not ready yet, try again later");
+    } else if (!tableNamespaceManager.isTableAvailableAndInitialized(true)) {
+      try {
+        // Wait some time.
+        long startTime = EnvironmentEdgeManager.currentTime();
+        int timeout = conf.getInt("hbase.master.namespace.waitforready", 30000);
+        while (!tableNamespaceManager.isTableNamespaceManagerStarted() &&
+            EnvironmentEdgeManager.currentTime() - startTime < timeout) {
+          Thread.sleep(100);
+        }
+      } catch (InterruptedException e) {
+        throw (InterruptedIOException) new InterruptedIOException().initCause(e);
+      }
+      if (!tableNamespaceManager.isTableNamespaceManagerStarted()) {
+        throw new IOException("Table Namespace Manager not fully initialized, try again later");
+      }
     }
   }
{noformat}

This sits a little funny with me. Ideally, we'd have the caller do the sleeping so that we're not blocking a thread inside of the Master (or worse an RPC handler). Your change here is definitely easier to implement, but I wonder how hard it would be to leave the exception throw and implement retry logic in the callers (other methods in HMaster or hbase client).

Unrelated: shouldn't {{tableNamespaceManager}} be volatile if we're checking it across different threads? Or, make it final and use an {{AtomicReference}}?
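
To make the caller-side retry suggestion concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical: {{NamespaceRetrySketch}}, {{NotReadyException}} and {{callWithRetry}} are illustrative names, not existing HBase classes, and the attempt count and pause would presumably come from configuration. The idea is only that the Master keeps throwing immediately and the caller decides how long to wait.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.Callable;

/** Hypothetical caller-side retry helper; none of these names exist in HBase. */
final class NamespaceRetrySketch {

  /** Stand-in for the "not ready yet, try again later" IOException thrown by the Master. */
  static final class NotReadyException extends IOException {
    NotReadyException(String msg) {
      super(msg);
    }
  }

  /**
   * Runs op, retrying only when it reports "not ready". Sleeps pauseMs between
   * attempts on the caller's thread, so no Master thread or RPC handler blocks.
   */
  static <T> T callWithRetry(Callable<T> op, int maxAttempts, long pauseMs) throws IOException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    NotReadyException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (NotReadyException e) {
        last = e; // retriable: remember it and fall through to the pause
      } catch (IOException | RuntimeException e) {
        throw e; // any other failure is not retried
      } catch (Exception e) {
        throw new IOException(e);
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(pauseMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt(); // preserve the interrupt status
          throw (InterruptedIOException) new InterruptedIOException().initCause(ie);
        }
      }
    }
    throw last; // every attempt reported "not ready"
  }
}
```

A caller would wrap the DDL call, e.g. {{callWithRetry(() -> doCreateTable(), 5, 100L)}}, where {{doCreateTable}} is again a placeholder for whatever operation needs the namespace manager.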

{noformat}
@@ -93,7 +94,7 @@ public class TableNamespaceManager {
     long startTime = EnvironmentEdgeManager.currentTime();
     int timeout = conf.getInt(NS_INIT_TIMEOUT, DEFAULT_NS_INIT_TIMEOUT);
     while (!isTableAvailableAndInitialized(false)) {
-      if (EnvironmentEdgeManager.currentTime() - startTime + 100 > timeout) {
+      if (EnvironmentEdgeManager.currentTime() - startTime > timeout) {
         // We can't do anything if ns is not online.
         throw new IOException("Timedout " + timeout + "ms waiting for namespace table to "
           + "be assigned");
{noformat}

Do you know the reason we were previously augmenting this "runtime" by 100ms?

{noformat}
diff --git hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
index f60be66..c75d4bc 100644
--- hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
+++ hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
@@ -105,6 +105,7 @@ import org.apache.hadoop.hbase.security.HBaseKerberosUtils;
 import org.apache.hadoop.hbase.security.User;
 import org.apache.hadoop.hbase.security.visibility.VisibilityLabelsCache;
 import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;
 import org.apache.hadoop.hbase.util.FSTableDescriptors;
 import org.apache.hadoop.hbase.util.FSUtils;
 import org.apache.hadoop.hbase.util.JVMClusterUtil;
@@ -1459,6 +1460,7 @@ public class HBaseTestingUtility extends HBaseCommonTestingUtility {
         .setMaxVersions(numVersions);
       desc.addFamily(hcd);
     }
+    waitUntilTableNamespaceManagerStarted();
     getHBaseAdmin().createTable(desc, startKey, endKey, numRegions);
     // HBaseAdmin only waits for regions to appear in hbase:meta we should wait until they are assigned
     waitUntilAllRegionsAssigned(tableName);
@@ -1497,6 +1499,7 @@ public class HBaseTestingUtility extends HBaseCommonTestingUtility {
       hcd.setBloomFilterType(BloomType.NONE);
       htd.addFamily(hcd);
     }
+    waitUntilTableNamespaceManagerStarted();
     getHBaseAdmin().createTable(htd, splitKeys);
     // HBaseAdmin only waits for regions to appear in hbase:meta we should wait until they are
     // assigned
{noformat}

Could we do this once in {{MiniHBaseCluster startMiniHBaseCluster(int, int, Class, Class, boolean, boolean)}} instead of having it littered across HBaseTestingUtility?

Nice test additions!

> Starting namespace and quota services in master startup asynchronizely
> ----------------------------------------------------------------------
>
>                 Key: HBASE-16488
>                 URL: https://issues.apache.org/jira/browse/HBASE-16488
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 2.0.0, 1.3.0, 1.0.3, 1.4.0, 1.1.5, 1.2.2
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>         Attachments: HBASE-16488.v1-branch-1.patch,
> HBASE-16488.v1-master.patch, HBASE-16488.v2-branch-1.patch,
> HBASE-16488.v2-branch-1.patch, HBASE-16488.v3-branch-1.patch,
> HBASE-16488.v3-branch-1.patch, HBASE-16488.v4-branch-1.patch,
> HBASE-16488.v5-branch-1.patch, HBASE-16488.v6-branch-1.patch
>
>
> From time to time, during internal IT tests and from customers, we see
> master initialization fail because the namespace table region takes a long time to
> assign (e.g. sometimes log splitting takes a long time or hangs; sometimes an RS
> is temporarily unavailable; sometimes there is some unknown assignment
> issue). In the past, there were some proposals to improve this situation, e.g.
> HBASE-13556 / HBASE-14190 (Assign system tables ahead of user region
> assignment) or HBASE-13557 (Special WAL handling for system tables) or
> HBASE-14623 (Implement dedicated WAL for system tables).
> This JIRA proposes another way to solve this master initialization failure
> issue: namespace service is only used by a handful of operations (e.g. create
> table / namespace DDL / get namespace API / some RS group DDL). Only quota
> manager depends on it and quota management is off by default.
> Therefore, namespace service is not really needed for master to be functional. So we
> could start namespace service asynchronously without blocking master startup.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)