[ https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967881#comment-14967881 ]
Haohui Mai edited comment on HDFS-9129 at 10/21/15 8:59 PM: ------------------------------------------------------------ Thanks for the work, [~liuml07] {code} + if (status == BMSafeModeStatus.OFF) { + return; + } + if (status == BMSafeModeStatus.OFF) { + return; + } {code} There are multiple cases like the above. They should be asserts. {code} + synchronized boolean isSafeModeTrackingBlocks() { + return namesystem.isHaEnabled() && shouldIncrementallyTrackBlocks; + } {code} The code can be simplified. {{namesystem.isHaEnabled()}} does not change in the lifecycle of the process. {code} + /** + * The transition trigger of the safe mode state machine. + * + * The state machine elaborates itself in code. Specially the start status is + * always INITIALIZED and the end status is OFF. There is no transition to + * INITIALIZED and no transition from OFF. + */ {code} It's better to document the conditions of state transition. {code} + needExtension = extension > 0 && + (blockThreshold > 0 || datanodeThreshold > 0); {code} This can be moved under the THRESHOLD statement and become a local variable. {code} + if (areThresholdsMet()) { + if (needExtension) { + // THRESHOLD -> EXTENSION + status = BMSafeModeStatus.EXTENSION; + reached = monotonicNow(); + NameNode.stateChangeLog.info("STATE* Safe mode extension entered. \n" + + getSafeModeTip()); + smmthread = new Daemon(new SafeModeMonitor()); + smmthread.start(); + } else { + // THRESHOLD -> OFF + leaveSafeMode(); + } + } + initializeReplQueuesIfNecessary(); + break; + case EXTENSION: + if (areThresholdsMet() && monotonicNow() - reached >= extension) { + // EXTENSION -> OFF + leaveSafeMode(); + } + break; {code} {{initializeReplQueuesIfNecessary()}} should be called only once. {code} + blockManager.leaveSafeMode(); + safeModeStatus = SafeModeStatus.OFF; + startSecretManagerIfNecessary(); {code} {{safeModeStatus = SafeModeStatus.OFF}} should be moved to {{BlockManagerSafeMode}}. {code} + private class SafeModeMonitor implements Runnable { + /** interval in msec for checking safe mode: {@value}. */ + private static final long RECHECK_INTERVAL = 1000; + + @Override + public void run() { + while (namesystem.isRunning()) { + try { + namesystem.writeLock(); + if (status == BMSafeModeStatus.OFF) { + // We're not in start up safe mode any more while monitoring. + // Chances are we left safe mode manually before thresholds and + // extension are reached. + break; + } + checkSafeMode(); + if (!isInSafeMode()) { + smmthread = null; + break; + } + } finally { + namesystem.writeUnlock(); + } + + try { + Thread.sleep(RECHECK_INTERVAL); + } catch (InterruptedException ignored) { + } + } + + if (!namesystem.isRunning()) { + LOG.info("NameNode is being shutdown, exit SafeModeMonitor thread"); + } + } + } {code} A cleaner approach is to put the reached timestamp into the constructor of {{SafeModeMonitor()}}. It might be good to have additional unit tests for BlockManagerSafeMode. was (Author: wheat9): Thanks for the work, [~liuml07] {code} + if (status == BMSafeModeStatus.OFF) { + return; + } + if (status == BMSafeModeStatus.OFF) { + return; + } {code} There are multiple cases like the above. They should be asserts. {code} + synchronized boolean isSafeModeTrackingBlocks() { + return namesystem.isHaEnabled() && shouldIncrementallyTrackBlocks; + } {code} The code can be simplified. {{namesystem.isHaEnabled()}} does not change in the lifecycle of the process. {code} + /** + * The transition trigger of the safe mode state machine. + * + * The state machine elaborates itself in code. Specially the start status is + * always INITIALIZED and the end status is OFF. There is no transition to + * INITIALIZED and no transition from OFF. + */ {code} It's better to document the conditions of state transition. {code} + needExtension = extension > 0 && + (blockThreshold > 0 || datanodeThreshold > 0); {code} This can be moved under the THRESHOLD statement and become a local variable. {code} + if (areThresholdsMet()) { + if (needExtension) { + // THRESHOLD -> EXTENSION + status = BMSafeModeStatus.EXTENSION; + reached = monotonicNow(); + NameNode.stateChangeLog.info("STATE* Safe mode extension entered. \n" + + getSafeModeTip()); + smmthread = new Daemon(new SafeModeMonitor()); + smmthread.start(); + } else { + // THRESHOLD -> OFF + leaveSafeMode(); + } + } + initializeReplQueuesIfNecessary(); + break; + case EXTENSION: + if (areThresholdsMet() && monotonicNow() - reached >= extension) { + // EXTENSION -> OFF + leaveSafeMode(); + } + break; {code} {{initializeReplQueuesIfNecessary()}} should be called only once. {code} + blockManager.leaveSafeMode(); + safeModeStatus = SafeModeStatus.OFF; + startSecretManagerIfNecessary(); {code} {{safeModeStatus = SafeModeStatus.OFF}} should be moved to {{BlockManagerSafeMode}}. {code} + private class SafeModeMonitor implements Runnable { + /** interval in msec for checking safe mode: {@value}. */ + private static final long RECHECK_INTERVAL = 1000; + + @Override + public void run() { + while (namesystem.isRunning()) { + try { + namesystem.writeLock(); + if (status == BMSafeModeStatus.OFF) { + // We're not in start up safe mode any more while monitoring. + // Chances are we left safe mode manually before thresholds and + // extension are reached. + break; + } + checkSafeMode(); + if (!isInSafeMode()) { + smmthread = null; + break; + } + } finally { + namesystem.writeUnlock(); + } + + try { + Thread.sleep(RECHECK_INTERVAL); + } catch (InterruptedException ignored) { + } + } + + if (!namesystem.isRunning()) { + LOG.info("NameNode is being shutdown, exit SafeModeMonitor thread"); + } + } + } {code} A cleaner approach is to put the reached timestamp into the constructor of {{SafeModeMonitor()}}, and there is no need to hold the namesystem lock anymore. > Move the safemode block count into BlockManager > ----------------------------------------------- > > Key: HDFS-9129 > URL: https://issues.apache.org/jira/browse/HDFS-9129 > Project: Hadoop HDFS > Issue Type: Sub-task > Reporter: Haohui Mai > Assignee: Mingliang Liu > Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, > HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch, > HDFS-9129.005.patch, HDFS-9129.006.patch > > > The {{SafeMode}} needs to track whether there are enough blocks so that the > NN can get out of the safemode. These fields can moved to the > {{BlockManager}} class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)