[ https://issues.apache.org/jira/browse/HDFS-9129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994706#comment-14994706 ]
Jing Zhao commented on HDFS-9129: --------------------------------- The latest patch looks good to me overall. Here're my comments: # Let's still define {{BlockManagerSafeMode#status}} as private field, and provide getter/setter if necessary. In this way we can have better control of its value. Similarly for {{blockTotal}} and {{blockSafe}}. # The following two initializations may be wrong: with the patch the safemode object is created when contructing BlockManager, before loading fsimage and editlog from disk. {code} private final long startTime = monotonicNow(); {code} {code} private long lastStatusReport = monotonicNow(); {code} # {{shouldIncrementallyTrackBlocks}} is actually determined by {{haEnabled}} thus looks like it can be declared as final and {{isSafeModeTrackingBlocks}} can be simplified. # {{BlockManagerSafeMode#setBlockTotal}} currently does two things: 1) updating threshold numbers, and 2) triggering mode check. We can separate #2 out of this method, and then {{activate}} does not need to do an unnecessary check. # {{reached}} can be renamed to {{reachedTime}} # In the old safemode semantic, once entering the extension state, NN never comes back to the normal safemode state, but can keep waiting in the extension state if the threshold is not met again. The current implementation changes this semantic. It's better to avoid this change here. {code} case EXTENSION: if (!areThresholdsMet()) { // EXTENSION -> PENDING_THRESHOLD status = BMSafeModeStatus.PENDING_THRESHOLD; } {code} # The following code can be simplified. {code} if (status == BMSafeModeStatus.OFF) { return; } if (!shouldIncrementallyTrackBlocks) { return; } {code} # In {{adjustBlockTotals}}, the {{setBlockTotal}} call should be out of the synchronized block. {code} synchronized (this) { ... blockSafe += deltaSafe; setBlockTotal(blockTotal + deltaTotal); } {code} # Not caused by this patch, but since {{doConsistencyCheck}} sometimes is not protected by any lock (e.g., {{computeDatanodeWork}}), the total number of blocks retrieved from blockManager and used by the consistency check can be inaccurate. So I think here we can replace the AssertionError to a warning log message. # Let's still name the first parameter of {{incrementSafeBlockCount}} as "storageNum". # In {{decrementSafeBlockCount}}, {{checkSafeMode}} only needs to be called when the first time the live replica number drops below the safe number. Thus {{checkSafeMode}} should be called within the if. {code} if (blockManager.countNodes(b).liveReplicas() == safeReplication - 1) { this.blockSafe--; } assert blockSafe >= 0; checkSafeMode(); {code} > Move the safemode block count into BlockManager > ----------------------------------------------- > > Key: HDFS-9129 > URL: https://issues.apache.org/jira/browse/HDFS-9129 > Project: Hadoop HDFS > Issue Type: Sub-task > Reporter: Haohui Mai > Assignee: Mingliang Liu > Attachments: HDFS-9129.000.patch, HDFS-9129.001.patch, > HDFS-9129.002.patch, HDFS-9129.003.patch, HDFS-9129.004.patch, > HDFS-9129.005.patch, HDFS-9129.006.patch, HDFS-9129.007.patch, > HDFS-9129.008.patch, HDFS-9129.009.patch, HDFS-9129.010.patch, > HDFS-9129.011.patch, HDFS-9129.012.patch, HDFS-9129.013.patch, > HDFS-9129.014.patch, HDFS-9129.015.patch, HDFS-9129.016.patch, > HDFS-9129.017.patch, HDFS-9129.018.patch, HDFS-9129.019.patch, > HDFS-9129.020.patch, HDFS-9129.021.patch > > > The {{SafeMode}} needs to track whether there are enough blocks so that the > NN can get out of the safemode. These fields can moved to the > {{BlockManager}} class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)