[ https://issues.apache.org/jira/browse/HDFS-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090678#comment-14090678 ]
Hudson commented on HDFS-6772:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #638 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/638/])
HDFS-6772. Get DN storages out of blockContentsStale state faster after NN restarts. (Contributed by Ming Ma) (arp: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1616680)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/HeartbeatManager.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/metrics/FSNamesystemMBean.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/protocol/RegisterCommand.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSNamesystemMBean.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestStartup.java

> Get DN storages out of blockContentsStale state faster after NN restarts
> ------------------------------------------------------------------------
>
> Key: HDFS-6772
> URL: https://issues.apache.org/jira/browse/HDFS-6772
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Ming Ma
> Assignee: Ming Ma
> Fix For: 3.0.0, 2.6.0
>
> Attachments: HDFS-6772-2.patch, HDFS-6772-3.patch,
> HDFS-6772.patch
>
>
> Here is the non-HA scenario.
> 1. Get HDFS into a block-over-replicated situation.
> 2. Restart the NN.
> 3. From the NN's point of view, DNs will remain in the blockContentsStale==true
> state for a long time. That in turn makes postponedMisreplicatedBlocks large.
> A larger postponedMisreplicatedBlocks size increases blockreport latency.
> Given that a blockreport takes the NN global lock, this has a severe impact
> on NN performance and makes the cluster unstable.
> Why will DNs remain in the blockContentsStale==true state for a long time?
> 1. When a DN reconnects to the NN upon NN restart, the blockreport RPC could
> come in before the heartbeat RPC. That is due to how
> BPServiceActor#offerService decides when to send the blockreport and the
> heartbeat. In the case of an NN restart, the NN will ask the DN to register
> when the NN gets the first heartbeat request; the DN will then register with
> the NN, followed by the blockreport RPC; the heartbeat RPC will come after that.
> 2. So right after the first blockreport, given that heartbeatedSinceFailover
> remains false, blockContentsStale will stay true.
> {noformat}
> DatanodeStorageInfo.java
>   void receivedBlockReport() {
>     if (heartbeatedSinceFailover) {
>       blockContentsStale = false;
>     }
>     blockReportCount++;
>   }
> {noformat}
> 3. So the DN will remain in blockContentsStale==true until the next
> blockreport. For a big cluster, dfs.blockreport.intervalMsec could be set to
> some large value.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
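
The ordering problem described in the issue can be replayed with a small, self-contained model. This is only an illustrative sketch: StaleStateSketch, Storage, and replay are hypothetical names, not HDFS classes. The initial flag values and the receivedHeartbeat method are assumptions about how the NN tracks a storage after restart; only receivedBlockReport mirrors the snippet quoted in the description.

```java
// Simplified, hypothetical model of the per-storage state discussed in HDFS-6772.
// It is NOT the real DatanodeStorageInfo; only receivedBlockReport() mirrors
// the snippet quoted in the issue description.
public class StaleStateSketch {

    static class Storage {
        // Assumed state right after an NN restart: nothing heard yet, contents stale.
        boolean heartbeatedSinceFailover = false;
        boolean blockContentsStale = true;
        int blockReportCount = 0;

        // Assumption: a heartbeat from a registered DN sets this flag.
        void receivedHeartbeat() {
            heartbeatedSinceFailover = true;
        }

        // Same logic as the snippet in the issue description.
        void receivedBlockReport() {
            if (heartbeatedSinceFailover) {
                blockContentsStale = false;
            }
            blockReportCount++;
        }
    }

    // Replays a sequence of RPCs against a fresh storage:
    // "HB" is a heartbeat, "BR" is a block report.
    static Storage replay(String... rpcs) {
        Storage s = new Storage();
        for (String rpc : rpcs) {
            if (rpc.equals("HB")) {
                s.receivedHeartbeat();
            } else if (rpc.equals("BR")) {
                s.receivedBlockReport();
            } else {
                throw new IllegalArgumentException("unknown RPC: " + rpc);
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Block report arrives before the heartbeat: the storage stays stale.
        System.out.println(replay("BR", "HB").blockContentsStale);       // true
        // Heartbeat first: the first block report clears the stale flag.
        System.out.println(replay("HB", "BR").blockContentsStale);       // false
        // Problem ordering: only the second block report clears the flag.
        System.out.println(replay("BR", "HB", "BR").blockContentsStale); // false
    }
}
```

In this model the "BR, HB" ordering leaves the storage stale until another block report arrives, which matches point 3 of the description: the wait is bounded by dfs.blockreport.intervalMsec.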