[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250006#comment-14250006 ]

Hudson commented on HDFS-6425:
------------------------------

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1995 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1995/])
HDFS-6425. Large postponedMisreplicatedBlocks has impact on blockReport latency. Contributed by Ming Ma. (kihwal: rev b7923a356e9f111619375b94d12749d634069347)
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencingWithReplication.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencing.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

> Large postponedMisreplicatedBlocks has impact on blockReport latency
> --------------------------------------------------------------------
>
>                 Key: HDFS-6425
>                 URL: https://issues.apache.org/jira/browse/HDFS-6425
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 2.7.0
>
>         Attachments: HDFS-6425-2.patch, HDFS-6425-3.patch, HDFS-6425-Test-Case.pdf, HDFS-6425.patch
>
>
> Sometimes we have a large number of over-replicated blocks when the NN fails over. When the new active NN takes over, over-replicated blocks are put into postponedMisreplicatedBlocks until none of the DNs for that block are stale anymore.
> We have a case where the NNs flip-flop. Before postponedMisreplicatedBlocks became empty, the NN failed over again and again, so postponedMisreplicatedBlocks just kept increasing until the cluster was stable.
> In addition, a large postponedMisreplicatedBlocks could make rescanPostponedMisreplicatedBlocks slow. rescanPostponedMisreplicatedBlocks takes the write lock, so it could slow down block report processing.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
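The problem in the issue description can be illustrated with a small model. This is only a sketch, not Hadoop code: the `PostponedBlocks` class, its fields, and the `max_blocks` cap are hypothetical names chosen for illustration, and the messages here do not spell out how the committed patch limits the work. The point is that an unbounded rescan does work proportional to the queue size while holding the lock, whereas a capped rescan does a fixed amount per pass.

```python
from collections import OrderedDict


class PostponedBlocks:
    """Toy model of a postponed-misreplicated-blocks queue (not Hadoop code)."""

    def __init__(self):
        # block id -> set of DataNodes holding a replica (hypothetical layout)
        self.postponed = OrderedDict()
        self.stale_nodes = set()

    def postpone(self, block_id, nodes):
        self.postponed[block_id] = set(nodes)

    def rescan(self, max_blocks=None):
        """Re-check up to max_blocks entries and return how many were examined.

        A block leaves the queue once none of its replicas sit on a stale
        node. A block that is still blocked goes to the back of the queue,
        so the next capped pass looks at different entries.
        """
        limit = len(self.postponed)
        if max_blocks is not None:
            limit = min(limit, max_blocks)
        for _ in range(limit):
            block_id, nodes = self.postponed.popitem(last=False)
            if nodes & self.stale_nodes:
                self.postponed[block_id] = nodes  # still postponed
        return limit


queue = PostponedBlocks()
queue.stale_nodes = {"dn1"}
for block in range(5):
    queue.postpone(block, {"dn1", "dn2"})   # blocked by stale dn1
queue.postpone(5, {"dn2", "dn3"})           # no stale replica

print(queue.rescan())              # unbounded pass examines the whole queue
print(queue.rescan(max_blocks=2))  # capped pass examines at most 2 entries
```

In a real NameNode the time spent in `rescan` would be time spent under the write lock, which is why the queue size matters for block report latency.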
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249951#comment-14249951 ]

Hudson commented on HDFS-6425:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #45 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/45/])
HDFS-6425. Large postponedMisreplicatedBlocks has impact on blockReport latency. Contributed by Ming Ma. (kihwal: rev b7923a356e9f111619375b94d12749d634069347)
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencingWithReplication.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencing.java
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249908#comment-14249908 ]

Hudson commented on HDFS-6425:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #41 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/41/])
HDFS-6425. Large postponedMisreplicatedBlocks has impact on blockReport latency. Contributed by Ming Ma. (kihwal: rev b7923a356e9f111619375b94d12749d634069347)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencing.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencingWithReplication.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249898#comment-14249898 ]

Hudson commented on HDFS-6425:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #1976 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1976/])
HDFS-6425. Large postponedMisreplicatedBlocks has impact on blockReport latency. Contributed by Ming Ma. (kihwal: rev b7923a356e9f111619375b94d12749d634069347)
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencingWithReplication.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencing.java
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249743#comment-14249743 ]

Hudson commented on HDFS-6425:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #778 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/778/])
HDFS-6425. Large postponedMisreplicatedBlocks has impact on blockReport latency. Contributed by Ming Ma. (kihwal: rev b7923a356e9f111619375b94d12749d634069347)
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencingWithReplication.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencing.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249737#comment-14249737 ]

Hudson commented on HDFS-6425:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #44 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/44/])
HDFS-6425. Large postponedMisreplicatedBlocks has impact on blockReport latency. Contributed by Ming Ma. (kihwal: rev b7923a356e9f111619375b94d12749d634069347)
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencing.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencingWithReplication.java
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248465#comment-14248465 ]

Hudson commented on HDFS-6425:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #6729 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6729/])
HDFS-6425. Large postponedMisreplicatedBlocks has impact on blockReport latency. Contributed by Ming Ma. (kihwal: rev b7923a356e9f111619375b94d12749d634069347)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencing.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDNFencingWithReplication.java
* hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248463#comment-14248463 ]

Ming Ma commented on HDFS-6425:
-------------------------------

Thanks [~kihwal] and [~arpitagarwal].
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248448#comment-14248448 ]

Kihwal Lee commented on HDFS-6425:
----------------------------------

+1 I've written a similar patch for the same issue. The patch looks good. Thanks for posting the updated patch.
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247695#comment-14247695 ]

Hadoop QA commented on HDFS-6425:
---------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12687355/HDFS-6425-3.patch
  against trunk revision a095622.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 3 new or modified test files.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 javadoc. There were no new javadoc warning messages.
    +1 eclipse:eclipse. The patch built with eclipse:eclipse.
    +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9044//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9044//console

This message is automatically generated.
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246808#comment-14246808 ]

Kihwal Lee commented on HDFS-6425:
----------------------------------

Did you have a chance to analyze the cause of the large number of over-replicated blocks? It might be due to the race between completeFile and incremental block reports. If a file is closed with just min_replicas and the replication monitor runs before the rest of the incremental block reports are received, replication will be scheduled, and this will lead to over-replication.
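The race described in this comment comes down to simple counting, sketched below. This is a toy timeline under stated assumptions, not NameNode logic: the function name `close_race` and its parameters are hypothetical, and it assumes every scheduled copy completes before any excess replica is cleaned up.

```python
def close_race(target_replicas, reports_seen_by_monitor):
    """Return (copies_scheduled, excess_replicas) for one block.

    All target replicas already exist on DataNodes, but the NameNode only
    counts the ones whose incremental block report has arrived by the time
    the replication monitor runs.
    """
    copies_scheduled = max(0, target_replicas - reports_seen_by_monitor)
    # After the late reports and the new copies are all known, everything
    # beyond the target counts as over-replicated.
    total_replicas = target_replicas + copies_scheduled
    return copies_scheduled, total_replicas - target_replicas


# File closed after only 1 of 3 replicas was reported: 2 copies are
# scheduled even though the other 2 replicas already exist.
print(close_race(target_replicas=3, reports_seen_by_monitor=1))

# All reports arrive before the monitor runs: nothing is scheduled.
print(close_race(target_replicas=3, reports_seen_by_monitor=3))
```

Under these assumptions, each unreported replica at monitor time turns into one excess replica later, which is the kind of block that ends up in postponedMisreplicatedBlocks after a failover.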
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246679#comment-14246679 ]

Hadoop QA commented on HDFS-6425:
---------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12661072/HDFS-6425-2.patch
  against trunk revision fae3e86.

    -1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9037//console

This message is automatically generated.
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246672#comment-14246672 ]

Kihwal Lee commented on HDFS-6425:
----------------------------------

[~mingma] The patch looks good, but does not apply to trunk any more. Can you refresh it?
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093558#comment-14093558 ]

Hadoop QA commented on HDFS-6425:
---------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12661072/HDFS-6425-2.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 3 new or modified test files.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 javadoc. There were no new javadoc warning messages.
    +1 eclipse:eclipse. The patch built with eclipse:eclipse.
    +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
        org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7611//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7611//console

This message is automatically generated.
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089967#comment-14089967 ]

Ming Ma commented on HDFS-6425:
-------------------------------

Thanks, Arpit. This jira can address the more common NN failover scenario with lots of "content stale" storages. We try to get storages out of "content stale" as soon as possible. Here are several scenarios.
a. For a non-HA NN restart, have the DN send HB before BR right after registration.
b. For an HA setup, the NN becomes active right after it restarts. This can happen if we have to restart both NNs at the same time, due to some rare outage or some incompatible upgrade. In this case, the active NN will first go to standby, then get transitioned to active, at which point all DNs will be marked as stale again. For big clusters, most of the DN reregistrations will come in after the NN becomes active, so the fix to have DNs send HB and BR right after registration will also help.
c. For an HA setup, the NN becomes active after the NN JVM has been up for some time. The failover could happen due to a zk session timeout, or because the other NN just crashes. In this case, there is no DN reregistration, given that the new active NN has not restarted recently. We can change the NN to ask DNs to resend block reports upon failover, but that will cause cluster performance issues.
So we still have some scenarios where we might have lots of "content stale" storages. This jira tries to make the NN handle those scenarios better.
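Scenario (a) in the comment above depends on the order of heartbeat (HB) and block report (BR). The sketch below models one way that ordering could matter. It is an assumption made for illustration, not a description of the NameNode's actual code: the `Storage` class and its field names are hypothetical, and the rule that a block report clears staleness only after a heartbeat has been seen since the last failover is my reading of why the comment wants HB sent before BR.

```python
class Storage:
    """Toy model of one DataNode storage as tracked by the NameNode."""

    def __init__(self):
        # Assumed starting state right after a restart or failover.
        self.content_stale = True
        self.heartbeated_since_failover = False

    def mark_stale(self):
        """Called for every storage when a NameNode becomes active."""
        self.content_stale = True
        self.heartbeated_since_failover = False

    def heartbeat(self):
        self.heartbeated_since_failover = True

    def block_report(self):
        # Assumption: a report clears staleness only if a heartbeat
        # arrived first, so a BR sent before the HB is wasted.
        if self.heartbeated_since_failover:
            self.content_stale = False


# BR before HB: the storage stays "content stale" until the next report.
early_report = Storage()
early_report.block_report()
early_report.heartbeat()
print(early_report.content_stale)

# HB before BR: the storage leaves "content stale" immediately.
ordered = Storage()
ordered.heartbeat()
ordered.block_report()
print(ordered.content_stale)
```

In scenario (c) nothing triggers a fresh registration, so under this model every storage stays stale until its next regularly scheduled block report, which is the window in which postponed blocks accumulate.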
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089653#comment-14089653 ]

Arpit Agarwal commented on HDFS-6425:
-------------------------------------

Hi Ming, is this problem mitigated by your fix for HDFS-6772?
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085630#comment-14085630 ]

Hadoop QA commented on HDFS-6425:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12659743/HDFS-6425.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7560//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7560//console

This message is automatically generated.
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085278#comment-14085278 ]

Hadoop QA commented on HDFS-6425:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12659706/HDFS-6425-Test-Case.pdf
  against trunk revision .

    {color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7559//console

This message is automatically generated.
[jira] [Commented] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
[ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14083131#comment-14083131 ]

Ming Ma commented on HDFS-6425:
-------------------------------

A postponedMisreplicatedBlocks set with more than 1M entries can cause BR latency to spike to a couple of seconds.

1. Given that all the DNs are marked as blockContentsStale after the NN fails over, rescanPostponedMisreplicatedBlocks isn't going to find many blocks to remove until the majority of DNs have sent their blockreports.

2. Normally it is OK not to remove over-replicated blocks right away. So rescanPostponedMisreplicatedBlocks can wait until most, if not all, of the DN storages aren't marked as blockContentsStale anymore.

Ideas on how to fix it:

Rescan postponed blocks only after all/most of the DN storages aren't marked as blockContentsStale anymore. That way, postponed blocks won't impact BR until most of the DNs have sent BRs. After that, postponed blocks will be drained steadily. We can do it in a background thread instead of during the BR call.

Alternatively, [~lohit] and I also discussed using a HashMap to store postponed blocks, keyed by DN storage. That means each BR doesn't need to scan the whole set, which improves performance.

Suggestions?
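The two ideas in the comment above (gate the rescan on how many storages are still stale, and key postponed blocks by DN storage so a block report touches only its own entries) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the BlockManager code or the committed patch: the class PostponedBlockIndex, its methods, the use of String storage IDs and long block IDs, and the maxStaleFraction parameter are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch only; names and types are not taken from HDFS. */
public class PostponedBlockIndex {
    // storage ID -> postponed blocks that have a replica on that (stale) storage
    private final Map<String, Set<Long>> byStorage = new HashMap<>();
    // block ID -> stale storages the block is still waiting on
    private final Map<Long, Set<String>> waitingOn = new HashMap<>();

    /** Postpone a block until every listed stale storage has reported. */
    public void postpone(long blockId, Set<String> staleStorageIds) {
        if (staleStorageIds.isEmpty()) {
            return; // nothing to wait for, so nothing to postpone
        }
        waitingOn.computeIfAbsent(blockId, k -> new HashSet<>()).addAll(staleStorageIds);
        for (String storageId : staleStorageIds) {
            byStorage.computeIfAbsent(storageId, k -> new HashSet<>()).add(blockId);
        }
    }

    /**
     * Handle a block report from one storage. Only blocks indexed under that
     * storage are visited, not the whole postponed set. Returns the blocks
     * that are no longer waiting on any stale storage.
     */
    public Set<Long> onBlockReport(String storageId) {
        Set<Long> ready = new HashSet<>();
        Set<Long> blocks = byStorage.remove(storageId);
        if (blocks == null) {
            return ready;
        }
        for (long blockId : blocks) {
            Set<String> waiting = waitingOn.get(blockId);
            waiting.remove(storageId);
            if (waiting.isEmpty()) {
                waitingOn.remove(blockId);
                ready.add(blockId);
            }
        }
        return ready;
    }

    /** Number of blocks still postponed. */
    public int size() {
        return waitingOn.size();
    }

    /** Gate for a full rescan: run it only once few enough storages are stale. */
    public static boolean shouldRescan(int staleCount, int totalStorages,
                                       double maxStaleFraction) {
        if (totalStorages <= 0) {
            return true; // no storages registered, nothing to wait for
        }
        return (double) staleCount / totalStorages <= maxStaleFraction;
    }
}
```

With this layout, the cost of handling a report from one storage is proportional to the number of postponed blocks on that storage rather than to the size of the whole postponed set. The trade-off is the extra memory for the second map.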