[jira] [Commented] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up
[ https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188507#comment-13188507 ]

Hudson commented on HDFS-2795:
------------------------------

Integrated in Hadoop-Hdfs-HAbranch-build #51 (See [https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/51/])
Amend HDFS-2795. Fix PersistBlocks failure due to an NPE in isPopulatingReplQueues()

todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1232510
Files :
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java

HA: Standby NN takes a long time to recover from a dead DN starting up
----------------------------------------------------------------------

Key: HDFS-2795
URL: https://issues.apache.org/jira/browse/HDFS-2795
Project: Hadoop HDFS
Issue Type: Sub-task
Components: data-node, ha, name-node
Affects Versions: HA branch (HDFS-1623)
Reporter: Aaron T. Myers
Assignee: Todd Lipcon
Priority: Critical
Fix For: HA branch (HDFS-1623)
Attachments: hdfs-2795.txt

To reproduce:
# Start an HA cluster with a DN.
# Write several blocks to the FS with replication 1.
# Shut down the DN.
# Wait for the NNs to declare the DN dead. All blocks will be under-replicated.
# Restart the DN.

Note that upon restarting the DN, the active NN will immediately get all block locations from the initial block report (BR). The standby NN will not, and instead will slowly add block locations for a subset of the previously-missing blocks on every DN heartbeat.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
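For readers following the reproduction steps in the issue description, here is a toy model of the bookkeeping they exercise on the active NN. It is only a sketch: ReplicationToyModel and all of its methods are hypothetical names made up for illustration, not HDFS classes, and it does not model the standby NN's slow path at all. It just shows why every block counts as under-replicated once the only DN is declared dead, and why one full block report restores all locations at once.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model, not HDFS code: tracks which datanodes hold each block.
class ReplicationToyModel {
    private final int expectedReplication;
    private final Map<Long, Set<String>> locations = new HashMap<>();

    ReplicationToyModel(int expectedReplication) {
        this.expectedReplication = expectedReplication;
    }

    // Record that datanode dn holds a replica of blockId.
    void addBlock(long blockId, String dn) {
        locations.computeIfAbsent(blockId, k -> new HashSet<>()).add(dn);
    }

    // Declaring a datanode dead drops it from every block's location set.
    void markDatanodeDead(String dn) {
        for (Set<String> holders : locations.values()) {
            holders.remove(dn);
        }
    }

    // A full block report re-adds the datanode as a location for each listed block.
    void processBlockReport(String dn, Collection<Long> blockIds) {
        for (long blockId : blockIds) {
            addBlock(blockId, dn);
        }
    }

    // A block is under-replicated when it has fewer live locations than expected.
    int underReplicatedCount() {
        int count = 0;
        for (Set<String> holders : locations.values()) {
            if (holders.size() < expectedReplication) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        ReplicationToyModel nn = new ReplicationToyModel(1); // replication 1
        List<Long> blocks = List.of(1L, 2L, 3L, 4L, 5L);
        for (long b : blocks) {
            nn.addBlock(b, "dn1");                           // write several blocks
        }
        System.out.println("after write:        " + nn.underReplicatedCount());
        nn.markDatanodeDead("dn1");                          // DN declared dead
        System.out.println("after DN dead:      " + nn.underReplicatedCount());
        nn.processBlockReport("dn1", blocks);                // DN restarts, sends initial BR
        System.out.println("after block report: " + nn.underReplicatedCount());
    }
}
```

In this model the count goes from zero to the full number of blocks and back to zero after a single report, which is the behaviour the description attributes to the active NN.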
[jira] [Commented] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up
[ https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187633#comment-13187633 ]

Hudson commented on HDFS-2795:
------------------------------

Integrated in Hadoop-Hdfs-HAbranch-build #50 (See [https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/50/])
HDFS-2795. Standby NN takes a long time to recover from a dead DN starting up. Contributed by Todd Lipcon.

todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1232285
Files :
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestNodeCount.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyIsHot.java
[jira] [Commented] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up
[ https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187306#comment-13187306 ]

Aaron T. Myers commented on HDFS-2795:
--------------------------------------

{code}// Create 20 blocks.{code}
Unless I'm missing something, this test will actually create 5 blocks, not 20.

Patch looks good otherwise. +1.
[jira] [Commented] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up
[ https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187467#comment-13187467 ]

Todd Lipcon commented on HDFS-2795:
-----------------------------------

Woops, this broke one of the TestPersistBlocks tests -- when it replays append, it was getting an NPE since {{haContext}} isn't set yet during startup. Does the following look OK?

{code}
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/ap
index 5e8377e..c57d152 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
@@ -3718,7 +3718,8 @@ public class FSNamesystem implements Namesystem, FSClusterStats,
   @Override
   public boolean isPopulatingReplQueues() {
-    if (!haContext.getState().shouldPopulateReplQueues()) {
+    if (haContext != null && // null during startup!
+        !haContext.getState().shouldPopulateReplQueues()) {
       return false;
     }
     // safeMode is volatile, and may be set to null at any time
{code}
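The null guard proposed in Todd's amendment can be illustrated in isolation with the rough, self-contained sketch below. The classes (NullGuardSketch, Namesystem, HAContext, HAState) are simplified hypothetical stand-ins, not the real FSNamesystem or HA classes. Only the guard pattern comes from the patch. The safe-mode check that follows it in the real method is left out, and reading the field into a local variable is an addition of this sketch, not part of the patch.

```java
// Hypothetical, simplified stand-ins for the classes named in the diff.
class NullGuardSketch {

    interface HAState {
        boolean shouldPopulateReplQueues();
    }

    static final class HAContext {
        private final HAState state;

        HAContext(HAState state) {
            this.state = state;
        }

        HAState getState() {
            return state;
        }
    }

    static final class Namesystem {
        // Assumed to stay null until startup has wired in the HA context.
        private volatile HAContext haContext;

        void setHAContext(HAContext ctx) {
            this.haContext = ctx;
        }

        // Pre-amendment shape: dereferences haContext unconditionally.
        boolean isPopulatingReplQueuesUnguarded() {
            if (!haContext.getState().shouldPopulateReplQueues()) {
                return false;
            }
            return true;
        }

        // Amended shape: && short-circuits, so a null context skips the state check.
        boolean isPopulatingReplQueues() {
            HAContext ctx = haContext; // read the field once (sketch-only choice)
            if (ctx != null && !ctx.getState().shouldPopulateReplQueues()) {
                return false;
            }
            return true; // the real method goes on to consult safe mode; omitted here
        }
    }

    public static void main(String[] args) {
        Namesystem ns = new Namesystem();
        try {
            ns.isPopulatingReplQueuesUnguarded();
        } catch (NullPointerException e) {
            System.out.println("unguarded call threw NPE during startup");
        }
        System.out.println("guarded, no context yet: " + ns.isPopulatingReplQueues());
        ns.setHAContext(new HAContext(() -> false)); // a state that does not populate queues
        System.out.println("guarded, such a state:   " + ns.isPopulatingReplQueues());
    }
}
```

With the guard in place, a call made before the context exists falls through to the rest of the method instead of throwing, which matches the startup-replay scenario described in the comment.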
[jira] [Commented] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up
[ https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187513#comment-13187513 ]

Aaron T. Myers commented on HDFS-2795:
--------------------------------------

+1, amendment looks good to me.