[jira] [Commented] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up
[ https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188507#comment-13188507 ]

Hudson commented on HDFS-2795:
--

Integrated in Hadoop-Hdfs-HAbranch-build #51 (See [https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/51/])
Amend HDFS-2795. Fix PersistBlocks failure due to an NPE in isPopulatingReplQueues()

todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1232510
Files :
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java

> HA: Standby NN takes a long time to recover from a dead DN starting up
> --
>
> Key: HDFS-2795
> URL: https://issues.apache.org/jira/browse/HDFS-2795
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: data-node, ha, name-node
> Affects Versions: HA branch (HDFS-1623)
> Reporter: Aaron T. Myers
> Assignee: Todd Lipcon
> Priority: Critical
> Fix For: HA branch (HDFS-1623)
>
> Attachments: hdfs-2795.txt
>
>
> To reproduce:
> # Start an HA cluster with a DN.
> # Write several blocks to the FS with replication 1.
> # Shut down the DN.
> # Wait for the NNs to declare the DN dead. All blocks will be under-replicated.
> # Restart the DN.
>
> Note that upon restarting the DN, the active NN will immediately get all block locations from the initial block report (BR). The standby NN will not, and instead will slowly add block locations for a subset of the previously-missing blocks on every DN heartbeat.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
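To make the reported symptom concrete, the following toy Java model compares the two recovery paths described in the issue: an NN that applies the restarted DN's initial block report in full, and one that learns only a subset of the missing locations on each heartbeat. It is a sketch, not HDFS code: `RecoveryModel`, `heartbeatsToRecover`, and the per-heartbeat batch size are hypothetical, since the issue says only that a "subset" is added per heartbeat and does not give its size.

```java
// Toy model of the HDFS-2795 symptom. Hypothetical: this is not HDFS code,
// and the per-heartbeat batch size is an assumed parameter.
final class RecoveryModel {
    private RecoveryModel() {}

    /**
     * Returns how many DN heartbeats pass before a NameNode knows the
     * locations of all blocks that went missing while the DN was dead.
     *
     * @param missingBlocks blocks left without a known location
     * @param perHeartbeat  assumed number of locations learned per heartbeat
     * @param fullReport    true if the NN applies the DN's initial block report in full
     */
    static int heartbeatsToRecover(int missingBlocks, int perHeartbeat, boolean fullReport) {
        if (missingBlocks < 0) {
            throw new IllegalArgumentException("missingBlocks must be >= 0");
        }
        if (fullReport) {
            return 0; // every location arrives with the initial block report
        }
        if (perHeartbeat <= 0) {
            throw new IllegalArgumentException("perHeartbeat must be > 0");
        }
        int known = 0;
        int heartbeats = 0;
        while (known < missingBlocks) {
            heartbeats++;
            // Learn one more batch; widen to long so the sum cannot overflow.
            known = (int) Math.min((long) missingBlocks, (long) known + perHeartbeat);
        }
        return heartbeats;
    }
}
```

For example, `heartbeatsToRecover(1000, 10, false)` returns 100, whereas `heartbeatsToRecover(1000, 10, true)` returns 0. Assuming the default 3-second DN heartbeat interval (`dfs.heartbeat.interval`), 100 heartbeats is roughly five minutes during which the standby would have an incomplete picture of block locations.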
[jira] [Commented] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up
[ https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187633#comment-13187633 ]

Hudson commented on HDFS-2795:
--

Integrated in Hadoop-Hdfs-HAbranch-build #50 (See [https://builds.apache.org/job/Hadoop-Hdfs-HAbranch-build/50/])
HDFS-2795. Standby NN takes a long time to recover from a dead DN starting up. Contributed by Todd Lipcon.

todd : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1232285
Files :
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/CHANGES.HDFS-1623.txt
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManagerTestUtil.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestNodeCount.java
* /hadoop/common/branches/HDFS-1623/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyIsHot.java
[jira] [Commented] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up
[ https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187513#comment-13187513 ]

Aaron T. Myers commented on HDFS-2795:
--

+1, amendment looks good to me.
[jira] [Commented] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up
[ https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187467#comment-13187467 ]

Todd Lipcon commented on HDFS-2795:
---

Whoops, this broke one of the TestPersistBlocks tests: when it replays append, it was getting an NPE since {{haContext}} isn't set yet during startup. Does the following look OK?

{code}
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/ap
index 5e8377e..c57d152 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
@@ -3718,7 +3718,8 @@ public class FSNamesystem implements Namesystem, FSClusterStats,
 
   @Override
   public boolean isPopulatingReplQueues() {
-    if (!haContext.getState().shouldPopulateReplQueues()) {
+    if (haContext != null && // null during startup!
+        !haContext.getState().shouldPopulateReplQueues()) {
       return false;
     }
     // safeMode is volatile, and may be set to null at any time
{code}
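The amendment is a plain null guard: during startup (for example while TestPersistBlocks replays an append) `haContext` has not been assigned yet, so the HA check has to be skipped instead of dereferenced. The Java sketch below reproduces that pattern in isolation. `HAStateView`, `HAContextView`, and `NamesystemSketch` are hypothetical stand-ins and the safe-mode logic is reduced to a boolean; only the shape of the `isPopulatingReplQueues()` check follows the diff.

```java
// Hypothetical, simplified stand-ins for HAState/HAContext/FSNamesystem;
// the real classes have different names and many more members.
interface HAStateView {
    boolean shouldPopulateReplQueues();
}

interface HAContextView {
    HAStateView getState();
}

final class NamesystemSketch {
    // Stays null until startup has finished (e.g. while edits are replayed).
    private volatile HAContextView haContext;
    // Simplified stand-in for the real safe-mode bookkeeping.
    private volatile boolean inSafeMode;

    NamesystemSketch(boolean inSafeMode) {
        this.inSafeMode = inSafeMode;
    }

    void setHAContext(HAContextView haContext) {
        this.haContext = haContext;
    }

    boolean isPopulatingReplQueues() {
        HAContextView ctx = haContext; // read the field once
        if (ctx != null && // null during startup, so skip the HA check
            !ctx.getState().shouldPopulateReplQueues()) {
            return false; // the HA state says not to populate repl queues
        }
        return !inSafeMode; // simplified safe-mode check
    }
}
```

One small difference from the patch, chosen for this sketch only: the field is read once into a local, so the null check and the dereference see the same value even if another thread assigns the field in between.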
[jira] [Commented] (HDFS-2795) HA: Standby NN takes a long time to recover from a dead DN starting up
[ https://issues.apache.org/jira/browse/HDFS-2795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187306#comment-13187306 ]

Aaron T. Myers commented on HDFS-2795:
--

{code}
// Create 20 blocks.
{code}

Unless I'm missing something, this test will actually create 5 blocks, not 20.

Patch looks good otherwise. +1.
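The review remark comes down to counting blocks. Assuming the test's block count follows from the length of the file it writes and the configured block size (the thread does not show the test body, so this is an assumption), the count is a ceiling division: every block is full except possibly the last. `BlockCount` and `blocksFor` are hypothetical names used only for illustration.

```java
// Hypothetical helper; the test's actual file length and block size are not
// shown in this thread.
final class BlockCount {
    private BlockCount() {}

    /** Blocks occupied by a file of fileLen bytes: all full except possibly the last. */
    static long blocksFor(long fileLen, long blockSize) {
        if (fileLen < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("need fileLen >= 0 and blockSize > 0");
        }
        // Ceiling division written without (fileLen + blockSize - 1) to avoid overflow.
        return fileLen / blockSize + (fileLen % blockSize == 0 ? 0 : 1);
    }
}
```

For example, with a hypothetical 1024-byte block size, a file of 5 × 1024 bytes occupies 5 blocks, and reaching the 20 blocks promised by the comment would take 20 × 1024 bytes.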