[ https://issues.apache.org/jira/browse/HDFS-11609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15991386#comment-15991386 ]
Hudson commented on HDFS-11609:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11661 (See
[https://builds.apache.org/job/Hadoop-trunk-Commit/11661/])
HDFS-11609. Some blocks can be permanently lost if nodes are (kihwal: rev
07b98e7830c2214340cb7f434df674057e89df94)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/LowRedundancyBlocks.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestDecommissioningStatus.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java

> Some blocks can be permanently lost if nodes are decommissioned while dead
> --------------------------------------------------------------------------
>
>                 Key: HDFS-11609
>                 URL: https://issues.apache.org/jira/browse/HDFS-11609
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Blocker
>             Fix For: 2.7.4, 3.0.0-alpha3, 2.8.1
>
>         Attachments: HDFS-11609.branch-2.patch, HDFS-11609.trunk.patch,
> HDFS-11609_v2.branch-2.patch, HDFS-11609_v2.trunk.patch,
> HDFS-11609_v3.branch-2.7.patch, HDFS-11609_v3.branch-2.patch,
> HDFS-11609_v3.trunk.patch
>
>
> When all the nodes containing a replica of a block are decommissioned while
> they are dead, they get decommissioned right away even if there are missing
> blocks. This behavior was introduced by HDFS-7374.
> The problem starts when those decommissioned nodes are brought back online.
> The namenode no longer shows missing blocks, which creates a false sense of
> cluster health. When the decommissioned nodes are removed and reformatted,
> the block data is permanently lost. The namenode will report missing blocks
> after the heartbeat recheck interval (e.g. 10 minutes) from the moment the
> last node is taken down.
> There are multiple issues in the code. As some cause different behaviors in
> testing vs. production, it took a while to reproduce it in a unit test. I
> will present analysis and proposal soon.
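
For readers wondering where the "e.g. 10 minutes" figure in the description comes from, the arithmetic can be sketched as below. This is not code from the patch. The class and method names are hypothetical, the formula is the dead-node timeout as commonly documented for the namenode (twice the recheck interval plus ten heartbeat intervals), and the input values are assumed to be the stock defaults of dfs.namenode.heartbeat.recheck-interval (300000 ms) and dfs.heartbeat.interval (3 s). A cluster with different settings will see a different delay.

```java
public class HeartbeatExpiry {

    /**
     * Time after the last heartbeat at which a datanode is considered dead,
     * in milliseconds. Assumed formula: 2 * recheck interval + 10 heartbeats.
     *
     * @param recheckIntervalMs    recheck interval in milliseconds
     * @param heartbeatIntervalSec heartbeat interval in seconds
     */
    static long heartbeatExpireIntervalMs(long recheckIntervalMs, long heartbeatIntervalSec) {
        return 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;
    }

    public static void main(String[] args) {
        long recheckMs = 300_000L; // assumed default of dfs.namenode.heartbeat.recheck-interval
        long heartbeatSec = 3L;    // assumed default of dfs.heartbeat.interval

        long expiryMs = heartbeatExpireIntervalMs(recheckMs, heartbeatSec);

        // With the assumed defaults this prints "630000 ms", i.e. 10 minutes 30 seconds.
        System.out.println(expiryMs + " ms");
    }
}
```

Under these assumptions, missing blocks would only show up roughly ten and a half minutes after the last replica-holding node goes down, which matches the order of magnitude given in the description.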