[ https://issues.apache.org/jira/browse/HADOOP-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694334#action_12694334 ]
Raghu Angadi commented on HADOOP-5605:
--------------------------------------

I think it is simpler to just describe what happened:

* Some datanode went down, leaving X and Y as the remaining replicas of block B (length 53384156).
* NN asked X to replicate block B to P.
* P reported B as corrupt since the CRC check failed on its side.
* NN marked B on X as corrupt.
* NN asked Y to replicate B to Q.
* Y reported the following error in its log and reported the block as corrupt:
** _Can't replicate block B because on-disk length 53384156 is shorter than NameNode recorded length 134217728_
* The mismatch happened because, when P reported the first corruption, it used a block object with the default length of 128 MB, and NN incorrectly kept that object in its neededReplication queue (see the sketch at the end of this message).

This affects only 0.20, since 0.20 started reporting length and CRC mismatches during replication attempts.

> All the replicas incorrectly got marked as corrupt.
> ---------------------------------------------------
>
>                 Key: HADOOP-5605
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5605
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.20.0
>            Reporter: Raghu Angadi
>             Fix For: 0.20.0
>
>
> NameNode does not handle {{reportBadBlocks()}} properly. As a result, when a
> DataNode reports corruption (only in the case of a block transfer between
> two datanodes), further attempts to replicate the block end up marking all
> the replicas as corrupt!
> From the implementation, it looks like NN incorrectly uses the block object
> received over RPC when queuing to the neededReplication queue, instead of
> using the internal block object.
> Will include an actual example in the next comment.
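To make the root cause concrete, here is a minimal, self-contained sketch of the pattern described above: queuing the block object received over RPC (which carries the default 128 MB length) versus first resolving it to the NameNode's internally stored block. All names below (Block, blocksMap, neededReplications, the reportBadBlock* methods) are simplified stand-ins for illustration, not the actual FSNamesystem code.

{code:java}
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

public class ReportBadBlocksSketch {

    // Simplified stand-in for a block with an id and a length.
    static class Block {
        final long blockId;
        final long numBytes; // length as known to whoever created this object
        Block(long blockId, long numBytes) {
            this.blockId = blockId;
            this.numBytes = numBytes;
        }
    }

    // Stands in for the NameNode's authoritative block map.
    static final Map<Long, Block> blocksMap = new HashMap<>();
    // Stands in for the neededReplication queue.
    static final Queue<Block> neededReplications = new LinkedList<>();

    // Buggy pattern: enqueue the RPC block object directly, so its
    // placeholder length (the default 128 MB) is what later replication
    // attempts compare against the datanode's on-disk length.
    static void reportBadBlockBuggy(Block rpcBlock) {
        neededReplications.add(rpcBlock);
    }

    // Fixed pattern: resolve the report to the internally stored block
    // first and enqueue that object, so the recorded length stays 53384156.
    static void reportBadBlockFixed(Block rpcBlock) {
        Block stored = blocksMap.get(rpcBlock.blockId);
        if (stored != null) {
            neededReplications.add(stored);
        }
    }

    public static void main(String[] args) {
        Block stored = new Block(42L, 53384156L);           // B's real length
        blocksMap.put(stored.blockId, stored);

        Block fromRpc = new Block(42L, 128L * 1024 * 1024); // default-length copy from the RPC

        reportBadBlockBuggy(fromRpc);
        System.out.println("buggy queued length: " + neededReplications.poll().numBytes); // 134217728

        reportBadBlockFixed(fromRpc);
        System.out.println("fixed queued length: " + neededReplications.poll().numBytes); // 53384156
    }
}
{code}

With the stored object queued, the later length check on Y would see 53384156 rather than 134217728, and the "shorter than NameNode recorded length" error above would not be triggered.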