[ https://issues.apache.org/jira/browse/HDFS-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652169#comment-13652169 ]
Aaron T. Myers commented on HDFS-4799:
--------------------------------------

Great sleuthing, Todd. Test and fix look great.

+1, the patch looks good to me.

> Corrupt replica can be prematurely removed from corruptReplicas map
> -------------------------------------------------------------------
>
>                 Key: HDFS-4799
>                 URL: https://issues.apache.org/jira/browse/HDFS-4799
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.4-alpha
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-4799.txt, hdfs-4799-unittest.txt
>
>
> We saw the following sequence of events in a cluster result in losing the
> most recent genstamp of a block:
> - client is writing to a pipeline of 3
> - the pipeline had nodes fail over some period of time, such that it left 3
> old-genstamp replicas on the original three nodes, having recruited 3 new
> replicas with a later genstamp.
> -- so, we have 6 total replicas in the cluster, three with old genstamps on
> downed nodes, and 3 with the latest genstamp
> - cluster reboots, and the nodes with old genstamps blockReport first. The
> replicas are correctly added to the corrupt replicas map since they have a
> too-old genstamp
> - the nodes with the new genstamp block report. When the latest one block
> reports, chooseExcessReplicates is called and incorrectly decides to remove
> the three good replicas, leaving only the old-genstamp replicas.
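
For readers following the quoted report, the failure sequence can be illustrated with a toy model. This is only a sketch and not the NameNode code: the class `ExcessReplicaSketch`, its methods, the node names `dn1`..`dn6`, and the genstamp values are all hypothetical. In particular, the rule "delete the most recently reported replicas first" is an assumption made to keep the example short; the real chooseExcessReplicates logic is more involved. The point it shows is that excess-replica selection is only safe if the old-genstamp replicas are still recorded as corrupt at the time it runs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ExcessReplicaSketch {

    // Nodes whose reported genstamp is older than the latest one are treated as corrupt.
    public static Set<String> corruptNodes(Map<String, Long> reported, long latestGenstamp) {
        Set<String> corrupt = new HashSet<>();
        for (Map.Entry<String, Long> e : reported.entrySet()) {
            if (e.getValue() < latestGenstamp) {
                corrupt.add(e.getKey());
            }
        }
        return corrupt;
    }

    // Pick replicas to delete among those NOT listed in the corrupt map.
    // Assumed policy (hypothetical): drop the most recently reported replicas first.
    public static List<String> chooseExcess(Map<String, Long> reported,
                                            Set<String> corruptMap,
                                            int replication) {
        List<String> candidates = new ArrayList<>();
        for (String node : reported.keySet()) {
            if (!corruptMap.contains(node)) {
                candidates.add(node);
            }
        }
        int excess = Math.max(0, candidates.size() - replication);
        return new ArrayList<>(
            candidates.subList(candidates.size() - excess, candidates.size()));
    }

    // Six replicas in block-report order: three old-genstamp, then three new-genstamp.
    public static Map<String, Long> scenario() {
        Map<String, Long> reported = new LinkedHashMap<>();
        reported.put("dn1", 1L);
        reported.put("dn2", 1L);
        reported.put("dn3", 1L);
        reported.put("dn4", 2L);
        reported.put("dn5", 2L);
        reported.put("dn6", 2L);
        return reported;
    }

    public static void main(String[] args) {
        Map<String, Long> reported = scenario();
        Set<String> corrupt = corruptNodes(reported, 2L);

        // Corrupt map intact: only the three good replicas are counted, nothing is excess.
        System.out.println("intact corrupt map removes:  "
            + chooseExcess(reported, corrupt, 3));

        // Corrupt map emptied too early: all six are counted and the good ones are dropped.
        System.out.println("emptied corrupt map removes: "
            + chooseExcess(reported, new HashSet<>(), 3));
    }
}
```

With the corrupt map intact, the sketch selects nothing for removal. With the map emptied prematurely, it selects `dn4`, `dn5`, and `dn6`, which are the three latest-genstamp replicas, matching the outcome described in the report.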