[ https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196301#comment-13196301 ]
Todd Lipcon commented on HDFS-2742:
-----------------------------------

Sanjay makes a good point above about this being less critical since HDFS-2791 was addressed. But there are still some test cases that come with this patch that fail without the bug fix. Let me write up a more thorough explanation for Sanjay of why I think this should still get done before I commit it.

> HA: observed dataloss in replication stress test
> ------------------------------------------------
>
>                 Key: HDFS-2742
>                 URL: https://issues.apache.org/jira/browse/HDFS-2742
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node, ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt,
> hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, log-colorized.txt
>
>
> The replication stress test case failed over the weekend since one of the
> replicas went missing. Still diagnosing the issue, but it seems like the
> chain of events was something like:
> - a block report was generated on one of the nodes while the block was being
> written - thus the block report listed the block as RBW
> - when the standby replayed this queued message, it was replayed after the
> file was marked complete. Thus it marked this replica as corrupt
> - it asked the DN holding the corrupt replica to delete it. And, I think,
> removed it from the block map at this time.
> - That DN then did another block report before receiving the deletion. This
> caused it to be re-added to the block map, since it was "FINALIZED" now.
> - Replication was lowered on the file, and it counted the above replica as
> non-corrupt, and asked for the other replicas to be deleted.
> - All replicas were lost.
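The suspected chain of events in the quoted description can be replayed as a toy model. This is only a sketch under assumptions: it is plain Python, not HDFS code, and every name in it (`simulate`, `process_report`, `set_replication`, `block_map`, `pending_deletes`, the datanode labels) is hypothetical. It also assumes the namenode happens to keep the replica on the affected datanode when replication is lowered, which is the worst case the report describes.

```python
def simulate():
    """Replay the suspected event chain; return the datanodes still holding a replica."""
    on_disk = {"dn1", "dn2", "dn3"}  # datanodes that physically hold the replica
    block_map = set()                # replicas the (toy) namenode counts as good
    pending_deletes = []             # deletion commands queued but not yet delivered
    file_complete = True             # the file was completed before the replay

    def process_report(dn, state):
        if state == "RBW" and file_complete:
            # Stale RBW report replayed after completion: the replica is treated
            # as corrupt, dropped from the block map, and scheduled for deletion.
            block_map.discard(dn)
            pending_deletes.append(dn)
        elif state == "FINALIZED":
            # A later report re-adds the replica; the queued deletion is not cancelled.
            block_map.add(dn)

    def set_replication(target):
        # Assumption: keep the first `target` replicas in sorted order and
        # schedule the rest for deletion as excess.
        ordered = sorted(block_map)
        for dn in ordered[target:]:
            block_map.discard(dn)
            pending_deletes.append(dn)

    process_report("dn2", "FINALIZED")
    process_report("dn3", "FINALIZED")
    process_report("dn1", "RBW")        # queued report, replayed late
    process_report("dn1", "FINALIZED")  # sent before dn1 receives its deletion
    set_replication(1)                  # dn1 is kept, dn2 and dn3 become excess

    for dn in pending_deletes:          # all queued deletions finally arrive
        on_disk.discard(dn)
    return on_disk


if __name__ == "__main__":
    print(simulate())
```

In this model the deletion queued for dn1 is never cancelled when dn1 is re-added, so lowering replication removes the two healthy replicas while the stale deletion removes the third.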