[ https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547634#comment-13547634 ]
Arun C Murthy commented on HDFS-3875:
-------------------------------------

Thanks [~sureshms]! I'm looking to wrap up 2.0.3-alpha asap.

> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
>                 Key: HDFS-3875
>                 URL: https://issues.apache.org/jira/browse/HDFS-3875
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.0.2-alpha
>            Reporter: Todd Lipcon
>            Assignee: Kihwal Lee
>            Priority: Blocker
>         Attachments: hdfs-3875.branch-0.23.no.test.patch.txt,
> hdfs-3875.branch-0.23.with.test.patch.txt, hdfs-3875.trunk.no.test.patch.txt,
> hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.patch.txt,
> hdfs-3875.trunk.patch.txt, hdfs-3875.trunk.with.test.patch.txt,
> hdfs-3875.trunk.with.test.patch.txt, hdfs-3875-wip.patch
>
>
> We saw this issue with one block in a large test cluster. The client is
> storing the data with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it
> received from the first node. We don't know if the client sent a bad
> checksum, or if it got corrupted between node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it
> threw an exception. The pipeline started up again with only one replica (the
> first node in the pipeline)
> - this replica was later determined to be corrupt by the block scanner, and
> unrecoverable since it is the only replica