[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz Wo (Nicholas), SZE updated HDFS-1595:
-----------------------------------------

    Description:

Suppose a source datanode S is writing to a destination datanode D in a write pipeline. We have an implicit assumption that _if S catches an exception when it is writing to D, then D is faulty and S is fine._ As a result, DFSClient will take D out of the pipeline, reconstruct the write pipeline with the remaining datanodes, and then continue writing.

However, we found a case where the faulty machine F is in fact S, not D. In the case we found, F has a faulty network interface (or a faulty switch port) in such a way that the interface works fine when transferring a small amount of data, say 1MB, but often fails when transferring a large amount of data, say 100MB.

It is even worse if F is the first datanode in the pipeline. Consider the following:
# DFSClient creates a pipeline with three datanodes. The first datanode is F.
# F catches an IOException when writing to the second datanode. F then reports that the second datanode has an error.
# DFSClient removes the second datanode from the pipeline and continues writing with the remaining datanode(s).
# The pipeline now has two datanodes, but (2) and (3) repeat.
# Now only F remains in the pipeline. DFSClient continues writing with one replica on F.
# The write succeeds and DFSClient is able to *close the file successfully*.
# The block is under-replicated. The NameNode schedules replication from F to some other datanode D.
# The replication fails for the same reason. D reports to the NameNode that the replica on F is corrupted.
# The NameNode marks the replica on F as corrupted.
# The block is corrupted since no replica is available.

We were able to manually divide the replicas into small files and copy them out from F without fixing the hardware. The replicas seem uncorrupted. This is a *data availability problem*.

    was:

Suppose a source datanode S is writing to a destination datanode D in a write pipeline. We have an implicit assumption that _if S catches an exception when it is writing to D, then D is faulty and S is fine._ As a result, DFSClient will take D out of the pipeline, reconstruct the write pipeline with the remaining datanodes, and then continue writing.

However, we found a case where the faulty machine F is in fact S, not D. In the case we found, F has a faulty network interface (or a faulty switch port) in such a way that the interface works fine when sending out a small amount of data, say 1MB, but fails when sending out a large amount of data, say 100MB. Reading works fine for any data size.

It is even worse if F is the first datanode in the pipeline. Consider the following:
# DFSClient creates a pipeline with three datanodes. The first datanode is F.
# F catches an IOException when writing to the second datanode. F then reports that the second datanode has an error.
# DFSClient removes the second datanode from the pipeline and continues writing with the remaining datanode(s).
# The pipeline now has two datanodes, but (2) and (3) repeat.
# Now only F remains in the pipeline. DFSClient continues writing with one replica on F.
# The write succeeds and DFSClient is able to *close the file successfully*.
# The block is under-replicated. The NameNode schedules replication from F to some other datanode D.
# The replication fails for the same reason. D reports to the NameNode that the replica on F is corrupted.
# The NameNode marks the replica on F as corrupted.
# The block is corrupted since no replica is available.

This is a *data loss* scenario.

Revised the description. Thanks Koji and Dhruba for correcting me.
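The failure cascade in the numbered steps above can be illustrated with a small simulation (a sketch with illustrative names, not actual DFSClient code): each time the faulty first node F catches a write error, the client blames and removes the node immediately downstream, so every healthy node is evicted and only F survives.

```python
# Hypothetical simulation of the misattributed-failure cascade described
# above. Names (recover_pipeline, "F", "D1", "D2") are illustrative
# assumptions, not identifiers from HDFS.

def recover_pipeline(pipeline, faulty):
    """Repeatedly apply the 'blame the downstream node' rule until the
    write stops failing. A large transfer from the faulty node fails
    whenever it still has a downstream neighbour to forward data to."""
    removed = []
    while faulty in pipeline and pipeline.index(faulty) < len(pipeline) - 1:
        # F catches an IOException writing to the next node and reports
        # that node as bad; the client removes it and retries (steps 2-4).
        blamed = pipeline[pipeline.index(faulty) + 1]
        pipeline.remove(blamed)
        removed.append(blamed)
    return pipeline, removed

pipeline, removed = recover_pipeline(["F", "D1", "D2"], faulty="F")
print(pipeline)   # ['F']          -- only the faulty node is left (step 5)
print(removed)    # ['D1', 'D2']   -- both healthy nodes were blamed
```

The sketch shows why the existing error attribution converges on the worst possible outcome: the one node guaranteed to remain is the one that caused every failure.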
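The manual workaround mentioned in the description, dividing the replicas into small pieces so that no single transfer exceeds the size at which the faulty interface starts failing, can be sketched as follows. The paths and the 1MB chunk size are assumptions for illustration, not values from the issue.

```python
# Illustrative sketch of the manual recovery: copy a replica in small
# chunks so each transfer stays below the size at which the faulty
# network interface fails. The 1MB chunk size is an assumption.

CHUNK_SIZE = 1024 * 1024  # 1MB: small transfers worked on the faulty NIC

def copy_in_chunks(src_path, dst_path, chunk_size=CHUNK_SIZE):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # In the actual recovery, each small piece crossed the
            # network as a separate transfer before being reassembled.
            dst.write(chunk)
```

Because the replicas copied out this way verified as uncorrupted, the corruption marking in step 9 reflects a transfer failure, not bad data on disk, which is what makes this an availability problem rather than true data loss.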
> DFSClient may incorrectly detect datanode failure
> -------------------------------------------------
>
>                 Key: HDFS-1595
>                 URL: https://issues.apache.org/jira/browse/HDFS-1595
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node, hdfs client
>    Affects Versions: 0.20.4
>            Reporter: Tsz Wo (Nicholas), SZE
>            Priority: Critical
>         Attachments: hdfs-1595-idea.txt