[ https://issues.apache.org/jira/browse/HDFS-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz Wo (Nicholas), SZE updated HDFS-1595:
-----------------------------------------

    Description: 
Suppose a source datanode S is writing to a destination datanode D in a write pipeline. We have an implicit assumption that _if S catches an exception when it is writing to D, then D is faulty and S is fine._ As a result, DFSClient will take D out of the pipeline, reconstruct the write pipeline with the remaining datanodes and then continue writing.

However, we found a case in which the faulty machine F is in fact S, not D. In the case we found, F has a faulty network interface (or a faulty switch port) that works fine when sending out a small amount of data, say 1MB, but fails when sending out a large amount of data, say 100MB. Reading works fine for any data size.

It is even worse if F is the first datanode in the pipeline. Consider the following:
# DFSClient creates a pipeline with three datanodes. The first datanode is F.
# F catches an IOException when writing to the second datanode. Then, F reports that the second datanode has an error.
# DFSClient removes the second datanode from the pipeline and continues writing with the remaining datanode(s).
# The pipeline now has two datanodes but (2) and (3) repeat.
# Now, only F remains in the pipeline. DFSClient continues writing with one replica in F.
# The write succeeds and DFSClient is able to *close the file successfully*.
# The block is under-replicated. The NameNode schedules replication from F to some other datanode D.
# The replication fails for the same reason. D reports to the NameNode that the replica in F is corrupted.
# The NameNode marks the replica in F as corrupted.
# The block is corrupted since no replica is available.

This is a *data loss* scenario.

  was:
Suppose a source datanode S is writing to a destination datanode D in a write pipeline.
We have an implicit assumption that _if S catches an exception when it is writing to D, then D is faulty and S is fine._ As a result, DFSClient will take D out of the pipeline, reconstruct the write pipeline with the remaining datanodes and then continue writing.

However, we found a case in which the faulty machine F is in fact S, not D. In the case we found, F has a faulty network interface (or a faulty switch port) that works fine when sending out a small amount of data, say 1MB, but fails when sending out a large amount of data, say 100MB.

It is even worse if F is the first datanode in the pipeline. Consider the following:
# DFSClient creates a pipeline with three datanodes. The first datanode is F.
# F catches an IOException when writing to the second datanode. Then, F reports that the second datanode has an error.
# DFSClient removes the second datanode from the pipeline and continues writing with the remaining datanode(s).
# The pipeline now has two datanodes but (2) and (3) repeat.
# Now, only F remains in the pipeline. DFSClient continues writing with one replica in F.
# The write succeeds and DFSClient is able to *close the file successfully*.
# The block is under-replicated. The NameNode schedules replication from F to some other datanode D.
# The replication fails for the same reason. D reports to the NameNode that the replica in F is corrupted.
# The NameNode marks the replica in F as corrupted.
# The block is corrupted since no replica is available.

This is a *data loss* scenario.


Yes, reading works fine for any data size. (I have also updated the description.)
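
For illustration only, the sequence in the description can be sketched as a small Python simulation. This is not HDFS code: Datanode, can_send_large, find_blamed_node, write_block and replicate are hypothetical names, and the faulty network interface is modelled as a single boolean flag. The sketch assumes the client always removes the node that the sender blames, which is the implicit assumption described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datanode:
    name: str
    can_send_large: bool = True  # False models the faulty interface on F (assumption)


def find_blamed_node(pipeline):
    """Return the node blamed for a failed write, or None if the write succeeds.

    Encodes the implicit assumption: when a sender cannot write to its
    downstream neighbour, the downstream neighbour is reported as faulty.
    """
    for sender, receiver in zip(pipeline, pipeline[1:]):
        if not sender.can_send_large:
            return receiver
    return None


def write_block(pipeline):
    """Rebuild the pipeline after each reported failure until the write succeeds."""
    pipeline = list(pipeline)
    while (blamed := find_blamed_node(pipeline)) is not None:
        pipeline.remove(blamed)
    return pipeline


def replicate(holders, target):
    """Return the nodes holding a valid replica after one re-replication attempt."""
    valid = list(holders)
    for source in holders:
        if source.can_send_large:
            valid.append(target)  # copy succeeded, target now holds a replica
            break
        valid.remove(source)  # target reports the replica on source as corrupted
    return valid


f = Datanode("F", can_send_large=False)
dn2, dn3, d = Datanode("DN2"), Datanode("DN3"), Datanode("D")

survivors = write_block([f, dn2, dn3])
print([n.name for n in survivors])                 # ['F']
print([n.name for n in replicate(survivors, d)])   # []
```

In this model the two healthy datanodes are removed one after another, the file closes with its only replica on F, and the later replication attempt leaves no valid replica at all.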

> DFSClient may incorrectly detect datanode failure
> -------------------------------------------------
>
>                 Key: HDFS-1595
>                 URL: https://issues.apache.org/jira/browse/HDFS-1595
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node, hdfs client
>    Affects Versions: 0.20.4
>            Reporter: Tsz Wo (Nicholas), SZE
>            Priority: Critical
>         Attachments: hdfs-1595-idea.txt
>
>
> Suppose a source datanode S is writing to a destination datanode D in a write
> pipeline. We have an implicit assumption that _if S catches an exception
> when it is writing to D, then D is faulty and S is fine._ As a result,
> DFSClient will take D out of the pipeline, reconstruct the write pipeline
> with the remaining datanodes and then continue writing.
> However, we found a case in which the faulty machine F is in fact S, not D. In
> the case we found, F has a faulty network interface (or a faulty switch port)
> that works fine when sending out a small amount of data, say 1MB, but fails
> when sending out a large amount of data, say 100MB. Reading works fine for
> any data size.
> It is even worse if F is the first datanode in the pipeline. Consider the
> following:
> # DFSClient creates a pipeline with three datanodes. The first datanode is F.
> # F catches an IOException when writing to the second datanode. Then, F
> reports that the second datanode has an error.
> # DFSClient removes the second datanode from the pipeline and continues
> writing with the remaining datanode(s).
> # The pipeline now has two datanodes but (2) and (3) repeat.
> # Now, only F remains in the pipeline. DFSClient continues writing with one
> replica in F.
> # The write succeeds and DFSClient is able to *close the file successfully*.
> # The block is under-replicated. The NameNode schedules replication from F
> to some other datanode D.
> # The replication fails for the same reason. D reports to the NameNode that
> the replica in F is corrupted.
> # The NameNode marks the replica in F as corrupted.
> # The block is corrupted since no replica is available.
> This is a *data loss* scenario.