[ https://issues.apache.org/jira/browse/HDFS-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang resolved HDFS-10585.
------------------------------------
    Resolution: Duplicate

> Incorrect offset/length calculation in pipeline recovery causes block corruption
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-10585
>                 URL: https://issues.apache.org/jira/browse/HDFS-10585
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Major
>
> We found that an incorrect offset and length calculation in pipeline
> recovery can corrupt a block and, in a very unfortunate scenario, result in
> missing blocks.
> (1) A client established a pipeline and started writing data to it.
> (2) One of the datanodes in the pipeline restarted, closing the socket, and
> some written data was left unacknowledged.
> (3) The client replaced the failed datanode with a new one, initiating a
> block transfer to copy the existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block,
> including the unacknowledged data, was transferred.
> (5) The last chunk was not a full chunk (512 bytes), but the destination
> still reserved a whole chunk in its buffer and wrote the entire buffer to
> disk, so some of the data written to disk was garbage.
> (6) When the transfer was done, the destination datanode converted the
> replica from temporary to rbw, which set its visible length to the number of
> bytes on disk. That is, it treated everything that was transferred as
> acknowledged. However, the visible length of the replica (rounded up to the
> next multiple of 512) differed from that of the transfer source.
> (7) The client then truncated the block in an attempt to remove the
> unacknowledged data. However, because the visible length equaled the bytes
> on disk, the unacknowledged data was not truncated.
> (8) When new data was appended to the destination, it skipped the bytes
> already on disk. Therefore, whatever was written as garbage was not replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512 it
> did not tell the NameNode to mark the replica as corrupt, so the client
> continued to form a pipeline using the corrupt replica.
> (10) Finally, the DN that held the only healthy replica was restarted. The
> NameNode then updated the pipeline to contain only the corrupt replica.
> (11) The client continued to write to the corrupt replica, because neither
> the client nor the datanode itself knew the replica was corrupt. When the
> restarted datanodes came back, their replicas were stale, although not
> corrupt. Therefore, no replica was both good and up to date.
> The sequence of events was reconstructed from the DataNode/NameNode logs and
> my understanding of the code.
> Incidentally, we have observed the same sequence of events on two independent
> clusters.
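
Steps (4) through (8) can be illustrated with a toy model. The sketch below is
not HDFS code: ToyReplica, receive_transfer, truncate_unacked, append and
simulate are hypothetical names, the 0xff filler merely stands in for whatever
stale bytes were in the destination buffer, and the byte counts (2000 intended,
1300 written, 1000 acknowledged) are invented for illustration. Only the
512-byte chunk size and the order of events come from the report.

```python
CHUNK_SIZE = 512  # chunk size quoted in the report


def round_up_to_chunk(length, chunk_size=CHUNK_SIZE):
    """Round a length up to the next multiple of the chunk size."""
    return -(-length // chunk_size) * chunk_size


class ToyReplica:
    """Hypothetical stand-in for the destination replica (not an HDFS class)."""

    def __init__(self):
        self.on_disk = bytearray()
        self.visible_length = 0

    def receive_transfer(self, data, buggy):
        """Steps (4)-(6): store the transferred block, then convert to rbw."""
        if buggy:
            # Step (5): a whole chunk is reserved for the partial last chunk
            # and the entire buffer is written out; 0xff models the garbage.
            buf = bytearray(b"\xff" * round_up_to_chunk(len(data)))
            buf[:len(data)] = data
            self.on_disk = buf
        else:
            self.on_disk = bytearray(data)
        # Step (6): visible length is set to the number of bytes on disk.
        self.visible_length = len(self.on_disk)

    def truncate_unacked(self):
        """Step (7): drop bytes beyond the visible length; return the count."""
        removed = len(self.on_disk) - self.visible_length
        del self.on_disk[self.visible_length:]
        return removed

    def append(self, offset, data):
        """Step (8): bytes already on disk are skipped, not overwritten."""
        skip = max(0, len(self.on_disk) - offset)
        self.on_disk += data[skip:]


def simulate(buggy, total=2000, written=1300, acked=1000):
    """Return (visible length after transfer, bytes truncated, bad offsets)."""
    # Values stay in 0..250, so they can never collide with the 0xff filler.
    intended = bytes(i % 251 for i in range(total))
    dest = ToyReplica()
    dest.receive_transfer(intended[:written], buggy)
    visible = dest.visible_length
    removed = dest.truncate_unacked()
    # The client resumes from its last acknowledged offset.
    dest.append(acked, intended[acked:])
    bad = [i for i, (a, b) in enumerate(zip(dest.on_disk, intended)) if a != b]
    return visible, removed, bad


if __name__ == "__main__":
    for buggy in (True, False):
        visible, removed, bad = simulate(buggy)
        span = (bad[0], bad[-1] + 1) if bad else None
        print(f"buggy={buggy} visible={visible} truncated={removed} bad={span}")
```

With the rounding enabled, the model ends up with a visible length of 1536
instead of 1300, the truncate removes nothing, and offsets 1300 through 1535
keep the filler bytes because the append skips them. With the rounding
disabled, the final replica matches the intended data.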