[ https://issues.apache.org/jira/browse/HDFS-7065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141939#comment-14141939 ]
Hudson commented on HDFS-7065:
------------------------------

SUCCESS: Integrated in Hadoop-Yarn-trunk #686 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/686/])
HDFS-7065. Pipeline close recovery race can cause block corruption. (kihwal: rev bf27b9ca574592ef603e126bacb9b6a37c9eb5c6)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java


> Pipeline close recovery race can cause block corruption
> -------------------------------------------------------
>
>                 Key: HDFS-7065
>                 URL: https://issues.apache.org/jira/browse/HDFS-7065
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.5.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>             Fix For: 2.6.0
>
>         Attachments: HDFS-7065.patch
>
>
> If multiple pipeline close recoveries are performed against the same block,
> the replica may become corrupt. Here is one case I have observed:
> The client tried to close a block, but the ACK timed out. It excluded the
> first DN and tried pipeline recovery (recoverClose). That also failed, and
> another recovery was attempted with only one DN. This took longer than usual,
> but the client eventually got an ACK and the file was closed successfully.
> Later on, the one and only replica was found to be corrupt.
> It turned out the DN was having a transient slow disk I/O issue at that time.
> The first recovery was stuck until the second recovery was attempted 30
> seconds later. After a few seconds, both threads started running. The
> second recovery finished first, and then the first recovery, with an older gen
> stamp, finished, turning the gen stamp backward.
> There is a sanity check in {{recoverCheck()}}, but since the check and the
> modification are not synchronized, {{recoverClose()}} is not thread-safe.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
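
The check-then-modify gap described in the issue can be illustrated with a minimal Java sketch. The classes below (UnsafeReplica, SafeReplica, GenStampRaceSketch) are hypothetical stand-ins, not the real FsDatasetImpl code, and the generation stamp check is deliberately simplified. The synchronized variant shows one common way to close such a gap; it is not necessarily what the attached patch does. The problematic interleaving is replayed by hand on a single thread so that the outcome is deterministic.

```java
// Hypothetical replica whose check and update are separate, unsynchronized steps.
class UnsafeReplica {
    long genStamp;

    UnsafeReplica(long initialGenStamp) { this.genStamp = initialGenStamp; }

    // Simplified sanity check: the new stamp must not be older than the current one.
    boolean check(long newGenStamp) { return newGenStamp >= genStamp; }

    // Update performed later, without re-validating the stamp.
    void update(long newGenStamp) { genStamp = newGenStamp; }
}

// Hypothetical replica where the check and the update share one lock.
class SafeReplica {
    private long genStamp;

    SafeReplica(long initialGenStamp) { this.genStamp = initialGenStamp; }

    synchronized long getGenStamp() { return genStamp; }

    // The check and the update run atomically, so a stale recovery is rejected.
    synchronized boolean recoverClose(long newGenStamp) {
        if (newGenStamp < genStamp) {
            return false; // older generation stamp: refuse to go backward
        }
        genStamp = newGenStamp;
        return true;
    }
}

class GenStampRaceSketch {
    // Replays the reported ordering: both recoveries pass the check,
    // the second one finishes first, then the stalled first one finishes.
    static long interleavedUnsafe() {
        UnsafeReplica replica = new UnsafeReplica(1);
        boolean firstOk = replica.check(2);   // first recovery, then stalls on slow I/O
        boolean secondOk = replica.check(3);  // second recovery starts later
        if (secondOk) replica.update(3);      // second recovery finishes first
        if (firstOk) replica.update(2);       // first recovery finishes with stale stamp
        return replica.genStamp;
    }

    // Same completion order, but each recovery checks and updates atomically.
    static long sameOrderSafe() {
        SafeReplica replica = new SafeReplica(1);
        replica.recoverClose(3);              // second recovery finishes first
        replica.recoverClose(2);              // stale first recovery is rejected
        return replica.getGenStamp();
    }

    public static void main(String[] args) {
        System.out.println("unsafe final gen stamp: " + interleavedUnsafe());
        System.out.println("safe final gen stamp:   " + sameOrderSafe());
    }
}
```

In the unsafe replay the final stamp is the older one, which matches the "gen stamp turned backward" symptom in the report; in the synchronized variant the stale recovery is refused and the newer stamp is kept.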