[jira] [Commented] (HDFS-7065) Pipeline close recovery race can cause block corruption

Hudson (JIRA) Sat, 20 Sep 2014 04:35:08 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-7065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141939#comment-14141939
 ]


Hudson commented on HDFS-7065:
------------------------------

SUCCESS: Integrated in Hadoop-Yarn-trunk #686 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/686/])
HDFS-7065. Pipeline close recovery race can cause block corruption. (kihwal: 
rev bf27b9ca574592ef603e126bacb9b6a37c9eb5c6)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java


> Pipeline close recovery race can cause block corruption
> -------------------------------------------------------
>
>                 Key: HDFS-7065
>                 URL: https://issues.apache.org/jira/browse/HDFS-7065
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.5.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>             Fix For: 2.6.0
>
>         Attachments: HDFS-7065.patch
>
>
> If multiple pipeline close recoveries are performed against the same block,
> the replica may become corrupt.  Here is one case I have observed:
> The client tried to close a block, but the ACK timed out.  It excluded the
> first DN and tried pipeline recovery (recoverClose). That too failed, and
> another recovery was attempted with only one DN.  This took longer than usual,
> but the client eventually got an ACK and the file was closed successfully.
> Later on, the one and only replica was found to be corrupt.
> It turned out the DN was having a transient slow disk I/O issue at that time.
> The first recovery was stuck until the second recovery was attempted 30
> seconds later.  After a few seconds, both threads started running. The
> second recovery finished first, and then the first recovery, with an older gen
> stamp, finished, turning the gen stamp backward.
> There is a sanity check in {{recoverCheck()}}, but since the check and the
> modification are not synchronized, {{recoverClose()}} is not thread-safe.
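
The check-then-modify race described above can be illustrated with a minimal Java sketch. This is not the FsDatasetImpl code or the attached patch: the class ReplicaSketch, its single genStamp field, and the simplified rule "reject any gen stamp older than the current one" are assumptions made for illustration. The method names recoverCheck and recoverClose are borrowed from the description only to show where the check and the modification sit. The sketch holds one lock across both steps, so a delayed recovery carrying an older gen stamp cannot pass the check and then overwrite a newer one.

```java
/** Hypothetical stand-in for a replica; not the real HDFS classes. */
class ReplicaSketch {
    private long genStamp;

    ReplicaSketch(long initialGenStamp) {
        this.genStamp = initialGenStamp;
    }

    synchronized long getGenStamp() {
        return genStamp;
    }

    // Simplified sanity check (assumption): a recovery must not carry
    // a gen stamp older than the one the replica already has.
    private void recoverCheck(long newGenStamp) {
        if (newGenStamp < genStamp) {
            throw new IllegalStateException(
                "stale recovery: " + newGenStamp + " < " + genStamp);
        }
    }

    // The check and the modification run under the same lock, so no other
    // recovery can change genStamp between the two steps.
    synchronized void recoverClose(long newGenStamp) {
        recoverCheck(newGenStamp);
        genStamp = newGenStamp;
    }

    private static void tryRecover(ReplicaSketch replica, long newGenStamp) {
        try {
            replica.recoverClose(newGenStamp);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ReplicaSketch replica = new ReplicaSketch(1000L);
        // Two recoveries racing against the same replica, as in the report.
        Thread first = new Thread(() -> tryRecover(replica, 1001L));
        Thread second = new Thread(() -> tryRecover(replica, 1002L));
        first.start();
        second.start();
        first.join();
        second.join();
        // Whichever thread wins the lock first, the newer stamp survives.
        System.out.println("final gen stamp: " + replica.getGenStamp());
    }
}
```

If recoverClose were not synchronized, both threads could pass recoverCheck before either one writes, and the thread with the older gen stamp could write last, which is the backward move reported above.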



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
