[ 
https://issues.apache.org/jira/browse/HDFS-7065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-7065:
-----------------------------
    Status: Patch Available  (was: Open)

> Pipeline close recovery race can cause block corruption
> -------------------------------------------------------
>
>                 Key: HDFS-7065
>                 URL: https://issues.apache.org/jira/browse/HDFS-7065
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.5.0
>            Reporter: Kihwal Lee
>            Priority: Critical
>         Attachments: HDFS-7065.patch
>
>
> If multiple pipeline close recoveries are performed against the same block,
> the replica may become corrupt.  Here is one case I have observed:
> The client tried to close a block, but the ACK timed out.  It excluded the
> first DN and tried pipeline recovery (recoverClose). That attempt also
> failed, and another recovery was attempted with only one DN.  This took
> longer than usual, but the client eventually got an ACK and the file was
> closed successfully.  Later on, the one and only replica was found to be
> corrupt.
> It turned out the DN was experiencing a transient slow disk I/O issue at
> that time. The first recovery was stuck until the second recovery was
> attempted 30 seconds later.  After a few seconds, both threads started
> running. The second recovery finished first, and then the first recovery,
> which carried an older gen stamp, finished, moving the gen stamp backward.
> There is a sanity check in {{recoverCheck()}}, but since the check and the
> modification are not synchronized, {{recoverClose()}} is not thread-safe.
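
To illustrate the check-then-modify problem described above, here is a minimal sketch of one way to make the two steps atomic. It is not the attached patch and not the actual DataNode code: the class ReplicaSketch, its methods, and the gen stamp values used with it are hypothetical, and IllegalStateException merely stands in for whatever error the real code would raise.

```java
// Hypothetical sketch; names are illustrative and do not come from HDFS.
final class ReplicaSketch {
    private long genStamp;

    ReplicaSketch(long initialGenStamp) {
        this.genStamp = initialGenStamp;
    }

    // The check and the update run under the same monitor, so a recovery
    // cannot pass the check with an older stamp and then apply it after a
    // newer recovery has already completed.
    synchronized void recoverClose(long newGenStamp) {
        if (newGenStamp <= genStamp) {
            throw new IllegalStateException(
                "stale recovery: gen stamp " + newGenStamp
                + " is not newer than " + genStamp);
        }
        // ... finalize the replica here, still holding the lock ...
        genStamp = newGenStamp;
    }

    synchronized long getGenStamp() {
        return genStamp;
    }
}
```

With this shape, if a recovery carrying gen stamp 1002 completes first, a delayed recovery carrying 1001 is rejected instead of moving the stamp backward. If the delayed recovery acquires the lock first, the newer one simply waits and then applies its stamp on top.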



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
