[ https://issues.apache.org/jira/browse/HDFS-7065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kihwal Lee reassigned HDFS-7065:
--------------------------------

    Assignee: Kihwal Lee

> Pipeline close recovery race can cause block corruption
> -------------------------------------------------------
>
>                 Key: HDFS-7065
>                 URL: https://issues.apache.org/jira/browse/HDFS-7065
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.5.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>         Attachments: HDFS-7065.patch
>
>
> If multiple pipeline close recoveries are performed against the same block,
> the replica may become corrupt. Here is one case I have observed:
> The client tried to close a block, but the ACK timed out. It excluded the
> first DN and tried pipeline recovery (recoverClose). That also failed, and
> another recovery was attempted with only one DN. This took longer than usual,
> but the client eventually got an ACK and the file was closed successfully.
> Later on, the one and only replica was found to be corrupt.
> It turned out the DN was experiencing a transient slow disk I/O issue at
> that time. The first recovery was stuck until the second recovery was
> attempted 30 seconds later. After a few seconds, both threads started
> running. The second recovery finished first, and then the first recovery,
> with an older gen stamp, finished, moving the gen stamp backward.
> There is a sanity check in {{recoverCheck()}}, but since the check and the
> modification are not synchronized together, {{recoverClose()}} is not
> thread-safe.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
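
The check-then-modify race described in the issue can be illustrated with a minimal Java sketch. This is not the attached patch and not the actual DataNode code: the class {{ReplicaSketch}}, its fields, and the exact comparison are hypothetical, chosen only to show the general idea of doing the gen stamp check and the update inside one critical section so that a delayed recovery carrying an older gen stamp is rejected instead of overwriting a newer one.

```java
/**
 * Hypothetical stand-in for a replica's recovery state.
 * Assumption: names and rules here are illustrative, not the HDFS classes.
 */
final class ReplicaSketch {
    private long genStamp;      // current generation stamp of the replica
    private boolean finalized;  // whether the replica has been closed

    ReplicaSketch(long initialGenStamp) {
        this.genStamp = initialGenStamp;
    }

    synchronized long getGenStamp() {
        return genStamp;
    }

    synchronized boolean isFinalized() {
        return finalized;
    }

    /**
     * Performs the sanity check and the update while holding the same lock,
     * so no other recovery can change the gen stamp between the two steps.
     */
    synchronized void recoverClose(long newGenStamp) {
        // Check: refuse a recovery whose gen stamp is older than the current one.
        if (newGenStamp < genStamp) {
            throw new IllegalStateException(
                "Stale recovery: requested gen stamp " + newGenStamp
                + " is older than current gen stamp " + genStamp);
        }
        // Modify: still under the lock taken before the check.
        genStamp = newGenStamp;
        finalized = true;
    }
}
```

With this shape, if a second recovery (say, gen stamp 1002) completes before a stalled first recovery (gen stamp 1001) resumes, the first one fails the check rather than moving the gen stamp backward. The gen stamp values are made up for illustration.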