[ 
https://issues.apache.org/jira/browse/HDFS-5438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-5438:
-----------------------------

    Status: Patch Available  (was: Open)

> Flaws in block report processing can cause data loss
> ----------------------------------------------------
>
>                 Key: HDFS-5438
>                 URL: https://issues.apache.org/jira/browse/HDFS-5438
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.2.0, 0.23.9
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>         Attachments: HDFS-5438.trunk.patch
>
>
> Incremental block reports from datanodes and block commits from clients 
> are processed asynchronously. This becomes a problem when the gen stamp of a 
> block changes during write pipeline recovery.
> * If a node's incremental block report is delayed but the NN already has 
> enough replicas, a report with the old gen stamp may arrive after block 
> completion. This replica is correctly marked corrupt. But if the node 
> participated in the pipeline recovery, a new (delayed) report with the 
> correct gen stamp will follow shortly. However, this report has no effect 
> on the corrupt state of the replica.
> * If block reports are received while the block is still under construction 
> (i.e., the client's call to commit the block has not yet reached the NN), 
> they are blindly accepted regardless of the gen stamp. If a failed node 
> reports in with the old gen stamp while pipeline recovery is in progress, 
> its replica is accepted and counted as valid when the block is committed.
> Because of these two problems, correct replicas can be marked corrupt and 
> corrupt replicas can be accepted during commit. So far we have observed two 
> failure cases in production:
> * The client hangs forever while trying to close a file. All replicas are 
> marked corrupt.
> * After a file is closed successfully, reads fail. Corrupt replicas were 
> accepted during commit and valid replicas were marked corrupt afterward.
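
As an illustration of the two flaws described above, here is a minimal toy
model in Java. It is not the attached patch and does not use the real HDFS
classes; the class BlockReplicaModel and all of its methods are hypothetical
names made up for this sketch. It assumes that gen stamps only increase and
that a replica should count as valid only if its reported gen stamp matches
the block's current one, whether or not the block has been committed yet.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of NN-side replica tracking for one block (hypothetical, not HDFS code). */
final class BlockReplicaModel {
    // Gen stamp the NN currently expects for this block.
    private long expectedGenStamp;
    // Highest gen stamp each node has reported for this block.
    private final Map<String, Long> reportedGenStamps = new HashMap<>();

    BlockReplicaModel(long initialGenStamp) {
        this.expectedGenStamp = initialGenStamp;
    }

    /** Pipeline recovery gives the block a newer gen stamp. */
    void recoverPipeline(long newGenStamp) {
        if (newGenStamp <= expectedGenStamp) {
            throw new IllegalArgumentException("gen stamp must increase");
        }
        expectedGenStamp = newGenStamp;
    }

    /**
     * Records a block report. Keeping the highest stamp seen per node means a
     * delayed report with an old stamp cannot override a newer one, and a later
     * report with the correct stamp replaces an earlier stale one.
     */
    void processReport(String node, long reportedGenStamp) {
        reportedGenStamps.merge(node, reportedGenStamp, Long::max);
    }

    /** A reported replica is corrupt if its stamp differs from the expected one. */
    boolean isCorrupt(String node) {
        Long stamp = reportedGenStamps.get(node);
        return stamp != null && stamp != expectedGenStamp;
    }

    /** Replicas that would count toward commit: only those with the expected stamp. */
    int countValidReplicas() {
        int valid = 0;
        for (long stamp : reportedGenStamps.values()) {
            if (stamp == expectedGenStamp) {
                valid++;
            }
        }
        return valid;
    }
}
```

In this sketch validity is derived from the stored gen stamps at query time
instead of being fixed when a report arrives, so the outcome does not depend
on whether reports arrive before or after the commit.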



--
This message was sent by Atlassian JIRA
(v6.1#6144)
