[jira] [Commented] (HDFS-2742) HA: observed dataloss in replication stress test

Todd Lipcon (Commented) (JIRA) Fri, 06 Jan 2012 19:28:20 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181796#comment-13181796
 ]


Todd Lipcon commented on HDFS-2742:
-----------------------------------

This is really two separate bugs.

Bug 1 (HA specific): if a block report lists an RBW replica and the SBN does 
not process that report until after the file has been closed, the SBN will 
incorrectly mark the reported replica as corrupt. I'll post a patch to fix 
this bug - just running some more tests on it.
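
For illustration only (this is not the patch mentioned above): a minimal Java 
sketch of the decision the SBN faces when a queued report is replayed late. 
All names here (DelayedReportSketch, naiveMarksCorrupt, 
genStampAwareMarksCorrupt) are hypothetical, and using the generation stamp to 
tell a stale report from a genuinely bad replica is an assumption, not 
something confirmed by this issue.

```java
// Hypothetical sketch; none of these names come from the HDFS code base.
class DelayedReportSketch {
    enum ReplicaState { RBW, FINALIZED }
    enum BlockState { UNDER_CONSTRUCTION, COMPLETE }

    // Naive rule matching the behaviour described for Bug 1: any RBW replica
    // reported for a COMPLETE block is treated as corrupt, even if the report
    // was simply generated before the file was closed.
    static boolean naiveMarksCorrupt(BlockState stored, ReplicaState reported) {
        return stored == BlockState.COMPLETE && reported == ReplicaState.RBW;
    }

    // Assumed alternative: only a generation stamp mismatch on a COMPLETE
    // block counts as corruption, whatever replica state was reported.
    static boolean genStampAwareMarksCorrupt(BlockState stored,
                                             long storedGenStamp,
                                             long reportedGenStamp) {
        if (stored != BlockState.COMPLETE) {
            return false; // block is still being written, so RBW is expected
        }
        // Same stamp: the report is probably just stale, so leave it alone.
        return reportedGenStamp != storedGenStamp;
    }

    public static void main(String[] args) {
        // Delayed report: replica was RBW when reported, block is COMPLETE now.
        System.out.println("naive rule marks corrupt: "
            + naiveMarksCorrupt(BlockState.COMPLETE, ReplicaState.RBW));
        System.out.println("gen-stamp rule marks corrupt: "
            + genStampAwareMarksCorrupt(BlockState.COMPLETE, 1001L, 1001L));
    }
}
```

The two rules disagree only on the delayed-report case, which is the case 
Bug 1 is about.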

Bug 2 (not HA specific): if a block is marked corrupt and the DN then sends a 
block report, the report clears the corrupt state of that block. I believe 
this is a regression of HDFS-900, but no unit test was included with that bug 
fix, so it's hard to know whether it's the same bug or a subtly different one. 
I believe this affects the non-HA case as well, so I'll file a JIRA for it 
against trunk.
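
To make the invariant behind Bug 2 concrete, here is a small Java sketch of 
a block map in which a later block report cannot clear a corrupt mark. It is 
a toy model under stated assumptions, not the NameNode's data structures: the 
class CorruptReplicaSketch and its methods are hypothetical, and replicas are 
identified by a plain datanode name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model; not the real HDFS block map.
class CorruptReplicaSketch {
    private final Map<Long, Set<String>> liveReplicas = new HashMap<>();
    private final Map<Long, Set<String>> corruptReplicas = new HashMap<>();

    // Record that the replica of blockId on datanode is corrupt.
    void markCorrupt(long blockId, String datanode) {
        corruptReplicas.computeIfAbsent(blockId, k -> new HashSet<>()).add(datanode);
        Set<String> live = liveReplicas.get(blockId);
        if (live != null) {
            live.remove(datanode);
        }
    }

    // Handle one replica listed in a block report. A replica already marked
    // corrupt stays corrupt: the report alone must not resurrect it.
    void processReportedReplica(long blockId, String datanode) {
        if (isCorrupt(blockId, datanode)) {
            return;
        }
        liveReplicas.computeIfAbsent(blockId, k -> new HashSet<>()).add(datanode);
    }

    boolean isCorrupt(long blockId, String datanode) {
        Set<String> corrupt = corruptReplicas.get(blockId);
        return corrupt != null && corrupt.contains(datanode);
    }

    // Number of replicas that may be counted towards the replication factor.
    int countLive(long blockId) {
        Set<String> live = liveReplicas.get(blockId);
        return live == null ? 0 : live.size();
    }
}
```

With this rule, a replica on a hypothetical "dn2" that is marked corrupt and 
then reported again is still excluded from countLive, so lowering replication 
would not treat it as a good copy. Clearing the mark once the replica has 
actually been deleted or replaced is not modelled here.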
                
> HA: observed dataloss in replication stress test
> ------------------------------------------------
>
>                 Key: HDFS-2742
>                 URL: https://issues.apache.org/jira/browse/HDFS-2742
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node, ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: log-colorized.txt
>
>
> The replication stress test case failed over the weekend since one of the 
> replicas went missing. Still diagnosing the issue, but it seems like the 
> chain of events was something like:
> - a block report was generated on one of the nodes while the block was being 
> written - thus the block report listed the block as RBW
> - when the standby replayed this queued message, it was replayed after the 
> file was marked complete. Thus it marked this replica as corrupt
> - it asked the DN holding the corrupt replica to delete it. And, I think, 
> removed it from the block map at this time.
> - That DN then did another block report before receiving the deletion. This 
> caused it to be re-added to the block map, since it was "FINALIZED" now.
> - Replication was lowered on the file, and it counted the above replica as 
> non-corrupt, and asked for the other replicas to be deleted.
> - All replicas were lost.

