[jira] [Commented] (HDFS-2742) HA: observed dataloss in replication stress test

Eli Collins (Commented) (JIRA) Sun, 29 Jan 2012 17:09:34 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195900#comment-13195900
 ]


Eli Collins commented on HDFS-2742:
-----------------------------------

bq. I don't entirely follow what you're getting at here... so let's open a new 
JIRA 

See my last comment in HDFS-2791. If that makes sense we can follow up in a 
separate jira, since it's not a new issue introduced in this change.

bq. I did the check for haEnabled in FSNamesystem rather than SafeModeInfo, 
since when HA is enabled it means we can avoid the volatile read of 
safeModeInfo. This is to avoid having any impact on the HA case. Is that what 
you're referring to?

Yes, I was saying you can remove the check against haEnabled; I didn't realize 
you were doing it as a performance optimization.
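
For anyone skimming the thread, here is a rough sketch of the pattern being 
discussed. It is not the code from the patch: the class SafeModeSketch, the 
method name adjustSafeModeBlockTotals, and the volatileReads counter are 
hypothetical, and I'm assuming the intent is that the non-HA path returns on 
the plain haEnabled field before ever reading the volatile safeMode reference.

```java
// Hypothetical sketch, not the actual FSNamesystem code.
class SafeModeSketch {
    /** Stand-in for the real SafeModeInfo; it only counts calls. */
    static final class SafeModeInfo {
        int adjustCalls = 0;
        void adjustBlockTotals(int deltaSafe, int deltaTotal) { adjustCalls++; }
    }

    final boolean haEnabled;          // plain final field, cheap to read
    volatile SafeModeInfo safeMode;   // may be set to null by another thread
    int volatileReads = 0;            // instrumentation for this sketch only

    SafeModeSketch(boolean haEnabled, SafeModeInfo safeMode) {
        this.haEnabled = haEnabled;
        this.safeMode = safeMode;
    }

    void adjustSafeModeBlockTotals(int deltaSafe, int deltaTotal) {
        if (!haEnabled) {
            return;                   // non-HA path: volatile field never read
        }
        volatileReads++;
        SafeModeInfo info = this.safeMode; // read the volatile once into a local
        if (info != null) {
            info.adjustBlockTotals(deltaSafe, deltaTotal);
        }
    }
}
```

Doing the check in the caller rather than inside SafeModeInfo is what lets the 
caller skip the volatile read entirely.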

bq. I changed setBlockTotal to only set shouldIncrementallyTrackBlocks to true 
when HA is enabled, and added assert haEnabled in adjustBlockTotals. Does that 
address your comment?

Yup, looks good, that's another way of asserting haEnabled if we're 
incrementally tracking blocks.
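
To spell out why the two are equivalent, a minimal sketch follows. The names 
setBlockTotal, adjustBlockTotals, shouldIncrementallyTrackBlocks and haEnabled 
come from the discussion above; the class BlockTotalSketch, the field layout 
and the method bodies are my assumptions, not the patch itself.

```java
// Hypothetical sketch of the invariant, not the actual SafeModeInfo code.
class BlockTotalSketch {
    final boolean haEnabled;
    boolean shouldIncrementallyTrackBlocks = false;
    int blockTotal = 0;
    int blockSafe = 0;

    BlockTotalSketch(boolean haEnabled) {
        this.haEnabled = haEnabled;
    }

    void setBlockTotal(int total) {
        blockTotal = total;
        // Incremental tracking is only ever switched on when HA is enabled.
        shouldIncrementallyTrackBlocks = haEnabled;
    }

    void adjustBlockTotals(int deltaSafe, int deltaTotal) {
        if (!shouldIncrementallyTrackBlocks) {
            return;                   // non-HA: totals stay as set
        }
        // Holds by construction, since setBlockTotal is the only writer of
        // the flag. Only checked when the JVM runs with -ea.
        assert haEnabled;
        blockSafe += deltaSafe;
        blockTotal += deltaTotal;
    }
}
```

Since the flag can only become true under HA, gating on the flag and asserting 
haEnabled state the same condition.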

Nit: NameNodeAdapter has a duplicate import of SafeModeInfo.  Otherwise looks 
great, +1
                
> HA: observed dataloss in replication stress test
> ------------------------------------------------
>
>                 Key: HDFS-2742
>                 URL: https://issues.apache.org/jira/browse/HDFS-2742
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node, ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>         Attachments: hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, 
> hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, log-colorized.txt
>
>
> The replication stress test case failed over the weekend since one of the 
> replicas went missing. Still diagnosing the issue, but it seems like the 
> chain of events was something like:
> - a block report was generated on one of the nodes while the block was being 
> written - thus the block report listed the block as RBW
> - when the standby replayed this queued message, it was replayed after the 
> file was marked complete. Thus it marked this replica as corrupt
> - it asked the DN holding the corrupt replica to delete it. And, I think, 
> removed it from the block map at this time.
> - That DN then did another block report before receiving the deletion. This 
> caused it to be re-added to the block map, since it was "FINALIZED" now.
> - Replication was lowered on the file, and it counted the above replica as 
> non-corrupt, and asked for the other replicas to be deleted.
> - All replicas were lost.
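
The chain of events in the description can be replayed with a toy model. This 
is only an illustration under simplifying assumptions: the class 
ReplicaLossSketch, its fields and methods, and the datanode names dn1..dn3 are 
all hypothetical, nothing here is HDFS code, and the choice of which replicas 
to drop when replication is lowered is fixed to reproduce the worst case.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the reported sequence; not HDFS code.
class ReplicaLossSketch {
    final Set<String> blockMap = new TreeSet<>();  // replicas the NN counts as good
    final Set<String> onDisk = new TreeSet<>();    // replicas actually on datanodes
    final Deque<String> pendingDeletes = new ArrayDeque<>(); // not yet delivered

    // Stale RBW report replayed after the file is complete: replica is
    // marked corrupt, dropped from the block map, and queued for deletion.
    void markCorruptAndInvalidate(String dn) {
        blockMap.remove(dn);
        pendingDeletes.add(dn);
    }

    // A later report lists the replica as FINALIZED. The flaw being modeled:
    // it is re-added even though its deletion is still pending.
    void finalizedBlockReport(String dn) {
        if (onDisk.contains(dn)) {
            blockMap.add(dn);
        }
    }

    // Lowering replication queues the excess replicas for deletion.
    // Dropping from the end of the sorted list keeps dn1 (worst case).
    void lowerReplication(int target) {
        List<String> candidates = new ArrayList<>(blockMap);
        for (int i = candidates.size() - 1; i >= target; i--) {
            blockMap.remove(candidates.get(i));
            pendingDeletes.add(candidates.get(i));
        }
    }

    // Datanodes finally receive and execute every queued deletion.
    void deliverDeletes() {
        while (!pendingDeletes.isEmpty()) {
            onDisk.remove(pendingDeletes.poll());
        }
    }

    // Replays the sequence and returns the number of replicas left on disk.
    static int run() {
        ReplicaLossSketch s = new ReplicaLossSketch();
        s.onDisk.addAll(Arrays.asList("dn1", "dn2", "dn3"));
        s.blockMap.addAll(Arrays.asList("dn1", "dn2", "dn3"));
        s.markCorruptAndInvalidate("dn1");
        s.finalizedBlockReport("dn1");
        s.lowerReplication(1);
        s.deliverDeletes();
        return s.onDisk.size();
    }
}
```

In this model run() returns 0: the one replica still counted as good is the 
same one whose deletion was already queued.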

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
