[ https://issues.apache.org/jira/browse/HDFS-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907902#action_12907902 ]
dhruba borthakur commented on HDFS-1382:
----------------------------------------

This seems to be a valid bug in the transaction/edits handling.

> A transient failure with edits log and a corrupted fstime together could lead to a data loss
> --------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1382
>                 URL: https://issues.apache.org/jira/browse/HDFS-1382
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>            Reporter: Thanh Do
>
> We experienced data loss caused by a double failure: a transient
> disk failure while writing the edits log, combined with a corrupted
> fstime.
>
> Here are the details:
>
> 1. The NameNode has 2 edits directories (say edit0 and edit1).
>
> 2. During an update to edit0, there is a transient disk failure,
>    making the NameNode bump the fstime, mark edit0 as stale,
>    and continue working with edit1 only.
>
> 3. The NameNode is shut down. Unluckily, the fstime in edit0
>    is then corrupted. As a result, during NameNode startup the
>    stale log in edit0 is replayed, and the updates that reached
>    only edit1 are lost.
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
> Haryadi Gunawi (hary...@eecs.berkeley.edu)
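
To make the three steps in the report easier to follow, here is a small toy simulation. It is only a sketch under stated assumptions, not NameNode code: the names EditsDir, log_edit and pick_on_startup are hypothetical, the startup rule "trust the directory with the newest fstime" is a simplification, and the corruption is assumed to make edit0's fstime read as a larger value than edit1's.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EditsDir:
    """Toy stand-in for one NameNode edits directory (hypothetical model)."""
    name: str
    fstime: int = 0  # value stored in the directory's fstime file
    edits: List[str] = field(default_factory=list)


def log_edit(active: List[EditsDir], op: str,
             failing: frozenset = frozenset()) -> List[EditsDir]:
    """Append op to every active directory whose write succeeds.

    Directories named in `failing` are dropped as stale. When that happens,
    fstime is bumped in the survivors so the stale directory looks older.
    """
    survivors = [d for d in active if d.name not in failing]
    for d in survivors:
        d.edits.append(op)
    if len(survivors) < len(active):
        for d in survivors:
            d.fstime += 1
    return survivors


def pick_on_startup(dirs: List[EditsDir]) -> EditsDir:
    """Assumed startup rule: replay the directory with the newest fstime."""
    return max(dirs, key=lambda d: d.fstime)


# Step 1: two edits directories.
edit0, edit1 = EditsDir("edit0"), EditsDir("edit1")
active = [edit0, edit1]
active = log_edit(active, "mkdir /a")

# Step 2: transient failure on edit0; the NameNode continues with edit1.
active = log_edit(active, "mkdir /b", failing=frozenset({"edit0"}))
active = log_edit(active, "mkdir /c")

# Without corruption, startup would pick edit1.
healthy_choice = pick_on_startup([edit0, edit1]).name

# Step 3 (assumption): corruption makes edit0's fstime read as a larger value.
edit0.fstime = 99
chosen = pick_on_startup([edit0, edit1])
lost = [op for op in edit1.edits if op not in chosen.edits]

print(healthy_choice, chosen.name, lost)
```

In this model the only thing separating the stale directory from the current one is a single fstime value, so one corrupted file is enough to make startup replay the shorter log.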