[ https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433466#comment-13433466 ]

Todd Lipcon commented on HDFS-3771:
-----------------------------------

Hey Patrick. I think this behavior might have been fixed in 2.0.0 already -- 
the empty file should get properly ignored and the NN should start up.

Perhaps you can trigger this failure again by adding {{System.exit(0)}} right 
before where {{START_LOG_SEGMENT}} is logged in 
{{startLogSegmentAndWriteHeaderTxn}}. That would allow you to see what the 
right recovery steps are.
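
For reference, a rough sketch of where such a kill switch could go -- this is 
illustrative only, not the actual {{FSEditLog}} source, and the surrounding 
calls are paraphrased:

{code:java}
// Hypothetical sketch, not the real FSEditLog code. The point is to die after
// the new edits_inprogress_<txid> file has been created but before the
// OP_START_LOG_SEGMENT header txn is logged and synced, which leaves an empty
// in-progress segment behind -- the state described in this report.
synchronized void startLogSegmentAndWriteHeaderTxn(long segmentTxId)
    throws IOException {
  startLogSegment(segmentTxId);   // new in-progress segment file created here

  System.exit(0);                 // simulate the unlucky shutdown

  // ... original code that logs OP_START_LOG_SEGMENT and calls logSync() ...
}
{code}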

The issue seems to be described in HDFS-2093... I think the following comment 
may be relevant:
{quote}
Thus in the situation above, where the only log we have is this corrupted one, 
it will refuse to let the NN start, with a nice message explaining that the 
logs starting at this txid are corrupt with no txns. The operator can then 
double-check whether a different storage drive which possibly went missing 
might have better logs, etc, before starting NN.
{quote}

Looking at your logs, it seems like you have only one edits directory. So the 
above probably applies, and you could successfully start by removing that last 
(empty) log segment.

bq. The larger concern should be for data loss. Based on what happened in this 
case it appears that any pending txids would be lost, unless the edit logs 
could be manually repaired. The filesystem would be intact, only minus the 
changes from the outstanding edit events, does that sound correct?

Only "in-flight" transactions could be lost -- i.e., those that were never ACKed 
to a client. Anything that has been ACKed would have been fsynced to the log, 
and thus not lost. So, after inspecting the segment to make sure there are 
truly no transactions, you should be able to remove it and start with no data 
loss or corruption whatsoever.
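
To spell out why that holds: the NN only ACKs a mutation to the client after 
the corresponding edit has been logged and synced, so anything a client saw 
succeed is already durable in an earlier, complete segment. A schematic of 
that ordering (illustrative pseudocode, not the actual namesystem code; only 
{{logEdit()}} / {{logSync()}} correspond to real {{FSEditLog}} calls, the 
other names are made up):

{code:java}
// Schematic only -- illustrates the sync-before-ACK ordering, not real HDFS code.
void handleClientMutation(FSEditLogOp op) throws IOException {
  applyToNamespace(op);   // hypothetical: update the in-memory namespace
  editLog.logEdit(op);    // append the txn to the current edit log segment
  editLog.logSync();      // fsync: the txn is durable from this point on
  ackClient(op);          // only now does the client see the operation succeed
}
// If the NN dies anywhere before logSync() completes, the client never got an
// ACK, so the lost txn was "in-flight" -- exactly the case described above.
{code}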

                
> Namenode can't restart due to corrupt edit logs, timing issue with shutdown 
> and edit log rolling
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3771
>                 URL: https://issues.apache.org/jira/browse/HDFS-3771
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.3, 2.0.0-alpha
>         Environment: QE, 20 node Federated cluster with 3 NNs and 15 DNs, 
> using Kerberos based security
>            Reporter: patrick white
>            Priority: Critical
>
> Our 0.23.3 nightly HDFS regression suite recently encountered a particularly 
> nasty issue which resulted in the cluster's default Namenode being unable to 
> restart. This was on a 20-node Federated cluster with security. The cause 
> appears to be that the NN was just starting to roll its edit log when a 
> shutdown occurred; the shutdown was intentional, to restart the cluster as 
> part of an automated test.
> The tests that were running do not appear to be the issue in themselves; the 
> cluster was just wrapping up an adminReport subset, this failure case has not 
> reproduced so far, and it was not failing previously. It looks like a chance 
> occurrence of the shutdown being sent just as the edit log roll began.
> From the NN log, the following sequence is noted:
> 1. an InvalidateBlocks operation had completed
> 2. FSNamesystem: Roll Edit Log from [Secondary Namenode IPaddr]
> 3. FSEditLog: Ending log segment 23963
> 4. FSEditLog: Starting log segment at 23967
> 5. NameNode: SHUTDOWN_MSG
> => the NN shuts down and then is restarted...
> 6. FSImageTransactionalStorageInspector: Logs beginning at txid 23967 were 
> are all in-progress
> 7. FSImageTransactionalStorageInspector: Marking log at 
> /grid/[PATH]/edits_inprogress_0000000000000023967 as corrupt since it has no 
> transactions in it.
> 8. NameNode: Exception in namenode join 
> [main]java.lang.IllegalStateException: No non-corrupt logs for txid 23967
> => NN start attempts continue to cycle trying to restart but can't, failing 
> on the same exception due to lack of non-corrupt edit logs
> If these observations are correct and the issue stems from the shutdown 
> happening while the edit logs are rolling, does the NN have an equivalent of 
> the conventional fs 'sync' blocking action that should be called, or is 
> there perhaps a timing hole?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
