[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430575#comment-13430575
 ] 

Todd Lipcon commented on HDFS-3771:
-----------------------------------

Ah, OK, I think the log message thing I mentioned above was a red herring. The 
previous segment _started_ at 23963, and that's what it was logging. Not a 
problem.

Can you upload /grid/[PATH]/edits_inprogress_0000000000000023967 which may now 
be renamed with a ".corrupt" suffix of some kind? I want to make sure it is in 
fact empty and not some kind of strange corruption.
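
For anyone checking a segment like this locally, here is a minimal sketch of a first-pass inspection. It is not an HDFS tool: `describe_segment` and its `head_bytes` parameter are hypothetical names used only for illustration, and the sketch does not parse the edit log format. It only reports the file size and the leading bytes, which should be enough to tell a zero-length file from one with unexpected content.

```python
import os


def describe_segment(path, head_bytes=64):
    """Return (size_in_bytes, hex_of_leading_bytes) for an edit log file.

    Hypothetical helper, not part of Hadoop. It does not interpret the
    edit log format; it only shows whether the file is zero-length and
    what its first bytes look like.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(head_bytes)
    # bytes.hex() returns an empty string for an empty file.
    return size, head.hex()
```

Calling `describe_segment(path)` on the in-progress file (or its renamed ".corrupt" copy) returns a size of 0 and an empty hex string if the file really is empty; anything else is worth attaching to the issue.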
                
> Namenode can't restart due to corrupt edit logs, timing issue with shutdown 
> and edit log rolling
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3771
>                 URL: https://issues.apache.org/jira/browse/HDFS-3771
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.3, 2.0.0-alpha
>         Environment: QE, 20 node Federated cluster with 3 NNs and 15 DNs, 
> using Kerberos based security
>            Reporter: patrick white
>            Priority: Critical
>
> Our 0.23.3 nightly HDFS regression suite recently encountered a particularly 
> nasty issue that left the cluster's default Namenode unable to restart. This 
> was on a 20 node Federated cluster with security. The cause appears to be 
> that the NN was just starting to roll its edit log when a shutdown occurred; 
> the shutdown was intentional, to restart the cluster as part of an automated 
> test.
> The tests that were running do not appear to be the issue in themselves. The 
> cluster was just wrapping up an adminReport subset, and this failure case has 
> not reproduced so far, nor was it failing previously. It looks like a chance 
> occurrence of sending the shutdown just as the edit log roll began.
> From the NN log, the following sequence is noted:
> 1. an InvalidateBlocks operation had completed
> 2. FSNamesystem: Roll Edit Log from [Secondary Namenode IPaddr]
> 3. FSEditLog: Ending log segment 23963
> 4. FSEditLog: Starting log segment at 23967
> 5. NameNode: SHUTDOWN_MSG
> => the NN shuts down and then is restarted...
> 6. FSImageTransactionalStorageInspector: Logs beginning at txid 23967 were 
> are all in-progress
> 7. FSImageTransactionalStorageInspector: Marking log at 
> /grid/[PATH]/edits_inprogress_0000000000000023967 as corrupt since it has no 
> transactions in it.
> 8. NameNode: Exception in namenode join 
> [main]java.lang.IllegalStateException: No non-corrupt logs for txid 23967
> => NN start attempts continue to cycle trying to restart but can't, failing 
> on the same exception due to lack of non-corrupt edit logs
> If these observations are correct and the issue comes from the shutdown 
> happening while the edit logs are rolling, does the NN have an equivalent of 
> the conventional fs 'sync' blocking action that should be called, or is there 
> perhaps a timing hole?
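
The restart portion of the sequence above can be pictured with a toy model. This is only a sketch inferred from the quoted log messages, not the actual FSImageTransactionalStorageInspector code: `Segment`, `select_log_for`, and the transaction counts are hypothetical, and Python's `RuntimeError` stands in for the Java `IllegalStateException`.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one edit log segment on disk."""
    first_txid: int
    in_progress: bool
    num_txns: int
    corrupt: bool = False


def select_log_for(txid, segments):
    """Pick a usable segment starting at txid, as the log messages suggest.

    Assumption: an in-progress segment with no transactions is marked
    corrupt, and startup fails if no non-corrupt segment remains.
    """
    candidates = [s for s in segments if s.first_txid == txid]
    for seg in candidates:
        if seg.in_progress and seg.num_txns == 0:
            seg.corrupt = True
    usable = [s for s in candidates if not s.corrupt]
    if not usable:
        raise RuntimeError(f"No non-corrupt logs for txid {txid}")
    return usable[0]
```

With a finalized segment starting at 23963 and an empty in-progress segment starting at 23967, for example `[Segment(23963, False, 4), Segment(23967, True, 0)]`, calling `select_log_for(23967, ...)` raises on every attempt, which matches the restart loop described in the report.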

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
