[jira] [Updated] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2013-02-02 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HDFS-3771:
--

Affects Version/s: (was: 2.0.0-alpha)

This isn't needed in 2.x - perhaps the 0.23.x maintainers want to keep this 
open for 0.23.x? Otherwise feel free to close. (I removed the 2.x affects 
version)

 Namenode can't restart due to corrupt edit logs, timing issue with shutdown 
 and edit log rolling
 

 Key: HDFS-3771
 URL: https://issues.apache.org/jira/browse/HDFS-3771
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 0.23.3
 Environment: QE, 20 node Federated cluster with 3 NNs and 15 DNs, 
 using Kerberos based security
Reporter: patrick white
Priority: Critical

 Our 0.23.3 nightly HDFS regression suite recently encountered a particularly 
 nasty issue that left the cluster's default Namenode unable to restart. This 
 was on a 20 node Federated cluster with security. The cause appears to be 
 that the NN was just starting to roll its edit log when a shutdown occurred. 
 The shutdown was intentional, to restart the cluster as part of an automated 
 test.
 The tests that were running do not appear to be the issue in themselves. The 
 cluster was just wrapping up an adminReport subset, and this failure case has 
 not reproduced so far, nor was it failing previously. It looks like a chance 
 occurrence of sending the shutdown just as the edit log roll began.
 From the NN log, the following sequence is noted:
 1. an InvalidateBlocks operation had completed
 2. FSNamesystem: Roll Edit Log from [Secondary Namenode IPaddr]
 3. FSEditLog: Ending log segment 23963
 4. FSEditLog: Starting log segment at 23967
 5. NameNode: SHUTDOWN_MSG
 = the NN shuts down and then is restarted...
 6. FSImageTransactionalStorageInspector: Logs beginning at txid 23967 were 
 are all in-progress
 7. FSImageTransactionalStorageInspector: Marking log at 
 /grid/[PATH]/edits_inprogress_0023967 as corrupt since it has no 
 transactions in it.
 8. NameNode: Exception in namenode join 
 [main]java.lang.IllegalStateException: No non-corrupt logs for txid 23967
 = the NN keeps cycling through restart attempts, each one failing on the 
 same exception because there are no non-corrupt edit logs
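
 The startup behaviour in the log sequence above can be modelled with a small 
 sketch. This is not the Hadoop code: the Segment class and select_log 
 function are hypothetical names, and the only rule encoded is the one the log 
 messages state, namely that an in-progress segment with no transactions is 
 treated as corrupt and startup fails when no other log starts at that txid.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for one edit log file found at startup."""
    start_txid: int
    in_progress: bool
    num_transactions: int


def select_log(segments, needed_txid):
    """Return a usable segment starting at needed_txid, or raise.

    Simplified rule taken from the log messages: an in-progress segment
    with no transactions in it is treated as corrupt.
    """
    candidates = [s for s in segments if s.start_txid == needed_txid]
    usable = [
        s for s in candidates
        if not (s.in_progress and s.num_transactions == 0)
    ]
    if not usable:
        # Mirrors the wording of the exception reported in the NN log.
        raise RuntimeError(f"No non-corrupt logs for txid {needed_txid}")
    return usable[0]
```

 With one finalized segment starting at 23963 and one empty in-progress 
 segment starting at 23967, select_log(segments, 23967) raises, which matches 
 the restart loop described above.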
 If these observations are correct and the issue comes from the shutdown 
 happening while the edit logs are rolling, does the NN have an equivalent of 
 the conventional fs 'sync' blocking action that should be called here, or is 
 there perhaps a timing hole?
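
 For reference, the kind of blocking 'sync' the question refers to looks 
 roughly like the sketch below at the file level. It only illustrates the 
 general technique of writing, flushing, and calling fsync before returning. 
 The function name and the first_record bytes are hypothetical, and whether 
 the NN does anything comparable when it starts a segment is exactly the open 
 question.

```python
import os


def start_segment_durably(directory, start_txid, first_record):
    """Create an in-progress segment and block until its first record is on disk.

    Hypothetical helper: first_record is placeholder bytes, not the real
    edit log format.
    """
    path = os.path.join(directory, f"edits_inprogress_{start_txid:07d}")
    with open(path, "wb") as f:
        f.write(first_record)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # block until the OS reports the data as written
    return path
```

 A complete version would likely also need to sync the containing directory 
 so that the new file entry itself survives a crash.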

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-08-07 Thread Suresh Srinivas (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Srinivas updated HDFS-3771:
--

Affects Version/s: 2.0.0-alpha

 Namenode can't restart due to corrupt edit logs, timing issue with shutdown 
 and edit log rolling
 

 Key: HDFS-3771
 URL: https://issues.apache.org/jira/browse/HDFS-3771
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: name-node
Affects Versions: 0.23.3, 2.0.0-alpha
 Environment: QE, 20 node Federated cluster with 3 NNs and 15 DNs, 
 using Kerberos based security
Reporter: patrick white
Priority: Critical

