[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2013-01-24 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562470#comment-13562470
 ] 

Suresh Srinivas commented on HDFS-3771:
---

Is this bug still needed? Can this be closed?

> Namenode can't restart due to corrupt edit logs, timing issue with shutdown 
> and edit log rolling
> 
>
>                 Key: HDFS-3771
>                 URL: https://issues.apache.org/jira/browse/HDFS-3771
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 0.23.3, 2.0.0-alpha
>         Environment: QE, 20 node Federated cluster with 3 NNs and 15 DNs, 
> using Kerberos based security
>            Reporter: patrick white
>            Priority: Critical
>
> Our 0.23.3 nightly HDFS regression suite recently encountered a particularly 
> nasty issue that left the cluster's default Namenode unable to restart. This 
> was on a 20 node Federated cluster with security. The cause appears to be 
> that the NN was just starting to roll its edit log when a shutdown occurred; 
> the shutdown was intentional, to restart the cluster as part of an automated 
> test.
> The tests that were running do not appear to be the issue in themselves: the 
> cluster was just wrapping up an adminReport subset, and this failure has 
> not reproduced so far, nor was it failing previously. It looks like a chance 
> occurrence of sending the shutdown just as the edit log roll began.
> From the NN log, the following sequence is noted:
> 1. an InvalidateBlocks operation had completed
> 2. FSNamesystem: Roll Edit Log from [Secondary Namenode IPaddr]
> 3. FSEditLog: Ending log segment 23963
> 4. FSEditLog: Starting log segment at 23967
> 5. NameNode: SHUTDOWN_MSG
> => the NN shuts down and then is restarted...
> 6. FSImageTransactionalStorageInspector: Logs beginning at txid 23967 were 
> are all in-progress
> 7. FSImageTransactionalStorageInspector: Marking log at 
> /grid/[PATH]/edits_inprogress_0023967 as corrupt since it has no 
> transactions in it.
> 8. NameNode: Exception in namenode join 
> [main]java.lang.IllegalStateException: No non-corrupt logs for txid 23967
> => NN start attempts continue to cycle, failing each time on the same 
> exception due to the lack of non-corrupt edit logs
> If these observations are correct and the issue comes from a shutdown 
> happening while edit logs are rolling, does the NN have an equivalent to the 
> conventional fs 'sync' blocking action that should be called, or does it 
> perhaps have a timing hole?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-11-02 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489901#comment-13489901
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3771:
--

It does look like HDFS-2824 should fix this.  Thanks.



[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-11-02 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489791#comment-13489791
 ] 

Todd Lipcon commented on HDFS-3771:
---

HDFS-2824 is what fixed this in 2.x



[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-11-02 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489758#comment-13489758
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3771:
--

> No, this should happen automatically with the current trunk code ...

Do you mean HDFS-2093?  HDFS-2093 is also in 0.23.  It seems that HDFS-2093 
does not solve the problem.



[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-11-02 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489730#comment-13489730
 ] 

Todd Lipcon commented on HDFS-3771:
---

bq. Do you mean that the operator needs to use a hex editor to check the editlog?

No, this should happen automatically with the current trunk code - it would 
validate as an empty log (no transactions) and thus would be moved aside.
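
To illustrate what "moved aside" could look like, here is a minimal Java sketch. 
It is not the NameNode's code: the class name, the ".empty" suffix and the 
bytewise size check are assumptions made for illustration, and the real 
validation looks at transactions rather than file size.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class MoveAsideEmptySegment {
    // Assumption: suffix chosen for this sketch only.
    static final String EMPTY_SUFFIX = ".empty";

    /**
     * Renames an empty in-progress segment to "<name>.empty" in the same
     * directory so that a later startup scan no longer picks it up.
     * Simplification: "empty" here means zero bytes on disk.
     */
    static Path moveAside(Path segment) throws IOException {
        if (Files.size(segment) != 0) {
            throw new IllegalArgumentException(
                segment + " is not empty; refusing to move it aside");
        }
        Path target = segment.resolveSibling(segment.getFileName() + EMPTY_SUFFIX);
        return Files.move(segment, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        // Demo on a throwaway directory, never on a real edits directory.
        Path dir = Files.createTempDirectory("edits-demo");
        Path segment = Files.createFile(dir.resolve("edits_inprogress_0023967"));
        Path moved = moveAside(segment);
        System.out.println("moved aside to " + moved.getFileName());
    }
}
```

Renaming rather than deleting keeps the file available for later inspection.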



[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-11-02 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489709#comment-13489709
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3771:
--

{quote}
> However, how to tell if it does not have transactions if the file is not 
> empty?

What do you mean by this? ...
{quote}
It is a typo.  It should read as ".. but the file is not empty".

{quote}
... If it's not empty but has a valid header and an OP_INVALID, ...
{quote}
Do you mean that the operator needs to use a hex editor to check the editlog?




[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-11-02 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489630#comment-13489630
 ] 

Todd Lipcon commented on HDFS-3771:
---

bq. However, how to tell if it does not have transactions if the file is not 
empty?

What do you mean by this? If it's not empty but has a valid header and an 
OP_INVALID, then it is the same as bytewise empty, and can be deleted. If it 
has a partial transaction in it, it should be handled the same way that any 
partial transaction is handled at the end of a file.
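
As a rough illustration of the check described above, the following Java 
sketch treats a segment as having no transactions when nothing but OP_INVALID 
bytes follow the header. The 4-byte header length and the 0xFF encoding of 
OP_INVALID are assumptions made for this sketch (the real header depends on 
the layout version), and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class EmptySegmentCheck {
    // Assumption: the header is a 4-byte layout version.
    static final int HEADER_BYTES = 4;
    // Assumption: OP_INVALID is encoded as the single byte 0xFF.
    static final byte OP_INVALID = (byte) 0xFF;

    /**
     * Returns true when the segment is bytewise empty, or when everything
     * after the header is OP_INVALID. Any other byte (for example the start
     * of a partial transaction) makes this return false.
     */
    static boolean hasNoTransactions(byte[] segment) {
        for (int i = HEADER_BYTES; i < segment.length; i++) {
            if (segment[i] != OP_INVALID) {
                return false;
            }
        }
        return true;
    }

    /** Convenience overload; reads the whole file, which is fine for a sketch. */
    static boolean hasNoTransactions(Path file) throws IOException {
        return hasNoTransactions(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java EmptySegmentCheck <edits_inprogress file>");
            return;
        }
        System.out.println(hasNoTransactions(Paths.get(args[0]))
            ? "no transactions found"
            : "segment contains data after the header");
    }
}
```

A segment for which this returns false would need the usual partial-transaction 
handling instead of deletion.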



[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-11-01 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489196#comment-13489196
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3771:
--

If there is no transaction in edit_inprogress, it can be safely deleted.  
However, how can we tell that it has no transactions if the file is not empty?



[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-08-21 Thread Bach Bui (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438898#comment-13438898
 ] 

Bach Bui commented on HDFS-3771:


I reproduced this case by simulating the described NN shutdown with an 
exit(0) right after jas.startLogSegment(segmentTxId) in 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(long, boolean).

This in effect creates an edit_inprogress file that has no transactions in 
it. The NN then fails to restart because the error handling code cannot 
handle this case.

An easy workaround is to delete the edit_inprogress file. As Todd mentioned, 
no data is lost when we do this. Am I right, Todd?

Ultimately, we need to fix the error handling code in 
org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.LogGroup.planAllInProgressRecovery()
so that it can detect this situation. It does not seem very complicated, 
as this is only a corner case. Please correct me if I am wrong.

Could someone also tell me how the NN was shut down? It seems to me this 
situation only occurs if the NN threads are killed without waiting for them 
to clean up after themselves.
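
The difference between the failing behaviour and the proposed handling can be 
shown with a toy model in Java. This is not the Hadoop code: Segment, 
strictPlan and tolerantPlan are hypothetical names, and a segment is reduced 
to its start txid and a transaction count.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class InProgressRecoveryDemo {
    /** Toy stand-in for one edits_inprogress file. */
    static final class Segment {
        final long startTxId;
        final int numTxns;

        Segment(long startTxId, int numTxns) {
            this.startTxId = startTxId;
            this.numTxns = numTxns;
        }
    }

    /** Models the reported failure: if every log is empty, startup aborts. */
    static Segment strictPlan(long txId, List<Segment> logs) {
        for (Segment s : logs) {
            if (s.numTxns > 0) {
                return s;
            }
        }
        throw new IllegalStateException("No non-corrupt logs for txid " + txId);
    }

    /** Models the corner-case handling: all-empty means nothing to replay. */
    static Optional<Segment> tolerantPlan(List<Segment> logs) {
        for (Segment s : logs) {
            if (s.numTxns > 0) {
                return Optional.of(s);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // One in-progress segment that was created but never written to.
        List<Segment> logs = Arrays.asList(new Segment(23967L, 0));
        System.out.println("tolerant plan has a segment to replay: "
            + tolerantPlan(logs).isPresent());
        try {
            strictPlan(23967L, logs);
        } catch (IllegalStateException e) {
            System.out.println("strict plan failed: " + e.getMessage());
        }
    }
}
```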





[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-08-14 Thread patrick white (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434392#comment-13434392
 ] 

patrick white commented on HDFS-3771:
-

Thanks very much Todd, appreciate the feedback and references, and the 
suggestion on using exit to try to reproduce this. 






[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-08-13 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433466#comment-13433466
 ] 

Todd Lipcon commented on HDFS-3771:
---

Hey Patrick. I think this behavior might have been fixed in 2.0.0 already -- 
the empty file should get properly ignored and the NN should start up.

Perhaps you can instigate this failure again by adding "System.exit(0)" right 
before where {{START_LOG_SEGMENT}} is logged in 
{{startLogSegmentAndWriteHeaderTxn}}. That would allow you to see what the 
right recovery steps are.

The issue seems to be described in HDFS-2093... I think the following comment 
may be relevant:
{quote}
Thus in the situation above, where the only log we have is this corrupted one, 
it will refuse to let the NN start, with a nice message explaining that the 
logs starting at this txid are corrupt with no txns. The operator can then 
double-check whether a different storage drive which possibly went missing 
might have better logs, etc, before starting NN.
{quote}

Looking at your logs, it seems like you have only one edits directory. So the 
above probably applies, and you could successfully start by removing that last 
(empty) log segment.
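
A cautious way to "remove" the segment is to move it aside rather than delete it, so it can still be inspected later. The sketch below is only an illustration under stated assumptions: the class {{SidelineEmptySegment}}, the backup directory, and the ".bak" naming are hypothetical, the NN is assumed to be stopped, and the operator is assumed to have already confirmed that the segment holds no transactions. Only the standard {{java.nio.file}} calls are real.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical operator helper: sideline an in-progress segment instead of deleting it. */
final class SidelineEmptySegment {

    /** Moves the segment into backupDir (assumed to be outside the edits directory) and returns the new path. */
    static Path moveAside(Path segment, Path backupDir) throws IOException {
        String name = segment.getFileName().toString();
        if (!name.startsWith("edits_inprogress_")) {
            throw new IllegalArgumentException("not an in-progress segment: " + name);
        }
        Files.createDirectories(backupDir);
        Path target = backupDir.resolve(name + "." + System.currentTimeMillis() + ".bak");
        // No REPLACE_EXISTING option: the move fails rather than overwrite an earlier backup.
        return Files.move(segment, target);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SidelineEmptySegment <segment-file> <backup-dir>");
            return;
        }
        System.out.println("moved to " + moveAside(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```

Keeping the file means the decision is reversible if a later look at the segment shows it was not empty after all.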

bq. The larger concern should be for data loss. Based on what happened in this 
case it appears that any pending txids would be lost, unless the edit logs 
could be manually repaired. The filesystem would be intact, only minus the 
changes from the outstanding edit events, does that sound correct?

Only "in-flight" transactions could be lost -- i.e. those that were never ACKed 
to a client. Anything that has been ACKed would have been fsynced to the log, 
and thus not lost. So, after inspecting the segment to make sure there are 
truly no transactions, you should be able to remove it and start with no data 
loss or corruption whatsoever.
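
The ordering behind that guarantee (sync first, acknowledge second) can be sketched as follows. This is a toy write-ahead log, not the NameNode's implementation: the class {{ToyEditLog}}, the method {{appendAndSync}}, and the text record format are hypothetical. The only real API relied on is {{FileChannel.force}}, which plays the role of fsync here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Toy write-ahead log (hypothetical, not HDFS code): sync first, acknowledge second. */
final class ToyEditLog implements AutoCloseable {
    private final FileChannel channel;
    private long lastSyncedTxId;

    ToyEditLog(Path file, long firstTxId) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        lastSyncedTxId = firstTxId - 1;
    }

    /** Appends one record and returns its txid only after the bytes have been forced to disk. */
    synchronized long appendAndSync(String op) throws IOException {
        long txId = lastSyncedTxId + 1;
        ByteBuffer buf = ByteBuffer.wrap((txId + " " + op + "\n").getBytes(StandardCharsets.UTF_8));
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        channel.force(true);    // analogous to fsync: flush file data and metadata
        lastSyncedTxId = txId;  // only now may the caller acknowledge the client
        return txId;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("toy_edits", ".log");
        try (ToyEditLog log = new ToyEditLog(file, 23963)) {
            long acked = log.appendAndSync("mkdir /tmp/example");
            System.out.println("safe to ACK txid " + acked);
        }
        Files.delete(file);
    }
}
```

With this ordering, a crash between the write and the acknowledgement can only affect a record the client was never told about, which is the "in-flight" case described above.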






[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-08-12 Thread patrick white (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432883#comment-13432883
 ] 

patrick white commented on HDFS-3771:
-

Hi Todd, do you mind if I ask your opinion on the potential impact of this 
issue, and possible mitigation for it? So far we've not been able to reproduce 
the issue, either in the nightly regression testing or in specific tests. That, 
combined with the regression history, would tend to indicate this is very 
unlikely to occur operationally; however, if it did occur, the effects could be 
severe. The NN restart should not be difficult once the cause is identified as 
corrupt edits; it should come up once the corrupt logs are cleared. Is that 
true, or am I missing some event-sequence tracking that would need to be 
corrected? 

The larger concern should be for data loss. Based on what happened in this case 
it appears that any pending txids would be lost, unless the edit logs could be 
manually repaired. The filesystem would be intact, only minus the changes from 
the outstanding edit events, does that sound correct?
 





[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-08-07 Thread patrick white (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430584#comment-13430584
 ] 

patrick white commented on HDFS-3771:
-

Sorry, I wish I could; I got the NN log but didn't get the edit log. I 
initially thought this was coming from a namespace corruption bug and 
redeployed to try to reproduce the issue before narrowing it down to the edit 
logs. If I can reproduce this, I'll be sure to grab the edits.






[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-08-07 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430575#comment-13430575
 ] 

Todd Lipcon commented on HDFS-3771:
---

Ah, OK, I think the log message thing I mentioned above was a red herring. The 
previous segment _started_ at 23963, and that's what it was logging. Not a 
problem.
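
The arithmetic can be checked with a small sketch. The class {{SegmentRange}} is hypothetical, and it assumes that the "Number of transactions: 4" counter in the NN log refers to the segment that was just closed; under that assumption a segment starting at 23963 ends at 23966, so a next segment starting at 23967 leaves no gap.

```java
/** Hypothetical helper: inclusive txid range of a closed edit log segment. */
final class SegmentRange {
    final long firstTxId;
    final long lastTxId;

    SegmentRange(long firstTxId, long txnCount) {
        if (txnCount < 1) {
            throw new IllegalArgumentException("a closed segment needs at least one transaction");
        }
        this.firstTxId = firstTxId;
        this.lastTxId = firstTxId + txnCount - 1;
    }

    /** The txid the following segment must start at for the log to be contiguous. */
    long nextSegmentStart() {
        return lastTxId + 1;
    }

    boolean isContiguousWith(long nextStartTxId) {
        return nextStartTxId == nextSegmentStart();
    }

    public static void main(String[] args) {
        SegmentRange previous = new SegmentRange(23963, 4);
        System.out.println(previous.firstTxId + "-" + previous.lastTxId
                + ", contiguous with 23967: " + previous.isContiguousWith(23967));
    }
}
```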

Can you upload /grid/[PATH]/edits_inprogress_0023967 which may now 
be renamed with a ".corrupt" suffix of some kind? I want to make sure it is in 
fact empty and not some kind of strange corruption.





[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-08-07 Thread patrick white (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430561#comment-13430561
 ] 

patrick white commented on HDFS-3771:
-

Hi Todd, right, the TXIDs are correct (unlike my line item numbering in the 
Description). Here's the verbatim log snippet, with the classpath removed for 
clarity and the hostnames and IP addresses generalized:

[org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor@3ecec78d]2012-08-06
 13:02:52,216 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll 
Edit Log from [Secondary NN]
[IPC Server handler 70 on 8020]2012-08-06 13:02:52,217 INFO 
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs.
[IPC Server handler 70 on 8020]2012-08-06 13:02:52,217 INFO 
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 23963
[IPC Server handler 70 on 8020]2012-08-06 13:02:52,218 INFO 
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 4 
Total time for transactions(ms): 1Number of transactions batched in Syncs: 0 
Number of syncs: 5   SyncTimes(ms): 23
[IPC Server handler 70 on 8020]2012-08-06 13:02:52,220 INFO 
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 23967
[IPC Server handler 70 on 8020]2012-08-06 13:02:52,234 INFO 
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NameNode at [HOSTNAME/IPADDR]
/
[Thread-1]2012-08-06 13:03:39,287 INFO 
org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = [HOSTNAME/IPADDR]
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.23.3.1208042202

STARTUP_MSG:   classpath = << CLASSPATH REMOVED >>

STARTUP_MSG:   build = 
http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/hadoop-common-project/hadoop-common
 -r 1368004; compiled by '[QEUSER]' on Sat Aug  4 22:15:58 PDT 2012
 /
  [main]2012-08-06 13:03:39,506 INFO 
org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from 
hadoop-metrics2.properties
  [main]2012-08-06 13:03:39,725 INFO 
org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink simon_jvm started
  [main]2012-08-06 13:03:39,753 INFO 
org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink simon_rpc started
  [main]2012-08-06 13:03:39,800 INFO 
org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink simon_dfs started
  [main]2012-08-06 13:03:39,861 INFO 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 
10 second(s).
  [main]2012-08-06 13:03:39,861 INFO 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system 
started
  [main]2012-08-06 13:03:40,342 INFO 
org.apache.hadoop.security.UserGroupInformation: Login successful for user 
hdfs/HOSTNAME@DOMAIN using keytab file [KEYTAB]
  [main]2012-08-06 13:03:40,464 INFO org.apache.hadoop.util.HostsFileReader: 
Refreshing hosts (include/exclude) list
  [main]2012-08-06 13:03:40,469 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: 
dfs.block.invalidate.limit=1000
  [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: VM type  
 = 64-bit
  [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: 2% max 
memory = 273.85625 MB
  [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: capacity 
 = 2^25 = 33554432 entries
  [main]2012-08-06 13:03:40,472 INFO org.apache.hadoop.hdfs.util.GSet: 
recommended=33554432, actual=33554432
  [main]2012-08-06 13:03:40,674 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
dfs.block.access.token.enable=true
  [main]2012-08-06 13:03:40,674 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
dfs.block.access.key.update.interval=600 min(s), 
dfs.block.access.token.lifetime=600 min(s)
  [main]2012-08-06 13:03:40,683 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: defaultReplication 
= 3
  [main]2012-08-06 13:03:40,683 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplication 
= 50
  [main]2012-08-06 13:03:40,684 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: minReplication 
= 1
  [main]2012-08-06 13:03:40,684 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
maxReplicationStreams  = 2
  [main]2012-08-06 13:03:40,684 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
shouldCheckForEnoughRacks  = true
  [main]2012-08-06 13:03:40,684 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
replicationRecheckInterval = 3000
  [main]2012-08-06 13:03:40,684 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
fsOwner=hdfs/[HOSTN

[jira] [Commented] (HDFS-3771) Namenode can't restart due to corrupt edit logs, timing issue with shutdown and edit log rolling

2012-08-07 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430432#comment-13430432
 ] 

Todd Lipcon commented on HDFS-3771:
---

The following is interesting:
{quote}
3. FSEditLog: Ending log segment 23963
4. FSEditLog: Starting log segment at 23967
{quote}
That's not a typo? I.e., is there really a gap between the end of the previous 
segment and the start of the next? Perhaps it's just an unrelated logging 
error, though.
