[jira] [Commented] (HDFS-7611) TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS Cluster to start

2015-01-21 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286697#comment-14286697
 ] 

Konstantin Shvachko commented on HDFS-7611:
---

Byron, good job investigating and reproducing the bug.
Sounds like a serious problem. I also confirmed it on branch-2 with your test.

_So if quotas are enabled, a combination of the *deleteSnapshot* and *delete* 
operations on a file can leave orphaned blocks in the blocksMap on NameNode 
restart. They are counted as missing on the NameNode, can prevent the NameNode 
from leaving safeMode, and could cause a memory leak (at least during 
startup)._

I'll rename the jira and unlink it from HDFS-3107, as it is not related to truncate.

> TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS 
> Cluster to start
> -
>
> Key: HDFS-7611
> URL: https://issues.apache.org/jira/browse/HDFS-7611
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Konstantin Shvachko
> Assignee: Byron Wong
> Attachments: blocksNotDeletedTest.patch, testTruncateEditLogLoad.log
>
>
> I've seen it failing on Jenkins a couple of times. Somehow the cluster does 
> not become ready after NN restart.
> Not sure if it is truncate specific, as I've seen the same behaviour with other 
> tests that restart the NameNode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7611) TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS Cluster to start

2015-01-21 Thread Byron Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286458#comment-14286458
 ] 

Byron Wong commented on HDFS-7611:
--

Found it.

The problem lies in how {{FSImage$loadEdits}} orders its work.
The gist of it looks like:
{code}
private long loadEdits(...) {
  try {
    // replay the pending edits
    loadEdits();
  } finally {
    // quota counts are recomputed only after the replay has finished
    updateCountForQuota();
  }
}
{code}

In {{TestFileTruncate$testUpgradeAndRestart()}}, notice that we do:
{code}
saveNamespace();
restart();
deleteSnapshot();
{code}

Since there are no edits to load directly after the restart, we immediately call 
{{updateCountForQuota()}}, which sets the namespace count for the root 
directory from 1 to 5. Deleting the snapshot then decrements the count from 
5 to 2.

However, we also do a restart in 
{{TestFileTruncate$testTruncateEditLogLoad()}}. In this case, there is an edit 
to replay, namely the {{deleteSnapshot()}}. Replaying it decrements the namespace 
count from 1 to -1, and only afterwards does {{updateCountForQuota()}} set it back 
to 2.
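
To make the ordering issue concrete, here is a minimal toy model of the two sequences. It is not HDFS code: the class, its fields, and the inode numbers are hypothetical, and the exact intermediate value seen in a real NameNode depends on accounting that this sketch does not model. It only shows that replaying a delete against a count that has not been recomputed yet can drive the count negative, while recomputing first cannot.

```java
// Toy model only: the class, fields, and numbers are hypothetical, not HDFS code.
class QuotaReplayOrder {
    long cachedCount;   // namespace count cached on the root directory
    long actualInodes;  // what a full walk of the tree would count
    long lowestSeen;    // lowest cached value observed so far

    QuotaReplayOrder(long cachedCount, long actualInodes) {
        this.cachedCount = cachedCount;
        this.actualInodes = actualInodes;
        this.lowestSeen = cachedCount;
    }

    // Stand-in for updateCountForQuota(): recompute the cache from the tree.
    void updateCountForQuota() {
        cachedCount = actualInodes;
        lowestSeen = Math.min(lowestSeen, cachedCount);
    }

    // Stand-in for deleteSnapshot(): removes inodes and decrements the cache.
    void deleteSnapshot(long removedInodes) {
        actualInodes -= removedInodes;
        cachedCount -= removedInodes;
        lowestSeen = Math.min(lowestSeen, cachedCount);
    }

    public static void main(String[] args) {
        // Restart with nothing to replay: recompute first, delete afterwards.
        QuotaReplayOrder live = new QuotaReplayOrder(1, 5);
        live.updateCountForQuota();
        live.deleteSnapshot(3);

        // Restart with deleteSnapshot in the edit log: replay first, recompute last.
        QuotaReplayOrder replay = new QuotaReplayOrder(1, 5);
        replay.deleteSnapshot(3);
        replay.updateCountForQuota();

        System.out.println("live:   lowest=" + live.lowestSeen + " final=" + live.cachedCount);
        System.out.println("replay: lowest=" + replay.lowestSeen + " final=" + replay.cachedCount);
    }
}
```

Both orderings end with the same final count; only the replay ordering passes through a negative value on the way, which is the window in which the delete logic can misbehave.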



[jira] [Commented] (HDFS-7611) TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS Cluster to start

2015-01-20 Thread Konstantin Shvachko (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284651#comment-14284651
 ] 

Konstantin Shvachko commented on HDFS-7611:
---

I was looking at {{TestOpenFilesWithSnapshot}}, which also restarts the NameNode and 
fails intermittently with the same timeout. I see behavior similar to what Byron 
described.
The test creates two files {{/test/test/test2}} and {{/test/test/test3}}, then 
aborts the streams, creates a snapshot, deletes the files, and restarts the 
NameNode. If any of the replicas of the files were created on any of the DNs, then 
the test succeeds. If the stream is aborted before the replicas are created, 
then the test fails.
So some blocks that were deleted before the NN restart are not garbage 
collected on restart, and the NN then cannot leave safe mode.
This test does not use truncate, but does use snapshots.
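
A rough sketch of why leftover blocks can keep the NameNode in safe mode, under simplifying assumptions: the NameNode is modeled as leaving safe mode once the number of reported blocks reaches a fixed fraction of the blocks in its map. The class name, method name, the threshold value of 0.999, and the block counts are all chosen for illustration and are not taken from the test or from HDFS source.

```java
// Simplified model: names, threshold, and block counts are assumptions for illustration.
class SafeModeThreshold {
    // True when enough of the blocks known to the NameNode have been reported.
    static boolean canLeaveSafeMode(long blocksInMap, long blocksReported, double thresholdPct) {
        long needed = (long) (blocksInMap * thresholdPct);
        return blocksReported >= needed;
    }

    public static void main(String[] args) {
        double threshold = 0.999;
        // 10 live blocks, all reported, no leftovers in the map.
        System.out.println(canLeaveSafeMode(10, 10, threshold));
        // 2 leftover blocks in the map whose replicas were never written.
        System.out.println(canLeaveSafeMode(12, 10, threshold));
        // Same leftovers, but their replicas exist and get reported.
        System.out.println(canLeaveSafeMode(12, 12, threshold));
    }
}
```

In this model the outcome depends only on whether the DataNodes ever report the leftover blocks, which matches the intermittent pass/fail pattern described above.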



[jira] [Commented] (HDFS-7611) TestFileTruncate.testTruncateEditLogLoad times out waiting for Mini HDFS Cluster to start

2015-01-20 Thread Byron Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284351#comment-14284351
 ] 

Byron Wong commented on HDFS-7611:
--

This bug happens only in tests with restarts, and it happens because blocks from 
files created in previous tests are not being deleted when replaying edit logs.
1) I'm still investigating the source of this, but at some point while replaying 
edits, {{DirectoryWithSnapshotFeature$cleanDirectory}} can decrement an INode's 
namespace quota to a negative value. Either the namespace count was overcounted while 
cleaning directories or snapshotDiff, or the INode's namespace quota wasn't 
counted up properly in the first place.
2) If the INode's namespace quota happens to be -1, the blocks associated with 
that inode will not be deleted. When we call {{fsd.removeLastINode(iip)}} in 
{{FSDirDeleteOp$unprotectedDelete}}, we explicitly check whether its return 
code is -1. In that case, we skip collecting the blocks that should be deleted. 
Notice that in {{FSDirectory$removeLastINode}}, one of the possible returns is 
{{return counts.get(Quota.NAMESPACE)}}.
3) Now there are blocks in the blocksMap that shouldn't be there. This 
increases the number of blocks needed to get out of safeMode. The test failure 
depends on whether the namenode receives these blocks. If it does, then the 
namenode will exit safeMode and the test will succeed.
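
Step 2 above amounts to a sentinel collision: -1 is used to mean "nothing was removed", but it is also a value the namespace count can legitimately take once it has gone negative. The sketch below illustrates that pattern with hypothetical, heavily simplified signatures; the real {{removeLastINode}} and {{unprotectedDelete}} take different arguments, and the block count here is made up.

```java
// Illustration of the sentinel collision only; signatures are hypothetical, not HDFS code.
class RemoveSentinelCollision {
    // Stand-in for removeLastINode: -1 signals "not removed",
    // otherwise the namespace count of the removed inode is returned.
    static long removeLastINode(boolean removed, long namespaceCount) {
        if (!removed) {
            return -1;
        }
        return namespaceCount;
    }

    // Stand-in for unprotectedDelete: returns how many blocks were collected for deletion.
    static int unprotectedDelete(long namespaceCount, int blocksOfFile) {
        long result = removeLastINode(true, namespaceCount);
        if (result == -1) {
            // Treated as "nothing removed", so the blocks are never collected.
            return 0;
        }
        return blocksOfFile;
    }

    public static void main(String[] args) {
        System.out.println(unprotectedDelete(1, 3));   // healthy count
        System.out.println(unprotectedDelete(-1, 3));  // count already driven to -1
    }
}
```

In the second call the inode is removed, yet the caller cannot tell a real count of -1 apart from the failure sentinel, so the file's blocks stay behind in the map.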
