[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436301#comment-13436301 ]

Aaron T. Myers commented on HDFS-2966:
--------------------------------------

Hey Trevor, please open a new issue for it. Thanks a lot.

> TestNameNodeMetrics tests can fail under load
> ---------------------------------------------
>
>                 Key: HDFS-2966
>                 URL: https://issues.apache.org/jira/browse/HDFS-2966
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.0.0-alpha
>         Environment: OS/X running intellij IDEA, firefox, winxp in a
> virtualbox.
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>             Fix For: 2.2.0-alpha
>
>         Attachments: HDFS-2966.patch, HDFS-2966.patch, HDFS-2966.patch,
> HDFS-2966.patch
>
>
> I've managed to recreate HDFS-540 and HDFS-2434 by the simple technique of
> running the HDFS tests on a desktop without enough memory for all the
> programs trying to run. Things got swapped out and the tests failed as the DN
> heartbeats didn't come in on time.
> The tests both rely on {{waitForDeletion()}} to block the tests until the
> delete operation has completed, but all it does is sleep for the same number
> of seconds as there are datanodes. This is too brittle: it may work on a
> lightly loaded system, but not on a system under heavy load where it is
> taking longer to replicate than expected.
> Immediate fix: double, triple, the sleep time?
> Better fix: have the thread block until all the DN heartbeats have finished.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
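The "better fix" in the issue description (block until the cluster has actually reached the expected state, rather than sleep for a fixed time) can be sketched roughly as below. This is only an illustration of the idea and not the code in HDFS-2966.patch: the class PollingWait, the method waitFor and its parameters are hypothetical names, and the condition would in practice be whatever check the test needs.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; it is not part of Hadoop or of HDFS-2966.patch. */
final class PollingWait {
    private PollingWait() {}

    /**
     * Polls the condition until it holds or the timeout expires.
     * Returns true if the condition was observed to hold, false on timeout or interrupt.
     */
    static boolean waitFor(BooleanSupplier condition, long pollMillis, long timeoutMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;   // stop as soon as the expected state is reached
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;  // timed out; the caller decides how to fail the test
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

Compared with a fixed sleep, a loop like this returns early on a lightly loaded machine and keeps waiting, up to the timeout, on a heavily loaded one.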
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434660#comment-13434660 ]

Trevor Robinson commented on HDFS-2966:
---------------------------------------

I hit this on my last 2 builds of trunk. I don't see an open issue on it, so should I create a new issue or reopen this one (or HDFS-540)?

{noformat}
testCorruptBlock(org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics)  Time elapsed: 7.082 sec  <<< FAILURE!
java.lang.AssertionError: Bad value for metric PendingReplicationBlocks expected:<0> but was:<1>
    at org.junit.Assert.fail(Assert.java:91)
    at org.junit.Assert.failNotEquals(Assert.java:645)
    at org.junit.Assert.assertEquals(Assert.java:126)
    at org.junit.Assert.assertEquals(Assert.java:470)
    at org.apache.hadoop.test.MetricsAsserts.assertGauge(MetricsAsserts.java:191)
    at org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics.testCorruptBlock(TestNameNodeMetrics.java:186)
{noformat}
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226857#comment-13226857 ]

Hudson commented on HDFS-2966:
------------------------------

Integrated in Hadoop-Mapreduce-trunk #1015 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1015/])
    HDFS-2966 (Revision 1298820)

     Result = SUCCESS
stevel : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1298820
Files :
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/metrics/TestNameNodeMetrics.java

> TestNameNodeMetrics tests can fail under load
> ---------------------------------------------
>
>                 Key: HDFS-2966
>                 URL: https://issues.apache.org/jira/browse/HDFS-2966
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.24.0
>         Environment: OS/X running intellij IDEA, firefox, winxp in a
> virtualbox.
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>             Fix For: 0.24.0
>
>         Attachments: HDFS-2966.patch, HDFS-2966.patch, HDFS-2966.patch,
> HDFS-2966.patch
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226836#comment-13226836 ]

Hudson commented on HDFS-2966:
------------------------------

Integrated in Hadoop-Hdfs-trunk #980 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/980/])
    HDFS-2966 (Revision 1298820)

     Result = SUCCESS
stevel : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1298820
Files :
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/metrics/TestNameNodeMetrics.java
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226082#comment-13226082 ]

Hudson commented on HDFS-2966:
------------------------------

Integrated in Hadoop-Mapreduce-trunk-Commit #1865 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1865/])
    HDFS-2966 (Revision 1298820)

     Result = ABORTED
stevel : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1298820
Files :
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/metrics/TestNameNodeMetrics.java
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226061#comment-13226061 ]

Hudson commented on HDFS-2966:
------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #1931 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1931/])
    HDFS-2966 (Revision 1298820)

     Result = SUCCESS
stevel : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1298820
Files :
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/metrics/TestNameNodeMetrics.java
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226062#comment-13226062 ]

Hudson commented on HDFS-2966:
------------------------------

Integrated in Hadoop-Common-trunk-Commit #1856 (See [https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1856/])
    HDFS-2966 (Revision 1298820)

     Result = SUCCESS
stevel : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1298820
Files :
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/metrics/TestNameNodeMetrics.java
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222611#comment-13222611 ]

Hadoop QA commented on HDFS-2966:
---------------------------------

+1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12517086/HDFS-2966.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.
    +1 tests included.  The patch appears to include 3 new or modified tests.
    +1 javadoc.  The javadoc tool did not generate any warning messages.
    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.
    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit.  The applied patch does not increase the total number of release audit warnings.
    +1 core tests.  The patch passed unit tests in .
    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1952//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1952//console

This message is automatically generated.

> TestNameNodeMetrics tests can fail under load
> ---------------------------------------------
>
>                 Key: HDFS-2966
>                 URL: https://issues.apache.org/jira/browse/HDFS-2966
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.24.0
>         Environment: OS/X running intellij IDEA, firefox, winxp in a
> virtualbox.
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>             Fix For: 0.24.0, 0.23.2
>
>         Attachments: HDFS-2966.patch, HDFS-2966.patch, HDFS-2966.patch,
> HDFS-2966.patch
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222543#comment-13222543 ]

Aaron T. Myers commented on HDFS-2966:
--------------------------------------

+1, pending Jenkins.
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216230#comment-13216230 ]

Aaron T. Myers commented on HDFS-2966:
--------------------------------------

bq. Rather than rename the method, I made the metric scope a parameter, so it can block for other metrics too.

I still find the name a tad misleading, since the method is still a little DN-specific. e.g. it retries based on the number of DNs, and sleeps a multiple of DFS_REPLICATION_INTERVAL. But, I don't feel super strongly about this point. Take it or leave it.

+1, the patch looks good to me, assuming you don't want to address the above feedback.

> TestNameNodeMetrics tests can fail under load
> ---------------------------------------------
>
>                 Key: HDFS-2966
>                 URL: https://issues.apache.org/jira/browse/HDFS-2966
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.24.0
>         Environment: OS/X running intellij IDEA, firefox, winxp in a
> virtualbox.
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>             Fix For: 0.24.0, 0.23.2
>
>         Attachments: HDFS-2966.patch, HDFS-2966.patch, HDFS-2966.patch
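The behaviour described in the comment above (a wait helper that takes the metric name as a parameter, retries a number of times derived from the datanode count, and sleeps a multiple of the replication interval between reads) might look roughly like the sketch below. It is not the method from HDFS-2966.patch: the class name, the signature, the constants RETRIES_PER_DATANODE and REPLICATION_INTERVAL_MILLIS, and the way the gauge is read are all assumptions made for illustration.

```java
import java.util.function.ToLongFunction;

/** Illustrative sketch only; names, signature and constants are assumptions, not the patch. */
final class GaugeWaitSketch {
    /** Assumed retry budget per datanode (hypothetical value). */
    static final int RETRIES_PER_DATANODE = 10;
    /** Stand-in for the test's replication interval, in milliseconds (hypothetical value). */
    static final long REPLICATION_INTERVAL_MILLIS = 5;
    /** Sleep a multiple of the replication interval between reads. */
    static final long SLEEP_MILLIS = 2 * REPLICATION_INTERVAL_MILLIS;

    private GaugeWaitSketch() {}

    /**
     * Re-reads the named gauge until it equals the expected value or the retry
     * budget (scaled by the number of datanodes) is used up.
     * Returns the last value read, so the caller can still assert on it.
     */
    static long waitForGauge(ToLongFunction<String> readGauge, String gaugeName,
                             long expected, int numDatanodes) {
        int retriesLeft = numDatanodes * RETRIES_PER_DATANODE;
        long current = readGauge.applyAsLong(gaugeName);
        while (current != expected && retriesLeft-- > 0) {
            try {
                Thread.sleep(SLEEP_MILLIS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                break;
            }
            current = readGauge.applyAsLong(gaugeName);
        }
        return current;
    }
}
```

Because the retry budget and the sleep are tied to the datanode count and the replication interval, a helper shaped like this stays somewhat DN-specific even with the metric name as a parameter, which is the naming concern raised in the comment.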
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216223#comment-13216223 ]

Hadoop QA commented on HDFS-2966:
---------------------------------

+1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12516000/HDFS-2966.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.
    +1 tests included.  The patch appears to include 3 new or modified tests.
    +1 javadoc.  The javadoc tool did not generate any warning messages.
    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
    +1 eclipse:eclipse.  The patch built with eclipse:eclipse.
    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    +1 release audit.  The applied patch does not increase the total number of release audit warnings.
    +1 core tests.  The patch passed unit tests in .
    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1910//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1910//console

This message is automatically generated.
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214891#comment-13214891 ]

Suresh Srinivas commented on HDFS-2966:
---------------------------------------

Steve, please see HDFS-3002 - comment https://issues.apache.org/jira/browse/HDFS-3002?focusedCommentId=13214890&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13214890

> TestNameNodeMetrics tests can fail under load
> ---------------------------------------------
>
>                 Key: HDFS-2966
>                 URL: https://issues.apache.org/jira/browse/HDFS-2966
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.24.0
>         Environment: OS/X running intellij IDEA, firefox, winxp in a
> virtualbox.
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>             Fix For: 0.24.0, 0.23.2
>
>         Attachments: HDFS-2966.patch, HDFS-2966.patch
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212194#comment-13212194 ]

Aaron T. Myers commented on HDFS-2966:
--------------------------------------

We can forego #3 for this JIRA, if you want. I may switch the whole test to use separate mini clusters in HDFS-2978 anyway, which would render this point moot.

> TestNameNodeMetrics tests can fail under load
> ---------------------------------------------
>
>                 Key: HDFS-2966
>                 URL: https://issues.apache.org/jira/browse/HDFS-2966
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.24.0
>         Environment: OS/X running intellij IDEA, firefox, winxp in a
> virtualbox.
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>         Attachments: HDFS-2966.patch
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212091#comment-13212091 ]

Steve Loughran commented on HDFS-2966:
--------------------------------------

point 1: probably trying to reduce change just to make sure I didn't accidentally remove an assertion. I will pull it.

point 2: seems good.

point 3: when I stripped it down too much the following tests fail: state propagates from one to the other. Moving to separate mini clusters could fix that but it would make things slower. That leaves "adding poll loops before each test case to ensure the FS is in the stable state before each test run". That's a harder thing to do and maybe something that can be put off unless this patch doesn't solve most people's problems.
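One possible shape for the "poll loops before each test case" idea in point 3 above is to read a snapshot of the relevant metrics repeatedly and only start the test once the snapshot has stopped changing. The sketch below is hypothetical throughout: StableStateSketch, awaitStable and the snapshot supplier are invented for illustration, nothing here is taken from the patch or from Hadoop, and how a real test would obtain its metrics snapshot is left open.

```java
import java.util.Map;
import java.util.function.Supplier;

/** Illustrative sketch only; not part of Hadoop or of HDFS-2966.patch. */
final class StableStateSketch {
    private StableStateSketch() {}

    /**
     * Polls a metrics snapshot until it is unchanged for requiredStablePolls
     * consecutive reads, or until maxPolls reads have been made.
     * Returns true if the snapshot settled, false otherwise.
     */
    static boolean awaitStable(Supplier<Map<String, Long>> snapshot,
                               int requiredStablePolls, int maxPolls, long pollMillis) {
        Map<String, Long> previous = snapshot.get();
        int stablePolls = 0;
        for (int i = 0; i < maxPolls; i++) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                return false;
            }
            Map<String, Long> current = snapshot.get();
            // Count consecutive unchanged reads; any change resets the counter.
            stablePolls = current.equals(previous) ? stablePolls + 1 : 0;
            if (stablePolls >= requiredStablePolls) {
                return true;
            }
            previous = current;
        }
        return false;
    }
}
```

An unchanged snapshot only shows that nothing is moving any more, not that the values are the ones a test expects, so a check like this would complement waiting for specific gauge values rather than replace it.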
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212024#comment-13212024 ] Aaron T. Myers commented on HDFS-2966: -- Hey Steve, patch looks pretty good. I agree this issue could stand to be improved. I've also seen spurious failures in this test. A few comments: # In the spot where you call waitForGaugeValue for "FilesTotal", you also unnecessarily assert the value for FilesTotal. # The name "waitForGaugeValue" seems a little misleading, since it's not a general-purpose method for gauges, but rather somewhat specific to gauges that are a function of _DN metrics_. Perhaps consider renaming it to something like "waitForDnMetricValue" ? # Though the patch manages to get rid of the most race-prone sleeps (DN metrics), I don't think it will necessarily completely solve the issue for very slow VMs, since there are still several calls to updateMetrics. Can we completely remove the need for updateMetrics in this test, by waiting for a specific value as you've done here? > TestNameNodeMetrics tests can fail under load > - > > Key: HDFS-2966 > URL: https://issues.apache.org/jira/browse/HDFS-2966 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 0.24.0 > Environment: OS/X running intellij IDEA, firefox, winxp in a > virtualbox. >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Attachments: HDFS-2966.patch > > > I've managed to recreate HDFS-540 and HDFS-2434 by the simple technique of > running the HDFS tests on a desktop with out enough memory for all the > programs trying to run. Things got swapped out and the tests failed as the DN > heartbeats didn't come in on time. > the tests both rely on {{waitForDeletion()}} to block the tests until the > delete operation has completed, but all it does is sleep for the same number > of seconds as there are datanodes. 
This is too brittle -it may work on a > lightly-loaded system, but not on a system under heavy load where it is > taking longer to replicate than expect. > Immediate fix: double, triple, the sleep time? > Better fix: have the thread block until all the DN heartbeats have finished.
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210998#comment-13210998 ] Hadoop QA commented on HDFS-2966: - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12515090/HDFS-2966.patch against trunk revision . +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in . +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/1885//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/1885//console This message is automatically generated. > TestNameNodeMetrics tests can fail under load > - > > Key: HDFS-2966 > URL: https://issues.apache.org/jira/browse/HDFS-2966 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 0.24.0 > Environment: OS/X running intellij IDEA, firefox, winxp in a > virtualbox. >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Attachments: HDFS-2966.patch > > > I've managed to recreate HDFS-540 and HDFS-2434 by the simple technique of > running the HDFS tests on a desktop with out enough memory for all the > programs trying to run. Things got swapped out and the tests failed as the DN > heartbeats didn't come in on time. 
> the tests both rely on {{waitForDeletion()}} to block the tests until the > delete operation has completed, but all it does is sleep for the same number > of seconds as there are datanodes. This is too brittle -it may work on a > lightly-loaded system, but not on a system under heavy load where it is > taking longer to replicate than expect. > Immediate fix: double, triple, the sleep time? > Better fix: have the thread block until all the DN heartbeats have finished.
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210972#comment-13210972 ] Steve Loughran commented on HDFS-2966: -- The patch applies a sleep (for the same delay as before), then a poll+sleep loop for a limited number of retries before giving up. Provided the assertions are failing on the exit of the wait cycle, rather than on the initial state of the tests, this polling should significantly reduce the probability of failure under load. For anyone worried that the polling would slow down the test run on a machine not under load: this does not appear to be the case. Before {code} Running org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 47.042 sec {code} On two runs after making the changes, the elapsed times were 46.709 sec and 42.995 sec, so the test takes about the same time. > TestNameNodeMetrics tests can fail under load > - > > Key: HDFS-2966 > URL: https://issues.apache.org/jira/browse/HDFS-2966 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 0.24.0 > Environment: OS/X running intellij IDEA, firefox, winxp in a > virtualbox. >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Attachments: HDFS-2966.patch > > > I've managed to recreate HDFS-540 and HDFS-2434 by the simple technique of > running the HDFS tests on a desktop with out enough memory for all the > programs trying to run. Things got swapped out and the tests failed as the DN > heartbeats didn't come in on time. > the tests both rely on {{waitForDeletion()}} to block the tests until the > delete operation has completed, but all it does is sleep for the same number > of seconds as there are datanodes. This is too brittle -it may work on a > lightly-loaded system, but not on a system under heavy load where it is > taking longer to replicate than expect. > Immediate fix: double, triple, the sleep time? > Better fix: have the thread block until all the DN heartbeats have finished.
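The sleep-then-poll approach described in the comment above might look roughly like the following. This is only a sketch under stated assumptions: the class name MetricWait, the method waitForValue and all of its parameters are hypothetical, and it is not the code in the attached HDFS-2966.patch. The gauge is modelled as a plain LongSupplier rather than a real NameNode metric.

```java
import java.util.function.LongSupplier;

/** Hypothetical test helper; an illustration, not the code in HDFS-2966.patch. */
final class MetricWait {
  private MetricWait() {}

  /**
   * Sleeps once for initialSleepMs, then re-reads the gauge up to maxRetries
   * more times, sleeping pollIntervalMs between reads, until it equals expected.
   * Returns the last value read so the caller can assert on it.
   */
  static long waitForValue(LongSupplier gauge, long expected,
      long initialSleepMs, long pollIntervalMs, int maxRetries) {
    // Initial sleep: gives events from the previous test time to drain,
    // so the first read is less likely to see stale state.
    sleep(initialSleepMs);
    long observed = gauge.getAsLong();
    // Bounded follow-up polling: exit early on success, give up after maxRetries.
    for (int retry = 0; retry < maxRetries && observed != expected; retry++) {
      sleep(pollIntervalMs);
      observed = gauge.getAsLong();
    }
    return observed;
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      // Restore the interrupt flag and fail fast rather than keep polling.
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for metric", e);
    }
  }
}
```

A test would then assert on the returned value instead of asserting straight after a fixed sleep. On an idle machine the loop exits on the first or second read; on a loaded one it stretches out up to the retry limit. Any concrete delays and retry counts are placeholders to be tuned.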
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210969#comment-13210969 ] Steve Loughran commented on HDFS-2966: -- A problem here is that the tests are not independent: the fs events from the previous test can still be trickling through the filesystem when the next test starts running. A simple poll/sleep cycle actually behaves worse, because it can exit too early; the state of the previous test is still there and the more recent changes aren't yet in the metrics. A sleep followed by a poll cycle would appear to be a better process, though it may still have problems under load that only moving to per-test mini HDFS clusters would fix. > TestNameNodeMetrics tests can fail under load > - > > Key: HDFS-2966 > URL: https://issues.apache.org/jira/browse/HDFS-2966 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 0.24.0 > Environment: OS/X running intellij IDEA, firefox, winxp in a > virtualbox. >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > > I've managed to recreate HDFS-540 and HDFS-2434 by the simple technique of > running the HDFS tests on a desktop with out enough memory for all the > programs trying to run. Things got swapped out and the tests failed as the DN > heartbeats didn't come in on time. > the tests both rely on {{waitForDeletion()}} to block the tests until the > delete operation has completed, but all it does is sleep for the same number > of seconds as there are datanodes. This is too brittle -it may work on a > lightly-loaded system, but not on a system under heavy load where it is > taking longer to replicate than expect. > Immediate fix: double, triple, the sleep time? > Better fix: have the thread block until all the DN heartbeats have finished.
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210888#comment-13210888 ] Steve Loughran commented on HDFS-2966: -- My planned solution to this is to move from sleep-then-assert to sleep-poll-repeat for a longer period of time. If the state is reached sooner, the test finishes earlier, but if the machine is overloaded the test will stretch out. This may make it faster on some machines, as well as less brittle on others. > TestNameNodeMetrics tests can fail under load > - > > Key: HDFS-2966 > URL: https://issues.apache.org/jira/browse/HDFS-2966 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 0.24.0 > Environment: OS/X running intellij IDEA, firefox, winxp in a > virtualbox. >Reporter: Steve Loughran >Priority: Minor > > I've managed to recreate HDFS-540 and HDFS-2434 by the simple technique of > running the HDFS tests on a desktop with out enough memory for all the > programs trying to run. Things got swapped out and the tests failed as the DN > heartbeats didn't come in on time. > the tests both rely on {{waitForDeletion()}} to block the tests until the > delete operation has completed, but all it does is sleep for the same number > of seconds as there are datanodes. This is too brittle -it may work on a > lightly-loaded system, but not on a system under heavy load where it is > taking longer to replicate than expect. > Immediate fix: double, triple, the sleep time? > Better fix: have the thread block until all the DN heartbeats have finished.
[jira] [Commented] (HDFS-2966) TestNameNodeMetrics tests can fail under load
[ https://issues.apache.org/jira/browse/HDFS-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210421#comment-13210421 ] Steve Loughran commented on HDFS-2966: -- stack trace against trunk
{code}
---
Test set: org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics
---
Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 53.526 sec <<< FAILURE!
testCorruptBlock(org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics) Time elapsed: 8.968 sec <<< FAILURE!
java.lang.AssertionError: Bad value for metric PendingReplicationBlocks expected:<0> but was:<1>
  at org.junit.Assert.fail(Assert.java:91)
  at org.junit.Assert.failNotEquals(Assert.java:645)
  at org.junit.Assert.assertEquals(Assert.java:126)
  at org.junit.Assert.assertEquals(Assert.java:470)
  at org.apache.hadoop.test.MetricsAsserts.assertGauge(MetricsAsserts.java:185)
  at org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics.testCorruptBlock(TestNameNodeMetrics.java:185)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
  at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
  at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
  at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
  at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
  at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
  at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
  at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
  at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
  at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
  at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175)
  at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:81)
  at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68)
{code}
> TestNameNodeMetrics tests can fail under load > - > > Key: HDFS-2966 > URL: https://issues.apache.org/jira/browse/HDFS-2966 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 0.24.0 > Environment: OS/X running intellij IDEA, firefox, winxp in a > virtualbox.
>Reporter: Steve Loughran >Priority: Minor > > I've managed to recreate HDFS-540 and HDFS-2434 by the simple technique of > running the HDFS tests on a desktop with out enough memory for all the > programs trying to run. Things got swapped out and the tests failed as the DN > heartbeats didn't come in on time. > the tests both rely on {{waitForDeletion()}} to block the tests until the > delete operation has completed, but all it does is sleep for the same number > of seconds as there are datanodes. This is too brittle -it may work o