[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423118#comment-15423118 ] Zhe Zhang commented on HDFS-10270: -- I just committed the patch to branch-2.7 as part of an effort of backporting HDFS-7933. The patch could only possibly affect {{TestJMXGet}} and I verified locally that it passes fine with the patch; the patch actually only makes the test easier to pass. > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 2.8.0, 3.0.0-alpha1 >Reporter: Andras Bokor >Assignee: Gergely Novák >Priority: Minor > Fix For: 2.8.0, 2.7.3 > > Attachments: HDFS-10270.001.patch, TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239566#comment-15239566 ] Hudson commented on HDFS-10270: --- FAILURE: Integrated in Hadoop-trunk-Commit #9603 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9603/]) HDFS-10270. TestJMXGet:testNameNode() fails. Contributed by Gergely (kihwal: rev d2f3bbc29046435904ad9418073795439c71b441) * hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/tools/TestJMXGet.java > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 3.0.0, 2.8.0 >Reporter: Andras Bokor >Assignee: Gergely Novák >Priority: Minor > Fix For: 2.8.0 > > Attachments: HDFS-10270.001.patch, TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239532#comment-15239532 ] Kihwal Lee commented on HDFS-10270: --- bq. Do you see any disadvantages removing NumOpenConnections check? I was trying to keep whatever the original author did, but in term of the value of the check, I think it is fine to remove. So +1 for the patch. > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 3.0.0, 2.8.0 >Reporter: Andras Bokor >Assignee: Gergely Novák >Priority: Minor > Attachments: HDFS-10270.001.patch, TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239525#comment-15239525 ] Kihwal Lee commented on HDFS-10270: --- bq. In addition playing with timeouts makes a JUnit test more complicated than a JUnit test should be. What is you opinion about these? Totally agree, but in reality hadoop unit tests are riddled with unreliable timeouts. Many of them are not really unit tests. They exist because of many reasons including Inherent difficulties in unit testing distributed functionalities, test unfriendly designs and simply bad test designs. Many are aware of the issue. > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 3.0.0, 2.8.0 >Reporter: Andras Bokor >Assignee: Gergely Novák >Priority: Minor > Attachments: HDFS-10270.001.patch, TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239238#comment-15239238 ] Andras Bokor commented on HDFS-10270: - In addition playing with timeouts makes a JUnit test more complicated than a JUnit test should be. What is you opinion about these? > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 3.0.0, 2.8.0 >Reporter: Andras Bokor >Assignee: Gergely Novák >Priority: Minor > Attachments: HDFS-10270.001.patch, TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239217#comment-15239217 ] Andras Bokor commented on HDFS-10270: - Hi [~kihwal], We do not see the value to check NumOpenConnections since it can cause intermittent failures and with the other checks the JMX metrics are tested enough. Our suggestion is to remove NumOpenConnections check. Do you see any disadvantages removing NumOpenConnections check? > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 3.0.0, 2.8.0 >Reporter: Andras Bokor >Assignee: Gergely Novák >Priority: Minor > Attachments: HDFS-10270.001.patch, TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237674#comment-15237674 ] Kihwal Lee commented on HDFS-10270: --- If adding a new test resource, don't forget to add it to the exclude list for apache-rat-plugin. > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 3.0.0, 2.8.0 >Reporter: Andras Bokor >Assignee: Gergely Novák >Priority: Minor > Attachments: HDFS-10270.001.patch, TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237396#comment-15237396 ] Kihwal Lee commented on HDFS-10270: --- It is not for testing the client connection being up. It is simply checking one of the metrics values reported in JMX. I don't know the reason why NumOpenConnections was chosen. The test had worked *reliably* until the jmx caching was fixed. The values used to be available right away, but now it takes about 10 seconds. So when it's working it adds about 10 more seconds of delay. But the original author also made a wrong assumption. The assumption was that the reason for the number of connections being 2 is due to having two datanodes. As you have correctly analyzed, this is not true in a MiniDFSCluster. Since the two datanodes are sharing the same JVM, a single connection was shared for the {{DatanodeProtocol}}. An additional connection was made for the client. In a real distributed cluster, it would have been 3 connections. I lean toward fixing the existing check than removing it. First it shouldn't check against the number of datanods, but simply 2. Regarding increasing ipc client idle timeout, it will make test run time longer, which is against what we have been trying to do. An alternative is to add a test resource to reduce the jmx update interval. We could add a {{hadoop-hdfs-project/hadoop-hdfs/src/test/resources/hadoop-metrics2.properties}} file with one line containing {{*.period=1}}. This will also reduce the run time of a number of test cases that query jmx to verify the result. What do you think? > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 3.0.0, 2.8.0 >Reporter: Andras Bokor >Assignee: Gergely Novák >Priority: Minor > Attachments: HDFS-10270.001.patch, TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236790#comment-15236790 ] Gergely Novák commented on HDFS-10270: -- So [~kihwal], are you suggesting to have a wait/assert for checking if the number of connections is equal to 2 (datanode, client)? And to fulfil this to increase the ipc client idle timeout? If so, can you (or the original author of the test) explain why we need to test if the client is still connected? > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 3.0.0, 2.8.0 >Reporter: Andras Bokor >Assignee: Gergely Novák >Priority: Minor > Attachments: HDFS-10270.001.patch, TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235349#comment-15235349 ] Kihwal Lee commented on HDFS-10270: --- It looks like a race was introduced after the jmx caching was "fixed". Certain metrics values in JMX is now using the cached value, which expires in 10 seconds by default. Since the ipc client idle timeout is also 10 seconds, The ipc connection may get closed before the jmx refresh. So the number of actual connections reaches to 2, but can drop to 1 just before a new jmx data is fetched. We could fix this by either lowering the jmx cache expiration or increasing ipc client idle timeout. Since it is a JMX test, it might be better to leave the jmx setting as default and change the ipc timeout to, say, 15 seconds. > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Affects Versions: 3.0.0, 2.8.0 >Reporter: Andras Bokor >Assignee: Gergely Novák >Priority: Minor > Attachments: HDFS-10270.001.patch, TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235117#comment-15235117 ] Hadoop QA commented on HDFS-10270: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 24s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 25s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 7s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 20s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 42s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 52s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 56s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 52s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 51s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 46s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 56s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 16s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 17s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 1s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 103m 25s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_77. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 95m 16s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_95. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 231m 44s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_77 Failed junit tests | hadoop.hdfs.TestDFSUpgradeFromImage | | | hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations | | | hadoop.hdfs.server.datanode.TestDataNodeMetrics | | | hadoop.hdfs.security.TestDelegationTokenForProxyUser | | | hadoop.hdfs.server.datanode.TestDataNodeMXBean | | | hadoop.hdfs.server.namenode.ha.TestRequestHedgingProxyProvider | | | hadoop.hdfs.server.datanode.TestDirectoryScanner | | | hadoop.fs.contract.hdfs.TestHDFSContractSeek | | JDK v1.7.0_95 Failed junit tests | hadoop.hdfs.TestDFSUpgradeFromImage | | | hadoop.hdfs.server.namenode.ha.TestEditLogTailer | | | hadoop.hdfs.server.namen
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234823#comment-15234823 ] Gergely Novák commented on HDFS-10270: -- We observed that the fail is caused by this assert: {code} DFSTestUtil.waitForMetric(jmx, "NumOpenConnections", numDatanodes); {code} This checks if the number of open connections equals to the number of data nodes. But the number of open connections has absolutely no dependency from the data nodes: it's either 0, 1 (DataNodeProtocol) or 2 (DataNodeProtocol and ClientProtocol). The test passes in those rare cases when the ClientProtocol hasn't timed out when the {{TestNameNode}} runs (this can only happen if the tests are run separately). If we increment the number of data nodes (to 3, or so) the test will always fail. Contrarily if we increase the client timeout ({{ipc.client.connection.maxidletime}}) the test will always pass. Our suggestion is to remove this assert. > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Reporter: Andras Bokor >Priority: Minor > Attachments: TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-10270) TestJMXGet:testNameNode() fails
[ https://issues.apache.org/jira/browse/HDFS-10270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15231868#comment-15231868 ] Andras Bokor commented on HDFS-10270: - I found that it fails when I run the two tests together so testDataNode() runs before testNameNode(). If I run only testNameNode or I change the order the tests pass. I debugged a bit the failed run. It opens two RPC connections but later on it closes one of them in Server class doRead(SelectionKey key) method: {code} if (count < 0) { closeConnection(c); c = null; } {code} Here the count is -1. We get -1 in Server:channelRead(ReadableByteChannel channel, ByteBuffer buffer) method: {code} int count = (buffer.remaining() <= NIO_BUFFER_LIMIT) ? channel.read(buffer) : channelIO(channel, null, buffer); {code} In this condition the true branch will run. Can somebody help us on this? > TestJMXGet:testNameNode() fails > --- > > Key: HDFS-10270 > URL: https://issues.apache.org/jira/browse/HDFS-10270 > Project: Hadoop HDFS > Issue Type: Bug > Components: test >Reporter: Andras Bokor >Priority: Minor > Attachments: TestJMXGet.log, TestJMXGetFails.log > > > It fails with java.util.concurrent.TimeoutException. Actually the problem > here is that we expect 2 as NumOpenConnections metric but it is only 1. So > the test waits 60 sec then fails. > Please find maven output so the stack trace attached ([^TestJMXGetFails.log]). -- This message was sent by Atlassian JIRA (v6.3.4#6332)