[ https://issues.apache.org/jira/browse/HDFS-7527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248411#comment-14248411 ]

Binglin Chang commented on HDFS-7527:
-------------------------------------

I read some related code. The test is intended to verify that the dfs.hosts
include list supports dfs.datanode.hostname (e.g. if you set a datanode's name
to host1 and the dfs.hosts file contains host1, that datanode should be able
to connect to the namenode). But after reading the code, it turns out that
DatanodeManager checks the dfs.hosts list by IP address only, not by hostname
(the namenode resolves all hostnames in the dfs.hosts file to IP addresses),
so the expected behavior is for this test to fail.
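
To make the mismatch concrete, here is a minimal standalone sketch (hypothetical
class and method names, not the actual DatanodeManager code) of the IP-only
check: an include entry that does not resolve can never match, regardless of
what hostname the datanode registers with.

{code}
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, NOT the real DatanodeManager code.
class IncludeListSketch {
  private final Set<String> includedIps = new HashSet<String>();

  // Mirrors the refresh step: every entry in the dfs.hosts file is resolved
  // to an IP address up front, so an entry like "host1" that does not
  // resolve simply drops out of the list.
  void refresh(Iterable<String> includeFileEntries) {
    for (String entry : includeFileEntries) {
      try {
        includedIps.add(InetAddress.getByName(entry).getHostAddress());
      } catch (UnknownHostException e) {
        // Unresolvable entry: skipped, never matchable.
      }
    }
  }

  // Mirrors the registration check: only the datanode's IP is compared; its
  // self-reported dfs.datanode.hostname is never consulted.
  boolean isAllowed(String datanodeIp) {
    return includedIps.contains(datanodeIp);
  }
}
{code}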
The reason the test passes most of the time is that the test code is missing a
proper wait to make sure the old datanode has expired:

{code}
    refreshNodes(cluster.getNamesystem(0), hdfsConf);
    cluster.restartDataNode(0);
   
    // There should be some wait time here before the original datanode
    // becomes dead, or the following check will always succeed, because
    // the old datanode is still alive.

    // Wait for the DN to come back.
    while (true) {
      DatanodeInfo info[] = client.datanodeReport(DatanodeReportType.LIVE);
      if (info.length == 1) {
        Assert.assertFalse(info[0].isDecommissioned());
        Assert.assertFalse(info[0].isDecommissionInProgress());
        assertEquals(registrationName, info[0].getHostName());
        break;
      }
      LOG.info("Waiting for datanode to come back");
      Thread.sleep(HEARTBEAT_INTERVAL * 1000);
    }
{code}
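
For reference, the missing wait could look roughly like the following (a
hypothetical sketch, reusing the client, LOG, and HEARTBEAT_INTERVAL names
from the test above; DatanodeReportType.DEAD is the dead-node counterpart of
the LIVE report already used there):

{code}
    // Sketch of the missing wait: block until the namenode declares the
    // pre-restart registration dead, so the LIVE check above can no longer
    // be satisfied by the stale datanode entry.
    while (true) {
      DatanodeInfo dead[] = client.datanodeReport(DatanodeReportType.DEAD);
      if (dead.length == 1) {
        break;
      }
      LOG.info("Waiting for old datanode to be marked dead");
      Thread.sleep(HEARTBEAT_INTERVAL * 1000);
    }
{code}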

I added some sleep time at the point marked by the comment above, and the test
then always fails, which verifies my theory.
Since the test is not valid, I think we should just remove it.
 



> TestDecommission.testIncludeByRegistrationName fails occasionally in trunk
> --------------------------------------------------------------------------
>
>                 Key: HDFS-7527
>                 URL: https://issues.apache.org/jira/browse/HDFS-7527
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode, test
>            Reporter: Yongjun Zhang
>            Assignee: Binglin Chang
>
> https://builds.apache.org/job/Hadoop-Hdfs-trunk/1974/testReport/
> {quote}
> Error Message
> test timed out after 360000 milliseconds
> Stacktrace
> java.lang.Exception: test timed out after 360000 milliseconds
>       at java.lang.Thread.sleep(Native Method)
>       at org.apache.hadoop.hdfs.TestDecommission.testIncludeByRegistrationName(TestDecommission.java:957)
> 2014-12-15 12:00:19,958 ERROR datanode.DataNode (BPServiceActor.java:run(836)) - Initialization failed for Block pool BP-887397778-67.195.81.153-1418644469024 (Datanode Uuid null) service to localhost/127.0.0.1:40565 Datanode denied communication with namenode because the host is not in the include-list: DatanodeRegistration(127.0.0.1, datanodeUuid=55d8cbff-d8a3-4d6d-ab64-317fff0ee279, infoPort=54318, infoSecurePort=0, ipcPort=43726, storageInfo=lv=-56;cid=testClusterID;nsid=903754315;c=0)
>       at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:915)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:4402)
>       at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.registerDatanode(NameNodeRpcServer.java:1196)
>       at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.registerDatanode(DatanodeProtocolServerSideTranslatorPB.java:92)
>       at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26296)
>       at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:637)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:966)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2127)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2123)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2121)
> 2014-12-15 12:00:29,087 FATAL datanode.DataNode (BPServiceActor.java:run(841)) - Initialization failed for Block pool BP-887397778-67.195.81.153-1418644469024 (Datanode Uuid null) service to localhost/127.0.0.1:40565. Exiting.
> java.io.IOException: DN shut down before block pool connected
>       at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.retrieveNamespaceInfo(BPServiceActor.java:186)
>       at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:216)
>       at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:829)
>       at java.lang.Thread.run(Thread.java:745)
> {quote}
> Found by tool proposed in HADOOP-11045:
> {quote}
> [yzhang@localhost jenkinsftf]$ ./determine-flaky-tests-hadoop.py -j Hadoop-Hdfs-trunk -n 5 | tee bt.log
> ****Recently FAILED builds in url: https://builds.apache.org//job/Hadoop-Hdfs-trunk
>     THERE ARE 4 builds (out of 6) that have failed tests in the past 5 days, as listed below:
> ===>https://builds.apache.org/job/Hadoop-Hdfs-trunk/1974/testReport (2014-12-15 03:30:01)
>     Failed test: org.apache.hadoop.hdfs.TestDecommission.testIncludeByRegistrationName
>     Failed test: org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testNumVersionsReportedCorrect
>     Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
> ===>https://builds.apache.org/job/Hadoop-Hdfs-trunk/1972/testReport (2014-12-13 10:32:27)
>     Failed test: org.apache.hadoop.hdfs.TestDecommission.testIncludeByRegistrationName
> ===>https://builds.apache.org/job/Hadoop-Hdfs-trunk/1971/testReport (2014-12-13 03:30:01)
>     Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
> ===>https://builds.apache.org/job/Hadoop-Hdfs-trunk/1969/testReport (2014-12-11 03:30:01)
>     Failed test: org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testNumVersionsReportedCorrect
>     Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
>     Failed test: org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover.testFailoverRightBeforeCommitSynchronization
> Among 6 runs examined, all failed tests <#failedRuns: testName>:
>     3: org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA.testUpdatePipeline
>     2: org.apache.hadoop.hdfs.TestDecommission.testIncludeByRegistrationName
>     2: org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testNumVersionsReportedCorrect
>     1: org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover.testFailoverRightBeforeCommitSynchronization
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
