[ 
https://issues.apache.org/jira/browse/HADOOP-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297380#comment-16297380
 ] 

Karthik Palaniappan commented on HADOOP-15129:
----------------------------------------------

[~arpitagarwal]: Here are the notes from our investigation.

startDataNode() calls refreshNamenodes(): 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L1425

refreshNamenodes has unresolved InetSocketAddress(es): 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolManager.java#L152,
 and the unresolved address(es) get passed to a BPOfferService. Note that a 
BPOfferService is only created once for a NameServiceId.

BPOS caches that ISA: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java#L134

And creates a BPServiceActor with that ISA: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPOfferService.java#L138

When the BPServiceActor tries to get NN info, it hits that exception (and logs 
“Problem connecting to server”): 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java#L235

Here is the stacktrace of the exception:

{code:java}
java.net.UnknownHostException: Invalid host name: local host is: (unknown); 
destination host is: "non-existent":8020; java.net.UnknownHostException; For 
more details see:  http://wiki.apache.org/hadoop/UnknownHost
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.retrieveNamespaceInfo(BPServiceActor.java:221)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:263)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:752)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: Invalid host name: local host is: 
(unknown); destination host is: "non-existent":8020; 
java.net.UnknownHostException; For more details see:  
http://wiki.apache.org/hadoop/UnknownHost
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:744)
        at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:446)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524)
        at org.apache.hadoop.ipc.Client.call(Client.java:1375)
        at org.apache.hadoop.ipc.Client.call(Client.java:1339)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy15.versionRequest(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.versionRequest(DatanodeProtocolClientSideTranslatorPB.java:274)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.retrieveNamespaceInfo(BPServiceActor.java:216)
        ... 3 more
{code}

BPServiceActor uses a while(true) loop that runs every 5 seconds. I don’t think 
it ever stops (which is a little weird).


> Datanode caches namenode DNS lookup failure and cannot startup
> --------------------------------------------------------------
>
>                 Key: HADOOP-15129
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15129
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.8.2
>         Environment: Google Compute Engine.
> I'm using Java 8, Debian 8, Hadoop 2.8.2.
>            Reporter: Karthik Palaniappan
>            Assignee: Karthik Palaniappan
>            Priority: Minor
>         Attachments: HADOOP-15129.001.patch
>
>
> On startup, the Datanode creates an InetSocketAddress to register with each 
> namenode. Though there are retries on connection failure throughout the 
> stack, the same InetSocketAddress is reused.
> InetSocketAddress is an interesting class, because it resolves DNS names to 
> IP addresses on construction, and it is never refreshed. Hadoop re-creates an 
> InetSocketAddress in some cases just in case the remote IP has changed for a 
> particular DNS name: https://issues.apache.org/jira/browse/HADOOP-7472.
> Anyway, on startup, you cna see the Datanode log: "Namenode...remains 
> unresolved" -- referring to the fact that DNS lookup failed.
> {code:java}
> 2017-11-02 16:01:55,115 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Refresh request received for nameservices: null
> 2017-11-02 16:01:55,153 WARN org.apache.hadoop.hdfs.DFSUtilClient: Namenode 
> for null remains unresolved for ID null. Check your hdfs-site.xml file to 
> ensure namenodes are configured properly.
> 2017-11-02 16:01:55,156 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Starting BPOfferServices for nameservices: <default>
> 2017-11-02 16:01:55,169 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Block pool <registering> (Datanode Uuid unassigned) service to 
> cluster-32f5-m:8020 starting to offer service
> {code}
> The Datanode then proceeds to use this unresolved address, as it may work if 
> the DN is configured to use a proxy. Since I'm not using a proxy, it forever 
> prints out this message:
> {code:java}
> 2017-12-15 00:13:40,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:45,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:50,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:55,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:14:00,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Problem connecting to server: cluster-32f5-m:8020
> {code}
> Unfortunately, the log doesn't contain the exception that triggered it, but 
> the culprit is actually in IPC Client: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java#L444.
> This line was introduced in https://issues.apache.org/jira/browse/HADOOP-487 
> to give a clear error message when somebody mispells an address.
> However, the fix in HADOOP-7472 doesn't apply here, because that code happens 
> in Client#getConnection after the Connection is constructed.
> My proposed fix (will attach a patch) is to move this exception out of the 
> constructor and into a place that will trigger HADOOP-7472's logic to 
> re-resolve addresses. If the DNS failure was temporary, this will allow the 
> connection to succeed. If not, the connection will fail after ipc client 
> retries (default 10 seconds worth of retries).
> I want to fix this in ipc client rather than just in Datanode startup, as 
> this fixes temporary DNS issues for all of Hadoop.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to