Jeff Saremi created HDFS-14857:
----------------------------------

             Summary: FS operations fail in HA mode: DataNode fails to connect 
to NameNode
                 Key: HDFS-14857
                 URL: https://issues.apache.org/jira/browse/HDFS-14857
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 3.1.0
            Reporter: Jeff Saremi


In an HA configuration, if the NameNodes get restarted and if they're assigned 
new IP addresses, any client FS operation such as a copyFromLocal will fail 
with a message like the following:

{{2019-09-12 18:47:30,544 WARN hdfs.DataStreamer: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/init.sh._COPYING_ could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
 at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2211)
 ...}}

 

Looking at the DataNode's stderr shows the following:
 * The heartbeat service detects the IP change and recovers (almost)
 * At this stage, an *hdfs dfsadmin -report* reports all datanodes correctly
 * Once the write begins, the following exception shows up in the datanode log: *no route to host*

{{2019-09-12 01:35:11,251 WARN datanode.DataNode: IOException in offerService
java.io.EOFException: End of File Exception between local host is: "storage-0-0.storage-0-svc.test.svc.cluster.local/10.244.0.211"; destination host is: "nmnode-0-0.nmnode-0-svc.test.svc.cluster.local":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:789)
 at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
 at org.apache.hadoop.ipc.Client.call(Client.java:1491)
 at org.apache.hadoop.ipc.Client.call(Client.java:1388)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
 at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source)
 at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:166)
 at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:516)
 at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:646)
 at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:847)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:392)
 at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1850)
 at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1183)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1079)}}
{{2019-09-12 01:41:12,273 WARN ipc.Client: Address change detected. Old: nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 New: nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.220:9000}}
{{...}}

 

{{2019-09-12 01:41:12,482 INFO datanode.DataNode: Block pool 
BP-930210564-10.244.0.216-1568249865477 (Datanode Uuid 
7673ef28-957a-439f-a721-d47a4a6adb7b) service to 
nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 beginning 
handshake with NN}}
{{2019-09-12 01:41:12,534 INFO datanode.DataNode: Block pool Block pool 
BP-930210564-10.244.0.216-1568249865477 (Datanode Uuid 
7673ef28-957a-439f-a721-d47a4a6adb7b) service to 
nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 successfully 
registered with NN}}

 

*NOTE*: Note how, when the '{{Address change detected}}' message shows up, the printout correctly reports both the old and the new address ({{10.244.0.220}}). However, when the registration with the NN completes, the old IP address ({{10.244.0.217}}) is still being printed, which shows how cached copies of the IP addresses linger on.
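The stale registration printout above is consistent with the NameNode's socket address being resolved once and then cached. In Java, an {{InetSocketAddress}} captures the resolved IP at construction time, so picking up a DNS change requires building a fresh instance from the host name. A minimal sketch of that re-resolution pattern (the class and method names here are hypothetical illustrations, not Hadoop's actual code):

```java
import java.net.InetSocketAddress;

public class AddressRefresh {
    // Hypothetical helper illustrating the re-resolution pattern: a cached
    // InetSocketAddress keeps the IP it resolved at construction time, so the
    // only way to pick up a DNS change is to build a fresh instance from the
    // host name and compare it against the cached one.
    static InetSocketAddress refreshIfStale(InetSocketAddress cached) {
        InetSocketAddress fresh =
                new InetSocketAddress(cached.getHostName(), cached.getPort());
        // equals() compares the resolved IP as well as the port, so a changed
        // DNS record makes the two differ even though the host name is the same.
        return fresh.equals(cached) ? cached : fresh;
    }

    public static void main(String[] args) {
        InetSocketAddress cached = new InetSocketAddress("localhost", 9000);
        // localhost has not moved, so the cached address is kept
        System.out.println(refreshIfStale(cached));
    }
}
```

Hadoop's {{ipc.Client}} evidently does something similar (hence the 'Address change detected' message), but judging from these logs the DataNode side keeps using the stale address after registration.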

 

And the following is where the actual error happens, preventing any writes to the FS:

 

{{2019-09-12 18:45:29,843 INFO retry.RetryInvocationHandler: java.net.NoRouteToHostException: No Route to Host from storage-0-0.storage-0-svc.test.svc.cluster.local/10.244.0.211 to nmnode-0-1.nmnode-0-svc:50200 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost, while invoking InMemoryAliasMapProtocolClientSideTranslatorPB.read over nmnode-0-1.nmnode-0-svc/10.244.0.217:50200 after 3 failover attempts. Trying to failover after sleeping for 4452ms.}}

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
