Jeff Saremi created HDFS-14857:
----------------------------------

             Summary: FS operations fail in HA mode: DataNode fails to connect to NameNode
                 Key: HDFS-14857
                 URL: https://issues.apache.org/jira/browse/HDFS-14857
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 3.1.0
            Reporter: Jeff Saremi
In an HA configuration, if the NameNodes are restarted and assigned new IP addresses, any client FS operation such as a copyFromLocal will fail with a message like the following:

{{2019-09-12 18:47:30,544 WARN hdfs.DataStreamer: DataStreamer Exception}}
{{org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/init.sh._COPYING_ could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.}}
{{at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2211)}}
{{...}}

Looking at the DataNode's stderr shows the following:
* The heartbeat service detects the IP change and recovers (almost)
* At this stage, an *hdfs dfsadmin -report* reports all datanodes correctly
* Once the write begins, the following *no route to host* exception shows up in the datanode log:

{{2019-09-12 01:35:11,251 WARN datanode.DataNode: IOException in offerService}}
{{java.io.EOFException: End of File Exception between local host is: "storage-0-0.storage-0-svc.test.svc.cluster.local/10.244.0.211"; destination host is: "nmnode-0-0.nmnode-0-svc.test.svc.cluster.local":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException}}
{{at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)}}
{{at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)}}
{{at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)}}
{{at java.lang.reflect.Constructor.newInstance(Constructor.java:423)}}
{{at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)}}
{{at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:789)}}
{{at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)}}
{{at org.apache.hadoop.ipc.Client.call(Client.java:1491)}}
{{at org.apache.hadoop.ipc.Client.call(Client.java:1388)}}
{{at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)}}
{{at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)}}
{{at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source)}}
{{at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:166)}}
{{at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:516)}}
{{at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:646)}}
{{at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:847)}}
{{at java.lang.Thread.run(Thread.java:748)}}
{{Caused by: java.io.EOFException}}
{{at java.io.DataInputStream.readInt(DataInputStream.java:392)}}
{{at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1850)}}
{{at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1183)}}
{{at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1079)}}

{{2019-09-12 01:41:12,273 WARN ipc.Client: Address change detected. Old: nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 New: nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.220:9000}}
{{...}}
{{2019-09-12 01:41:12,482 INFO datanode.DataNode: Block pool BP-930210564-10.244.0.216-1568249865477 (Datanode Uuid 7673ef28-957a-439f-a721-d47a4a6adb7b) service to nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 beginning handshake with NN}}
{{2019-09-12 01:41:12,534 INFO datanode.DataNode: Block pool Block pool BP-930210564-10.244.0.216-1568249865477 (Datanode Uuid 7673ef28-957a-439f-a721-d47a4a6adb7b) service to nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 successfully registered with NN}}

*NOTE*: See how, when the {{Address change detected}} line shows up, the printout correctly shows the old and the new address ({{10.244.0.220}}). However, when the registration with the NN completes, the old IP address ({{10.244.0.217}}) is still being printed, which shows how cached copies of the IP addresses linger on.

And the following is where the actual error happens, preventing any writes to the FS:

{{2019-09-12 18:45:29,843 INFO retry.RetryInvocationHandler: java.net.NoRouteToHostException: No Route to Host from storage-0-0.storage-0-svc.test.svc.cluster.local/10.244.0.211 to nmnode-0-1.nmnode-0-svc:50200 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost, while invoking InMemoryAliasMapProtocolClientSideTranslatorPB.read over nmnode-0-1.nmnode-0-svc/10.244.0.217:50200 after 3 failover attempts. Trying to failover after sleeping for 4452ms.}}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
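As background on why a cached IP would linger (an illustration of the underlying JDK behavior, not taken from the Hadoop code itself): {{java.net.InetSocketAddress}} resolves its hostname once, at construction time, and keeps that {{InetAddress}} for the lifetime of the object. Any component that holds on to an old {{InetSocketAddress}} after a DNS change will therefore keep dialing the old IP until the address object is rebuilt. A minimal sketch ({{localhost}} stands in for a NameNode host; the hostname is illustrative only):

```java
import java.net.InetSocketAddress;

public class StaleAddressSketch {
    public static void main(String[] args) {
        // InetSocketAddress(String, int) resolves the hostname eagerly and
        // caches the resulting InetAddress inside the object. "localhost"
        // stands in here for a NameNode host like nmnode-0-1.nmnode-0-svc.
        InetSocketAddress cached = new InetSocketAddress("localhost", 9000);
        System.out.println("resolved once: " + cached.getAddress());

        // If DNS later maps the hostname to a new IP (e.g. after a NameNode
        // pod restart), the old object still reports the stale address.
        // Picking up the new mapping requires constructing a fresh instance,
        // which re-resolves against current DNS:
        InetSocketAddress fresh =
                new InetSocketAddress(cached.getHostString(), cached.getPort());
        System.out.println("re-resolved:   " + fresh.getAddress());
    }
}
```

This would be consistent with the logs above: the {{ipc.Client}} path evidently rebuilds its address (hence the {{Address change detected}} line with the new {{10.244.0.220}}), while other cached copies apparently do not, leaving the registration path still printing the old {{10.244.0.217}}.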