[ https://issues.apache.org/jira/browse/HDFS-14857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Íñigo Goiri reassigned HDFS-14857:
----------------------------------
    Assignee: Jeff Saremi

> FS operations fail in HA mode: DataNode fails to connect to NameNode
> --------------------------------------------------------------------
>
>                 Key: HDFS-14857
>                 URL: https://issues.apache.org/jira/browse/HDFS-14857
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.1.0
>            Reporter: Jeff Saremi
>            Assignee: Jeff Saremi
>            Priority: Major
>
> In an HA configuration, if the NameNodes are restarted and assigned new IP
> addresses, any client FS operation such as a copyFromLocal fails with a
> message like the following:
>
> {{2019-09-12 18:47:30,544 WARN hdfs.DataStreamer: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/init.sh._COPYING_ could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2211)
> ...}}
>
> Looking at the DataNode's stderr shows the following:
> * The heartbeat service detects the IP change and (almost) recovers
> * At this stage, an *hdfs dfsadmin -report* reports all datanodes correctly
> * Once the write begins, the following exception shows up in the datanode log: *no route to host*
>
> {{2019-09-12 01:35:11,251 WARN datanode.DataNode: IOException in offerService
> java.io.EOFException: End of File Exception between local host is: "storage-0-0.storage-0-svc.test.svc.cluster.local/10.244.0.211"; destination host is: "nmnode-0-0.nmnode-0-svc.test.svc.cluster.local":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:789)
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1491)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1388)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy17.sendHeartbeat(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:166)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:516)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:646)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:847)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1850)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1183)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1079)}}
>
> {{2019-09-12 01:41:12,273 WARN ipc.Client: Address change detected.
> Old: nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000
> New: nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.220:9000}}
> {{...}}
>
> {{2019-09-12 01:41:12,482 INFO datanode.DataNode: Block pool BP-930210564-10.244.0.216-1568249865477 (Datanode Uuid 7673ef28-957a-439f-a721-d47a4a6adb7b) service to nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 beginning handshake with NN}}
> {{2019-09-12 01:41:12,534 INFO datanode.DataNode: Block pool Block pool BP-930210564-10.244.0.216-1568249865477 (Datanode Uuid 7673ef28-957a-439f-a721-d47a4a6adb7b) service to nmnode-0-1.nmnode-0-svc.test.svc.cluster.local/10.244.0.217:9000 successfully registered with NN}}
>
> *NOTE*: When 'Address change detected' is logged, the printout correctly shows the old and the new address ({{10.244.0.220}}). However, once the registration with the NN completes, the old IP address ({{10.244.0.217}}) is still being printed, which shows that cached copies of the IP addresses linger on.
>
> The following is where the actual error happens, preventing any writes to the FS:
>
> {{2019-09-12 18:45:29,843 INFO retry.RetryInvocationHandler: java.net.NoRouteToHostException: No Route to Host from storage-0-0.storage-0-svc.test.svc.cluster.local/10.244.0.211 to nmnode-0-1.nmnode-0-svc:50200 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost, while invoking InMemoryAliasMapProtocolClientSideTranslatorPB.read over nmnode-0-1.nmnode-0-svc/10.244.0.217:50200 after 3 failover attempts. Trying to failover after sleeping for 4452ms.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
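The lingering-IP behavior described in the NOTE above can be illustrated with plain JDK networking, outside of Hadoop: a `java.net.InetSocketAddress` resolves its hostname exactly once at construction time and pins that IP forever, so any component that keeps the object around keeps the stale address. This is a minimal standalone sketch (the class and method names are illustrative, not Hadoop code):

```java
import java.net.InetSocketAddress;

// Minimal sketch of why a cached socket address keeps pointing at a
// stale IP after the host's DNS entry changes.
public class StaleAddressDemo {

    // An InetSocketAddress resolves its hostname once, in the constructor,
    // and never re-queries DNS afterwards. An address cached at startup
    // therefore goes stale if the peer is restarted with a new IP.
    static InetSocketAddress cachedAddress(String host, int port) {
        return new InetSocketAddress(host, port); // resolves exactly once
    }

    // Picking up a DNS change requires constructing a fresh
    // InetSocketAddress from the hostname; reusing the old object
    // silently reuses the old IP.
    static InetSocketAddress reresolve(InetSocketAddress stale) {
        return new InetSocketAddress(stale.getHostName(), stale.getPort());
    }

    public static void main(String[] args) {
        InetSocketAddress cached = cachedAddress("localhost", 9000);
        System.out.println("cached: " + cached.getAddress().getHostAddress());
        InetSocketAddress fresh = reresolve(cached);
        System.out.println("fresh:  " + fresh.getAddress().getHostAddress());
    }
}
```

On a long-lived process, only the `fresh` address would reflect an IP change made after startup; the `cached` one matches the registration log lines above that still print 10.244.0.217.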
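One mitigation direction consistent with the symptoms above is to re-resolve the NameNode hostname on each retry instead of reusing the cached address. The sketch below is a hypothetical illustration, not the actual Hadoop fix: the `Rpc` interface stands in for the DataNode's heartbeat call, and the retry loop simply rebuilds the `InetSocketAddress` (a fresh DNS lookup) after each failure:

```java
import java.io.IOException;
import java.net.InetSocketAddress;

// Hypothetical sketch: retry a call against a NameNode, re-resolving the
// hostname on every attempt so a DNS change is picked up instead of a
// cached, possibly stale, IP being reused.
public class ReresolveOnFailure {

    // Stand-in for the real heartbeat/registration RPC (illustrative only).
    interface Rpc {
        void call(InetSocketAddress nn) throws IOException;
    }

    static void callWithReresolve(String nnHost, int nnPort, Rpc rpc,
                                  int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Fresh DNS lookup on every attempt: a restarted NameNode with
            // a new IP is found as soon as its DNS record is updated.
            InetSocketAddress nn = new InetSocketAddress(nnHost, nnPort);
            try {
                rpc.call(nn);
                return; // success
            } catch (IOException e) {
                // e.g. EOFException or NoRouteToHostException from a stale IP
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Simulated RPC: the first attempt fails (stale address), the
        // second succeeds after re-resolution.
        callWithReresolve("localhost", 9000, nn -> {
            if (calls[0]++ == 0) {
                throw new IOException("no route to host");
            }
        }, 3);
        System.out.println("attempts: " + calls[0]); // → attempts: 2
    }
}
```

The same idea applies to the failover proxy path in the last log excerpt: as long as retries keep dereferencing a cached address object, all failover attempts target the dead 10.244.0.217 endpoint.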