[ https://issues.apache.org/jira/browse/HADOOP-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070607#comment-13070607 ]
Kihwal Lee commented on HADOOP-7472:
------------------------------------

Description of the manual testing performed.

Set up:
* The namenode runs on "testhost".
* "testhost" is defined in /etc/hosts. A script switches the IP address of testhost between 127.0.0.1 and the real IP address of the box.
* The hostname is set to testhost.
* Neither nscd nor avahi is running.

Procedure:
* Start HDFS.
* Put files.
* Call a script to perform FS operations using fs shell.
* Kill NN. => Operations block and RPC.client retries. This is the connection-refused case, so the clients give up quickly. The timeout case (node or switch shutdown) lasts about 15 minutes before an exception is returned.
* Before the clients give up, switch the IP address and start the namenode.

Result:
* The outstanding calls (in invoke()) get InvalidRPCAddressException.
* The clients that were blocked at the RPC initialization unblock and work.
* The clients that retry on exception recover on their own.

> RPC client should deal with the IP address changes
> --------------------------------------------------
>
>                 Key: HADOOP-7472
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7472
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>    Affects Versions: 0.20.205.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Minor
>             Fix For: 0.20.205.0
>
>         Attachments: addr_change_dfs-1.patch.txt, addr_change_dfs.patch.txt
>
>
> The current RPC client implementation and the client-side callers assume that
> the hostname-address mappings of servers never change. The resolved address
> is stored in an immutable InetSocketAddress object above/outside RPC, and the
> reconnect logic in the RPC Connection implementation also trusts the resolved
> address that was passed down.
> If the NN suffers a failure that requires migration, it may be started on a
> different node with a different IP address.
> In this case, even if the
> name-address mapping is updated in DNS, the cluster is stuck trying the old
> address until the whole cluster is restarted.
> The RPC client side should detect this situation and exit or try to recover.
> Updating ConnectionId within the Client implementation may get the system
> working for the moment, but there is always a risk of the cached address:port
> becoming connectable again unintentionally. The real solution will be
> notifying the upper layer of the address change so that it can re-resolve and
> retry, or re-architecting the system as discussed in HDFS-34.
> For the 0.20 lines, some type of compromise may be acceptable. For example,
> raise a custom exception for some well-defined high-impact upper layer to do
> re-resolve/retry, while others will have to restart. For TRUNK, the HA work
> will most likely determine what needs to be done. So this Jira won't cover
> the solutions for TRUNK.
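
The detection step the description asks for (noticing that a cached InetSocketAddress no longer matches what the hostname resolves to, and telling the upper layer about it) could be sketched in Java roughly as below. This is only an illustration under stated assumptions, not the code in the attached patches: AddressChangeCheck, Resolver and AddressChangedException are hypothetical names, and the resolver is passed in only so that the check can be exercised without touching /etc/hosts or DNS.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

final class AddressChangeCheck {

    /** Hypothetical lookup hook; real code could pass InetAddress::getByName. */
    public interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    /** Hypothetical exception that carries the freshly resolved address upward. */
    public static final class AddressChangedException extends Exception {
        private final InetSocketAddress fresh;

        public AddressChangedException(InetSocketAddress cached,
                                       InetSocketAddress fresh) {
            super(cached.getHostString() + " moved from " + cached.getAddress()
                    + " to " + fresh.getAddress());
            this.fresh = fresh;
        }

        public InetSocketAddress getFreshAddress() {
            return fresh;
        }
    }

    private AddressChangeCheck() {}

    /**
     * Re-resolves the host name behind the cached address and throws if the
     * result differs from the address that was cached.
     */
    public static void check(InetSocketAddress cached, Resolver resolver)
            throws UnknownHostException, AddressChangedException {
        // getHostString() returns the stored name without a reverse lookup.
        InetAddress current = resolver.resolve(cached.getHostString());
        // InetAddress.equals compares the IP bytes only, not the host name.
        if (!current.equals(cached.getAddress())) {
            throw new AddressChangedException(
                    cached, new InetSocketAddress(current, cached.getPort()));
        }
    }
}
```

A caller that knows how to recover could catch the exception, switch to getFreshAddress() and retry, while other callers would simply let it propagate. Note that the JVM caches successful lookups, so a check like this only sees a change once that cache has expired.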