[ https://issues.apache.org/jira/browse/HADOOP-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070607#comment-13070607 ]

Kihwal Lee commented on HADOOP-7472:
------------------------------------

Description of the manual testing performed.

Set up:
* The namenode runs on "testhost".
* "testhost" is defined in /etc/hosts. A script switches the IP address of 
testhost between 127.0.0.1 and the real IP address of the box (a check for the 
current mapping is sketched after this list).
* The hostname is set to testhost.
* Neither nscd nor avahi is running.
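
A minimal sketch (plain JDK, not part of the patch) of how the current mapping 
for "testhost" can be checked between switches; run it in a fresh JVM so the 
result is not affected by JVM-level DNS caching:

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper used only for the manual test: prints the address that
// "testhost" currently resolves to, so the /etc/hosts switch can be verified
// before and after running the script.
public class ResolveCheck {
  public static void main(String[] args) throws UnknownHostException {
    InetAddress addr = InetAddress.getByName("testhost");
    System.out.println("testhost currently resolves to " + addr.getHostAddress());
  }
}
{code}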

Procedure:
* Start HDFS.
* Put files.
* Call a script to perform FS operations using the fs shell.
* Kill the NN. => Operations block and the RPC client retries. This is the 
connection-refused case, so the clients give up quickly. The timeout case 
(node or switch shutdown) lasts about 15 minutes before an exception is 
returned.
* Before the clients give up, switch the IP address and start the namenode.

Result:
* The outstanding calls (in invoke()) get InvalidRPCAddressException.
* The clients that were blocked in RPC initialization unblock and work.
* The clients that retry on exception will recover on their own (a handling 
sketch follows below).
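
A rough sketch of what that caller-side recovery could look like. The exception 
name is taken from the test notes above, but the stand-in definition, the 
RpcCall interface, and the retry policy here are illustrative assumptions, not 
the API from the attached patch:

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;

// Stand-in for the exception introduced by the patch; defined here only so
// the sketch is self-contained.
class InvalidRPCAddressException extends IOException {
  InvalidRPCAddressException(String msg) { super(msg); }
}

public class ReResolveRetry {
  // Hypothetical RPC invocation; in Hadoop this would be a call on an RPC
  // proxy bound to the resolved namenode address.
  interface RpcCall<T> {
    T invoke(InetSocketAddress nnAddr) throws IOException;
  }

  static <T> T callWithReResolve(String nnHost, int nnPort, RpcCall<T> call)
      throws IOException {
    // new InetSocketAddress(host, port) resolves the hostname at construction,
    // so building a fresh one picks up an updated DNS or /etc/hosts mapping.
    InetSocketAddress addr = new InetSocketAddress(nnHost, nnPort);
    final int maxAttempts = 3;  // arbitrary for the sketch
    for (int attempt = 1; ; attempt++) {
      try {
        return call.invoke(addr);
      } catch (InvalidRPCAddressException e) {
        if (attempt >= maxAttempts) {
          throw e;  // give up; a less critical caller would simply restart
        }
        addr = new InetSocketAddress(nnHost, nnPort);  // re-resolve and retry
      }
    }
  }
}
{code}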


> RPC client should deal with the IP address changes
> --------------------------------------------------
>
>                 Key: HADOOP-7472
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7472
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>    Affects Versions: 0.20.205.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Minor
>             Fix For: 0.20.205.0
>
>         Attachments: addr_change_dfs-1.patch.txt, addr_change_dfs.patch.txt
>
>
> The current RPC client implementation and the client-side callers assume that 
> the hostname-address mappings of servers never change. The resolved address 
> is stored in an immutable InetSocketAddress object above/outside RPC, and the 
> reconnect logic in the RPC Connection implementation also trusts the resolved 
> address that was passed down.
> If the NN suffers a failure that requires migration, it may be started on a 
> different node with a different IP address. In this case, even if the 
> name-address mapping is updated in DNS, the cluster is stuck trying the old 
> address until the whole cluster is restarted.
> The RPC client-side should detect this situation and exit or try to recover.
> Updating ConnectionId within the Client implementation may get the system 
> working for the moment, but there is always a risk of the cached address:port 
> becoming connectable again unintentionally. The real solution is to notify the 
> upper layers of the address change so that they can re-resolve and retry, or to 
> re-architect the system as discussed in HDFS-34. 
> For the 0.20 lines, some type of compromise may be acceptable. For example, 
> raise a custom exception that lets some well-defined, high-impact upper layers 
> re-resolve and retry, while others will have to restart. For TRUNK, the HA 
> work will most likely determine what needs to be done, so this Jira won't 
> cover the solutions for TRUNK.
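
As a rough illustration of the cached-address behavior described above (plain 
JDK classes only, not Hadoop's actual Client/Connection code; port 8020 is just 
an example namenode RPC port):

{code:java}
import java.net.InetSocketAddress;

// An InetSocketAddress resolves its hostname once, at construction, and the
// resolved IP is then fixed for the life of the object.
public class StaleAddressDemo {
  public static void main(String[] args) {
    InetSocketAddress cached = new InetSocketAddress("testhost", 8020);
    System.out.println("cached: " + cached.getAddress());

    // ... suppose the namenode migrates and DNS or /etc/hosts is updated here ...

    // The cached object still holds the old IP; only a newly constructed
    // address reflects the new mapping. Comparing the two is one way a client
    // could detect that the server has moved and needs to be re-resolved.
    InetSocketAddress fresh =
        new InetSocketAddress(cached.getHostName(), cached.getPort());
    if (fresh.getAddress() != null
        && !fresh.getAddress().equals(cached.getAddress())) {
      System.out.println("address changed, new: " + fresh.getAddress());
    }
  }
}
{code}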

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
