[ 
https://issues.apache.org/jira/browse/HDFS-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387816#comment-14387816
 ] 

Jing Zhao commented on HDFS-7858:
---------------------------------

Thanks for working on this, [~asuresh].

One concern is where the new logic should live. The current patch appears to 
wrap things in the following way: 

{{RequestHedgingInvocationHandler}} --> proxy returned by 
{{RequestHedgingProxyProvider#getProxy}} --> {{RetryInvocationHandler}}

I'm not sure this is the best way to go. {{RetryInvocationHandler}} has its own 
logic for retry and failover, which is usually driven by the type of the 
exception thrown by the invocation. With the new design, the exception caught 
by {{RetryInvocationHandler}} is derived from the exceptions thrown by all the 
targets inside {{RequestHedgingInvocationHandler}}. Since different targets 
may throw different exceptions, it looks like we cannot guarantee that 
{{RetryInvocationHandler}} ends up with the exception from the correct target.
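
To make the concern concrete, here is a minimal standalone sketch. It does not 
use the real Hadoop classes: {{HedgedExceptionDemo}}, {{StandbyLikeException}}, 
{{outerDecision}} and {{firstFailure}} are hypothetical names, and it assumes 
the hedging layer surfaces whichever failure arrives first (the actual patch may 
pick differently). The same two target failures lead the outer policy to a 
different decision depending only on arrival order.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class HedgedExceptionDemo {
    /** Hypothetical stand-in for a "this NameNode is standby" error. */
    static class StandbyLikeException extends IOException {
        StandbyLikeException(String msg) { super(msg); }
    }

    /** Simplified outer retry policy: decides only on the exception type it sees. */
    static String outerDecision(Throwable t) {
        if (t instanceof StandbyLikeException) return "FAILOVER";
        if (t instanceof ConnectException) return "RETRY_SAME_TARGET";
        return "FAIL";
    }

    /** Hedged call: returns the first failure to arrive, or null if the first completion succeeded. */
    static Throwable firstFailure(List<Callable<String>> targets) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(targets.size());
        try {
            CompletionService<String> cs = new ExecutorCompletionService<>(pool);
            for (Callable<String> t : targets) {
                cs.submit(t);
            }
            try {
                cs.take().get();      // first target to finish, in any state
                return null;
            } catch (ExecutionException e) {
                return e.getCause();  // only this one exception reaches the outer layer
            }
        } finally {
            pool.shutdownNow();       // abandon the slower targets
        }
    }

    /** A target that fails with the given exception after delayMs milliseconds. */
    static Callable<String> failing(long delayMs, Exception e) {
        return () -> { Thread.sleep(delayMs); throw e; };
    }

    public static void main(String[] args) throws InterruptedException {
        Exception standby = new StandbyLikeException("nn1 is standby");
        Exception refused = new ConnectException("nn2 refused connection");
        // Same two failures, different arrival order.
        System.out.println(outerDecision(
            firstFailure(Arrays.asList(failing(0, standby), failing(300, refused)))));
        System.out.println(outerDecision(
            firstFailure(Arrays.asList(failing(300, standby), failing(0, refused)))));
    }
}
```

The outer policy never learns which target produced the exception it was 
handed, so it cannot tell whether a failover or a plain retry is appropriate.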

How about providing {{RequestHedgingInvocationHandler}} as a replacement for 
{{RetryInvocationHandler}}? We would need to add the retry logic to 
{{RequestHedgingInvocationHandler}}, but the layering as a whole may be cleaner.
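
A rough sketch of what such a single layer could look like, built only on 
{{java.lang.reflect.Proxy}} and {{java.util.concurrent}}. This is not the patch 
and not the Hadoop API: {{EchoProtocol}}, {{AllTargetsFailedException}} and 
{{HedgingRetryHandler}} are hypothetical names, and the retry policy here is 
simply "re-hedge up to maxAttempts times, no backoff". The point is that the 
handler that owns the retry decision also sees the failure of each target by 
name.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical protocol used only for this sketch. */
interface EchoProtocol {
    String echo(String msg) throws Exception;
}

/** Thrown when every target failed on the last attempt; keeps one failure per target. */
class AllTargetsFailedException extends Exception {
    final Map<String, Throwable> failures;
    AllTargetsFailedException(Map<String, Throwable> failures) {
        super("all targets failed: " + failures.keySet());
        this.failures = failures;
    }
}

/** Hedges each call across all targets and retries the whole round itself. */
class HedgingRetryHandler implements InvocationHandler {
    private final Map<String, Object> targets;  // target name -> underlying proxy
    private final int maxAttempts;

    HedgingRetryHandler(Map<String, ?> targets, int maxAttempts) {
        this.targets = new LinkedHashMap<String, Object>(targets);
        this.maxAttempts = maxAttempts;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        Map<String, Throwable> failures = new LinkedHashMap<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            failures = new LinkedHashMap<>();
            ExecutorService pool = Executors.newFixedThreadPool(targets.size());
            try {
                CompletionService<Object> cs = new ExecutorCompletionService<>(pool);
                Map<Future<Object>, String> owner = new IdentityHashMap<>();
                for (Map.Entry<String, Object> e : targets.entrySet()) {
                    Object target = e.getValue();
                    owner.put(cs.submit(() -> method.invoke(target, args)), e.getKey());
                }
                for (int i = 0; i < targets.size(); i++) {
                    Future<Object> done = cs.take();
                    try {
                        return done.get();  // first successful answer wins
                    } catch (ExecutionException ee) {
                        Throwable cause = ee.getCause();
                        if (cause instanceof InvocationTargetException) {
                            cause = cause.getCause();  // unwrap the reflective wrapper
                        }
                        failures.put(owner.get(done), cause);  // remember who failed and how
                    }
                }
            } finally {
                pool.shutdownNow();
            }
            // A real policy would inspect 'failures' per target here before retrying.
        }
        throw new AllTargetsFailedException(failures);
    }

    static <T> T create(Class<T> iface, Map<String, T> targets, int maxAttempts) {
        return iface.cast(Proxy.newProxyInstance(iface.getClassLoader(),
            new Class<?>[] {iface}, new HedgingRetryHandler(targets, maxAttempts)));
    }

    public static void main(String[] args) throws Exception {
        Map<String, EchoProtocol> nns = new LinkedHashMap<>();
        nns.put("nn1", msg -> { throw new java.io.IOException("nn1 is standby"); });
        nns.put("nn2", msg -> "nn2 answered: " + msg);
        EchoProtocol client = create(EchoProtocol.class, nns, 3);
        System.out.println(client.echo("ping"));  // prints "nn2 answered: ping"
    }
}
```

With one layer, the retry/failover decision can be made per target instead of 
on a single merged exception.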

> Improve HA Namenode Failover detection on the client
> ----------------------------------------------------
>
>                 Key: HDFS-7858
>                 URL: https://issues.apache.org/jira/browse/HDFS-7858
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: HDFS-7858.1.patch, HDFS-7858.2.patch, HDFS-7858.2.patch, 
> HDFS-7858.3.patch
>
>
> In an HA deployment, clients are configured with the hostnames of both the 
> Active and Standby Namenodes. Clients will first try one of the NNs 
> (non-deterministically) and, if it is a standby NN, it will respond to the 
> client to retry the request on the other Namenode.
> If the client happens to talk to the Standby first, and the standby is 
> undergoing some GC / is busy, then those clients might not get a response 
> soon enough to try the other NN.
> Proposed approach to solve this:
> 1) Since Zookeeper is already used as the failover controller, the clients 
> could talk to ZK and find out which is the active namenode before contacting 
> it.
> 2) Long-lived DFSClients would have a ZK watch configured which fires when 
> there is a failover, so they do not have to query ZK every time to find out 
> the active NN.
> 3) Clients can also cache the last active NN in the user's home directory 
> (~/.lastNN) so that short-lived clients can try that Namenode first before 
> querying ZK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
