[ 
https://issues.apache.org/jira/browse/HADOOP-14312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HADOOP-14312:
-----------------------------------
    Status: Patch Available  (was: Open)

> RetryInvocationHandler  may report ANN as SNN in messages.
> ----------------------------------------------------------
>
>                 Key: HADOOP-14312
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14312
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HADOOP-14312.001.patch
>
>
> When multiple threads use the same DFSClient to make RPC calls, they may 
> report incorrect NN host name in messages like
>  INFO [pool-3-thread-13] retry.RetryInvocationHandler 
> (RetryInvocationHandler.java:invoke(148)) - Exception while invoking delete 
> of class ClientNamenodeProtocolTranslatorPB over 
> hdpb-nn0001.prn.parsec.apple.com/*a.b.c.d*:8020. Trying to fail over 
> immediately.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category WRITE is not supported in state standby. Visit 
> https://s.apache.org/sbnn-error
> where *a.b.c.d* is the active NN, which confuses user to think failover is 
> not behaving correctly.
> The reason is that the ProxyDescriptor data field of RetryInvocationHandler 
> may be shared by multiple threads that do the RPC calls, the failover done by 
> one thread may be visible to other threads when reporting the above kind of 
> message. 
> As an example, 
> # multiple threads start with the same SNN to do RPC calls, 
> # all threads discover that a failover is needed, 
> # thread X failover first, and changed the ProxyDescriptor's proxyInfo to ANN
> # other threads reports the above message with the proxyInfo changed by 
> thread X, and reported ANN instead of SNN in the message.
> Some details:
> RetryInvocationHandler does the following when failing over:
> {code}
>   synchronized void failover(long expectedFailoverCount, Method method,
>                                int callId) {
>       // Make sure that concurrent failed invocations only cause a single
>       // actual failover.
>       if (failoverCount == expectedFailoverCount) {
>         fpp.performFailover(proxyInfo.proxy);
>         failoverCount++;
>       } else {
>         LOG.warn("A failover has occurred since the start of call #" + callId
>             + " " + proxyInfo.getString(method.getName()));
>       }
>       proxyInfo = fpp.getProxy();
>     }
> {code}
> and changed the proxyInfo in the ProxyDescriptor.
> While the log method below report message with ProxyDescriotor's proxyinfo:
> {code}
> private void log(final Method method, final boolean isFailover,
>       final int failovers, final long delay, final Exception ex) {
> ......
>    final StringBuilder b = new StringBuilder()
>         .append(ex + ", while invoking ")
>         .append(proxyDescriptor.getProxyInfo().getString(method.getName()));
>     if (failovers > 0) {
>       b.append(" after ").append(failovers).append(" failover attempts");
>     }
>     b.append(isFailover? ". Trying to failover ": ". Retrying ");
>     b.append(delay > 0? "after sleeping for " + delay + "ms.": 
> "immediately.");
> {code}
> and so does  {{handleException}} method do
> {code}
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Exception while invoking call #" + callId + " "
>               + proxyDescriptor.getProxyInfo().getString(method.getName())
>               + ". Not retrying because " + retryInfo.action.reason, e);
>         }
> {code}
> FailoverProxyProvider
> {code}
>    public String getString(String methodName) {
>       return proxy.getClass().getSimpleName() + "." + methodName
>           + " over " + proxyInfo;
>     }
>     @Override
>     public String toString() {
>       return proxy.getClass().getSimpleName() + " over " + proxyInfo;
>     }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to