[ https://issues.apache.org/jira/browse/HADOOP-14312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yongjun Zhang updated HADOOP-14312:
-----------------------------------
    Status: Patch Available  (was: Open)

> RetryInvocationHandler may report ANN as SNN in messages.
> ----------------------------------------------------------
>
>                 Key: HADOOP-14312
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14312
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HADOOP-14312.001.patch
>
>
> When multiple threads use the same DFSClient to make RPC calls, they may
> report an incorrect NN host name in messages like
> INFO [pool-3-thread-13] retry.RetryInvocationHandler
> (RetryInvocationHandler.java:invoke(148)) - Exception while invoking delete
> of class ClientNamenodeProtocolTranslatorPB over
> hdpb-nn0001.prn.parsec.apple.com/*a.b.c.d*:8020. Trying to fail over
> immediately.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category WRITE is not supported in state standby. Visit
> https://s.apache.org/sbnn-error
> where *a.b.c.d* is the address of the active NN (ANN) rather than the
> standby NN (SNN). This misleads users into thinking that failover is not
> behaving correctly.
> The reason is that the ProxyDescriptor data field of RetryInvocationHandler
> may be shared by multiple threads making RPC calls, so a failover performed
> by one thread may already be visible to the other threads by the time they
> report the above kind of message.
> As an example,
> # multiple threads start with the same SNN to do RPC calls,
> # all threads discover that a failover is needed,
> # thread X fails over first and changes the ProxyDescriptor's proxyInfo to
> the ANN,
> # the other threads then build the above message from the proxyInfo changed
> by thread X, and so report the ANN instead of the SNN in the message.
> Some details:
> RetryInvocationHandler does the following when failing over, which changes
> the proxyInfo in the ProxyDescriptor:
> {code}
>     synchronized void failover(long expectedFailoverCount, Method method,
>         int callId) {
>       // Make sure that concurrent failed invocations only cause a single
>       // actual failover.
>       if (failoverCount == expectedFailoverCount) {
>         fpp.performFailover(proxyInfo.proxy);
>         failoverCount++;
>       } else {
>         LOG.warn("A failover has occurred since the start of call #" + callId
>             + " " + proxyInfo.getString(method.getName()));
>       }
>       proxyInfo = fpp.getProxy();
>     }
> {code}
> Meanwhile, the log method below builds its message from the ProxyDescriptor's
> current proxyInfo:
> {code}
>   private void log(final Method method, final boolean isFailover,
>       final int failovers, final long delay, final Exception ex) {
>     ......
>     final StringBuilder b = new StringBuilder()
>         .append(ex + ", while invoking ")
>         .append(proxyDescriptor.getProxyInfo().getString(method.getName()));
>     if (failovers > 0) {
>       b.append(" after ").append(failovers).append(" failover attempts");
>     }
>     b.append(isFailover? ". Trying to failover ": ". Retrying ");
>     b.append(delay > 0? "after sleeping for " + delay + "ms.": "immediately.");
> {code}
> and the {{handleException}} method does the same:
> {code}
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("Exception while invoking call #" + callId + " "
>           + proxyDescriptor.getProxyInfo().getString(method.getName())
>           + ". Not retrying because " + retryInfo.action.reason, e);
>     }
> {code}
> The strings themselves are produced in FailoverProxyProvider:
> {code}
>     public String getString(String methodName) {
>       return proxy.getClass().getSimpleName() + "." + methodName
>           + " over " + proxyInfo;
>     }
>     @Override
>     public String toString() {
>       return proxy.getClass().getSimpleName() + " over " + proxyInfo;
>     }
> {code}
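
The race described above can be reproduced in a few lines. The following is only a minimal sketch under stated assumptions, not the attached patch and not Hadoop code: `SharedDescriptor`, `logFromShared`, `logFromSnapshot`, and the host names are hypothetical stand-ins, and the two threads are simulated sequentially so the outcome is deterministic. It shows that a message built from the shared field at log time names the ANN, while a message built from the proxy info captured at the start of the call names the SNN that actually rejected it.

```java
public class ProxySnapshotSketch {

    /** Hypothetical, simplified stand-in for the shared ProxyDescriptor. */
    static final class SharedDescriptor {
        private String proxyInfo;      // description of the current target NN
        private long failoverCount;

        SharedDescriptor(String initialProxyInfo) {
            this.proxyInfo = initialProxyInfo;
        }

        synchronized String getProxyInfo() {
            return proxyInfo;
        }

        synchronized long getFailoverCount() {
            return failoverCount;
        }

        /** Only the first caller with a matching count switches the target. */
        synchronized void failover(long expectedFailoverCount, String nextProxyInfo) {
            if (failoverCount == expectedFailoverCount) {
                failoverCount++;
                proxyInfo = nextProxyInfo;
            }
        }
    }

    /** Builds the message from the shared field at log time (the racy pattern). */
    static String logFromShared(SharedDescriptor d, String method) {
        return "Exception while invoking " + method + " over " + d.getProxyInfo();
    }

    /** Builds the message from the proxy info captured when the call started. */
    static String logFromSnapshot(String usedProxyInfo, String method) {
        return "Exception while invoking " + method + " over " + usedProxyInfo;
    }

    public static void main(String[] args) {
        SharedDescriptor d = new SharedDescriptor("snn-host:8020");

        // Thread Y starts its call against the SNN and remembers what it used.
        String usedByY = d.getProxyInfo();
        long expected = d.getFailoverCount();

        // Thread X hits the StandbyException first and fails over to the ANN.
        d.failover(expected, "ann-host:8020");

        // Thread Y now reports its own StandbyException.
        System.out.println(logFromShared(d, "delete"));         // names ann-host:8020
        System.out.println(logFromSnapshot(usedByY, "delete")); // names snn-host:8020
    }
}
```

Capturing the value once per call keeps each message tied to the NN the call was actually sent to, regardless of what other threads have done to the shared descriptor in the meantime.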