[ 
https://issues.apache.org/jira/browse/HDFS-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjun Zhang updated HDFS-11656:
---------------------------------
    Description: 
When multiple threads use the same DFSClient to make RPC calls, they may report 
incorrect NN host name in messages like

 INFO [pool-3-thread-13] retry.RetryInvocationHandler 
(RetryInvocationHandler.java:invoke(148)) - Exception while invoking delete of 
class ClientNamenodeProtocolTranslatorPB over 
hdpb-nn0001.prn.parsec.apple.com/*a.b.c.d*:8020. Trying to fail over 
immediately.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category WRITE is not supported in state standby. Visit 
https://s.apache.org/sbnn-error

where *a.b.c.d* is the active NN, which confuses user to think failover is not 
behaving correctly.

The reason is that the ProxyDescriptor data field of RetryInvocationHandler may 
be shared by multiple threads that do the RPC calls, the failover done by one 
thread may be visible to other threads when reporting the above kind of 
message. 

As an example, 
# multiple threads start with the same SNN to do RPC calls, 
# all threads discover that a failover is needed, 
# thread X failover first, and changed the ProxyDescriptor's proxyInfo to ANN
# other threads reports the above message with the proxyInfo changed by thread 
X, and reported ANN instead of SNN in the message.

Some details:

RetryInvocationHandler does the following when failing over:
{code}
  synchronized void failover(long expectedFailoverCount, Method method,
                               int callId) {
      // Make sure that concurrent failed invocations only cause a single
      // actual failover.
      if (failoverCount == expectedFailoverCount) {
        fpp.performFailover(proxyInfo.proxy);
        failoverCount++;
      } else {
        LOG.warn("A failover has occurred since the start of call #" + callId
            + " " + proxyInfo.getString(method.getName()));
      }
      proxyInfo = fpp.getProxy();
    }
{code}
and changed the proxyInfo in the ProxyDescriptor.

While the log method below report message with ProxyDescriotor's proxyinfo:
{code}
private void log(final Method method, final boolean isFailover,
      final int failovers, final long delay, final Exception ex) {
......
   final StringBuilder b = new StringBuilder()
        .append(ex + ", while invoking ")
        .append(proxyDescriptor.getProxyInfo().getString(method.getName()));
    if (failovers > 0) {
      b.append(" after ").append(failovers).append(" failover attempts");
    }
    b.append(isFailover? ". Trying to failover ": ". Retrying ");
    b.append(delay > 0? "after sleeping for " + delay + "ms.": "immediately.");
{code}
and so does  {{handleException}} method do
{code}
        if (LOG.isDebugEnabled()) {
          LOG.debug("Exception while invoking call #" + callId + " "
              + proxyDescriptor.getProxyInfo().getString(method.getName())
              + ". Not retrying because " + retryInfo.action.reason, e);
        }
{code}

FailoverProxyProvider
{code}
   public String getString(String methodName) {
      return proxy.getClass().getSimpleName() + "." + methodName
          + " over " + proxyInfo;
    }

    @Override
    public String toString() {
      return proxy.getClass().getSimpleName() + " over " + proxyInfo;
    }
{code}
 


  was:
When multiple threads use the same DFSClient to make RPC calls, they may report 
incorrect NN host name in messages like

 INFO [pool-3-thread-13] retry.RetryInvocationHandler 
(RetryInvocationHandler.java:invoke(148)) - Exception while invoking delete of 
class ClientNamenodeProtocolTranslatorPB over 
hdpb-nn0001.prn.parsec.apple.com/*a.b.c.d*:8020. Trying to fail over 
immediately.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): 
Operation category WRITE is not supported in state standby. Visit 
https://s.apache.org/sbnn-error

where *a.b.c.d* is the active NN, which confuses user to think failover is not 
behaving correctly.

The reason is that the ProxyDescriptor data field of RetryInvocationHandler may 
be shared by multiple threads that do the RPC calls, the failover done by one 
thread may be visible to other threads when reporting the above kind of 
message. 

As an example, 
# multiple threads start with the same SNN to to the call, 
# all threads discover that a failover is needed, 
# thread X failover first, and changed the ProxyDescriptor's proxyInfo to ANN
# other threads reports the above message with the proxyInfo changed by thread 
X, and reported ANN instead of SNN in the message.

Some details:

RetryInvocationHandler does the following when failing over:
{code}
  synchronized void failover(long expectedFailoverCount, Method method,
                               int callId) {
      // Make sure that concurrent failed invocations only cause a single
      // actual failover.
      if (failoverCount == expectedFailoverCount) {
        fpp.performFailover(proxyInfo.proxy);
        failoverCount++;
      } else {
        LOG.warn("A failover has occurred since the start of call #" + callId
            + " " + proxyInfo.getString(method.getName()));
      }
      proxyInfo = fpp.getProxy();
    }
{code}
and changed the proxyInfo in the ProxyDescriptor.

While the log method below report message with ProxyDescriotor's proxyinfo:
{code}
private void log(final Method method, final boolean isFailover,
      final int failovers, final long delay, final Exception ex) {
......
   final StringBuilder b = new StringBuilder()
        .append(ex + ", while invoking ")
        .append(proxyDescriptor.getProxyInfo().getString(method.getName()));
    if (failovers > 0) {
      b.append(" after ").append(failovers).append(" failover attempts");
    }
    b.append(isFailover? ". Trying to failover ": ". Retrying ");
    b.append(delay > 0? "after sleeping for " + delay + "ms.": "immediately.");
{code}
and so does  {{handleException}} method do
{code}
        if (LOG.isDebugEnabled()) {
          LOG.debug("Exception while invoking call #" + callId + " "
              + proxyDescriptor.getProxyInfo().getString(method.getName())
              + ". Not retrying because " + retryInfo.action.reason, e);
        }
{code}

FailoverProxyProvider
{code}
   public String getString(String methodName) {
      return proxy.getClass().getSimpleName() + "." + methodName
          + " over " + proxyInfo;
    }

    @Override
    public String toString() {
      return proxy.getClass().getSimpleName() + " over " + proxyInfo;
    }
{code}
 



> RetryInvocationHandler  may report ANN as SNN in messages.
> ----------------------------------------------------------
>
>                 Key: HDFS-11656
>                 URL: https://issues.apache.org/jira/browse/HDFS-11656
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Yongjun Zhang
>
> When multiple threads use the same DFSClient to make RPC calls, they may 
> report incorrect NN host name in messages like
>  INFO [pool-3-thread-13] retry.RetryInvocationHandler 
> (RetryInvocationHandler.java:invoke(148)) - Exception while invoking delete 
> of class ClientNamenodeProtocolTranslatorPB over 
> hdpb-nn0001.prn.parsec.apple.com/*a.b.c.d*:8020. Trying to fail over 
> immediately.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category WRITE is not supported in state standby. Visit 
> https://s.apache.org/sbnn-error
> where *a.b.c.d* is the active NN, which confuses user to think failover is 
> not behaving correctly.
> The reason is that the ProxyDescriptor data field of RetryInvocationHandler 
> may be shared by multiple threads that do the RPC calls, the failover done by 
> one thread may be visible to other threads when reporting the above kind of 
> message. 
> As an example, 
> # multiple threads start with the same SNN to do RPC calls, 
> # all threads discover that a failover is needed, 
> # thread X failover first, and changed the ProxyDescriptor's proxyInfo to ANN
> # other threads reports the above message with the proxyInfo changed by 
> thread X, and reported ANN instead of SNN in the message.
> Some details:
> RetryInvocationHandler does the following when failing over:
> {code}
>   synchronized void failover(long expectedFailoverCount, Method method,
>                                int callId) {
>       // Make sure that concurrent failed invocations only cause a single
>       // actual failover.
>       if (failoverCount == expectedFailoverCount) {
>         fpp.performFailover(proxyInfo.proxy);
>         failoverCount++;
>       } else {
>         LOG.warn("A failover has occurred since the start of call #" + callId
>             + " " + proxyInfo.getString(method.getName()));
>       }
>       proxyInfo = fpp.getProxy();
>     }
> {code}
> and changed the proxyInfo in the ProxyDescriptor.
> While the log method below report message with ProxyDescriotor's proxyinfo:
> {code}
> private void log(final Method method, final boolean isFailover,
>       final int failovers, final long delay, final Exception ex) {
> ......
>    final StringBuilder b = new StringBuilder()
>         .append(ex + ", while invoking ")
>         .append(proxyDescriptor.getProxyInfo().getString(method.getName()));
>     if (failovers > 0) {
>       b.append(" after ").append(failovers).append(" failover attempts");
>     }
>     b.append(isFailover? ". Trying to failover ": ". Retrying ");
>     b.append(delay > 0? "after sleeping for " + delay + "ms.": 
> "immediately.");
> {code}
> and so does  {{handleException}} method do
> {code}
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Exception while invoking call #" + callId + " "
>               + proxyDescriptor.getProxyInfo().getString(method.getName())
>               + ". Not retrying because " + retryInfo.action.reason, e);
>         }
> {code}
> FailoverProxyProvider
> {code}
>    public String getString(String methodName) {
>       return proxy.getClass().getSimpleName() + "." + methodName
>           + " over " + proxyInfo;
>     }
>     @Override
>     public String toString() {
>       return proxy.getClass().getSimpleName() + " over " + proxyInfo;
>     }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to