[ 
https://issues.apache.org/jira/browse/HDFS-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611039#comment-16611039
 ] 

Chao Sun commented on HDFS-13749:
---------------------------------

The test fails because {{testMultiObserver}} shuts down an observer and then 
restarts it, and expects RPCs to go to that observer once it is 
restarted.

However, after the observer is restarted, the {{getServiceStatus}} call fails 
with an EOF exception. I tried wrapping the proxy with a RetryPolicy like the 
following:
{code}
  public static HAServiceProtocol createNonHAProxyWithHAServiceProtocol(
      InetSocketAddress address, Configuration conf) throws IOException {
    RetryPolicy timeoutPolicy = RetryPolicies.exponentialBackoffRetry(5, 200,
        TimeUnit.MILLISECONDS);

    HAServiceProtocol proxy =
        new HAServiceProtocolClientSideTranslatorPB(
            address, conf, NetUtils.getDefaultSocketFactory(conf),
            30000);
    Map<String,RetryPolicy> methodNameToPolicyMap = new HashMap<>();
    return (HAServiceProtocol) RetryProxy.create(
        HAServiceProtocol.class,
        new DefaultFailoverProxyProvider<>(HAServiceProtocol.class, proxy),
        methodNameToPolicyMap,
        timeoutPolicy
    );
  }
{code}

but it still failed after multiple retries, with a connection refused exception.

However, if I add a simple loop in {{refreshCachedState}}, then it always 
succeeds on the second try:
{code}
    public void refreshCachedState() {
      for (int i = 0; i < 3; i++) {
        try {
          cachedState = serviceProxy.getServiceStatus().getState();
          LOG.info("Successfully set cache state to " + cachedState.name());
          return;
        } catch (IOException e) {
          LOG.warn("Failed to connect to {}. Setting cached state to Standby",
              address, e);
          cachedState = HAServiceState.STANDBY;
        }
      }
    }
{code}
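
For anyone who wants to play with the retry behavior outside a mini cluster, here is a minimal, self-contained sketch of the same loop. It is not the patch code: {{RefreshRetrySketch}}, {{State}}, {{StateSource}} and the {{sleepMillis}} pause are hypothetical stand-ins I made up for illustration, and the "observer" is simulated by a lambda that throws an EOF exception on the first call only. Unlike the loop above, it falls back to standby only after every attempt has failed.

```java
import java.io.EOFException;
import java.io.IOException;

public class RefreshRetrySketch {
  /** Hypothetical stand-in for the HA state returned by getServiceStatus(). */
  enum State { ACTIVE, STANDBY, OBSERVER }

  /** Hypothetical stand-in for serviceProxy.getServiceStatus().getState(). */
  interface StateSource {
    State fetch() throws IOException;
  }

  /**
   * Tries the call up to maxAttempts times, pausing between failed attempts.
   * Returns STANDBY only if every attempt failed.
   */
  static State refresh(StateSource source, int maxAttempts, long sleepMillis) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return source.fetch();
      } catch (IOException e) {
        System.err.println("Attempt " + attempt + " failed: " + e);
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(sleepMillis);
        } catch (InterruptedException ie) {
          // Preserve the interrupt flag and stop retrying.
          Thread.currentThread().interrupt();
          break;
        }
      }
    }
    return State.STANDBY;
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulated behavior only: first call hits EOF, later calls succeed.
    StateSource flaky = () -> {
      if (calls[0]++ == 0) {
        throw new EOFException("simulated EOF");
      }
      return State.OBSERVER;
    };
    System.out.println(refresh(flaky, 3, 10));
  }
}
```

This only models the "fails once, then succeeds" pattern seen in the test. It says nothing about why the first call after the restart hits EOF.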

> Use getServiceStatus to discover observer namenodes
> ---------------------------------------------------
>
>                 Key: HDFS-13749
>                 URL: https://issues.apache.org/jira/browse/HDFS-13749
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>         Attachments: HDFS-13749-HDFS-12943.000.patch, 
> HDFS-13749-HDFS-12943.001.patch, HDFS-13749-HDFS-12943.002.patch
>
>
> In HDFS-12976 currently we discover NameNode state by calling 
> {{reportBadBlocks}} as a temporary solution. Here, we'll properly implement 
> this by using {{HAServiceProtocol#getServiceStatus}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
