[ https://issues.apache.org/jira/browse/HDFS-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611039#comment-16611039 ]
Chao Sun commented on HDFS-13749:
---------------------------------

The test fails because in {{testMultiObserver}} we shut down an observer and then restart it, and we expect RPCs to go to the observer once it is restarted. Interestingly, after the observer is restarted, the {{getServiceStatus}} call fails with an EOF exception. I tried wrapping the proxy with a {{RetryPolicy}}, as follows:

{code}
public static HAServiceProtocol createNonHAProxyWithHAServiceProtocol(
    InetSocketAddress address, Configuration conf) throws IOException {
  // Up to 5 retries with exponential backoff, base sleep time 200 ms.
  RetryPolicy timeoutPolicy =
      RetryPolicies.exponentialBackoffRetry(5, 200, TimeUnit.MILLISECONDS);
  HAServiceProtocol proxy = new HAServiceProtocolClientSideTranslatorPB(
      address, conf, NetUtils.getDefaultSocketFactory(conf), 30000);
  // No per-method overrides, so timeoutPolicy applies to every method.
  Map<String, RetryPolicy> methodNameToPolicyMap = new HashMap<>();
  return (HAServiceProtocol) RetryProxy.create(
      HAServiceProtocol.class,
      new DefaultFailoverProxyProvider<>(HAServiceProtocol.class, proxy),
      methodNameToPolicyMap,
      timeoutPolicy);
}
{code}

but it still failed after multiple retries, with a connection refused exception. However, if I add a simple loop in {{refreshCachedState}}, it always succeeds on the second try:

{code}
public void refreshCachedState() {
  // Try up to three times before leaving the cached state as STANDBY.
  for (int i = 0; i < 3; i++) {
    try {
      cachedState = serviceProxy.getServiceStatus().getState();
      LOG.info("Successfully set cache state to {}", cachedState.name());
      return;
    } catch (IOException e) {
      LOG.warn("Failed to connect to {}. Setting cached state to Standby",
          address, e);
      cachedState = HAServiceState.STANDBY;
    }
  }
}
{code}

> Use getServiceStatus to discover observer namenodes
> ---------------------------------------------------
>
>                 Key: HDFS-13749
>                 URL: https://issues.apache.org/jira/browse/HDFS-13749
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Chao Sun
>            Assignee: Chao Sun
>            Priority: Major
>         Attachments: HDFS-13749-HDFS-12943.000.patch,
> HDFS-13749-HDFS-12943.001.patch, HDFS-13749-HDFS-12943.002.patch
>
>
> In HDFS-12976 currently we discover NameNode state by calling
> {{reportBadBlocks}} as a temporary solution. Here, we'll properly implement
> this by using {{HAServiceProtocol#getServiceStatus}}.
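
The bounded retry used in {{refreshCachedState}} in the comment above can be reproduced outside Hadoop. The following is only a minimal sketch, not the patch itself: the class {{RefreshRetrySketch}}, the {{StatusSource}} interface, the local {{State}} enum and the pause between attempts are all hypothetical names and choices made for this illustration. The fake source throws an EOF exception on its first call, which mimics the behaviour described above, where the first {{getServiceStatus}} call after a restart fails and the second succeeds.

```java
import java.io.EOFException;
import java.io.IOException;

public class RefreshRetrySketch {
  /** Hypothetical stand-in for HAServiceState; not the Hadoop enum. */
  enum State { ACTIVE, STANDBY, OBSERVER }

  /** Hypothetical stand-in for the getServiceStatus().getState() call. */
  interface StatusSource {
    State getState() throws IOException;
  }

  /**
   * Calls the source up to maxAttempts times, pausing between failed
   * attempts. Falls back to STANDBY if every attempt throws IOException.
   * The pause is an assumption of this sketch; the loop in the comment
   * retries immediately.
   */
  static State refreshWithRetry(StatusSource source, int maxAttempts,
      long pauseMillis) throws InterruptedException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return source.getState();
      } catch (IOException e) {
        if (attempt < maxAttempts) {
          Thread.sleep(pauseMillis);
        }
      }
    }
    return State.STANDBY;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Fails on the first call only, like a connection that went stale
    // while the observer was restarting.
    StatusSource flaky = () -> {
      if (calls[0]++ == 0) {
        throw new EOFException("simulated stale connection");
      }
      return State.OBSERVER;
    };
    State state = refreshWithRetry(flaky, 3, 10L);
    System.out.println(state + " after " + calls[0] + " calls");
  }
}
```

In this sketch the fallback to STANDBY happens only after the last attempt fails, so a caller never sees a transient STANDBY between attempts. Whether that matters for the real proxy provider is not something the comment settles.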