[ https://issues.apache.org/jira/browse/HDFS-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786498#comment-17786498 ]
ASF GitHub Bot commented on HDFS-17232:
---------------------------------------

simbadzina commented on code in PR #6208:
URL: https://github.com/apache/hadoop/pull/6208#discussion_r1394785317


##########
hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/MiniRouterDFSCluster.java:
##########
@@ -253,6 +253,19 @@ public FileSystem getFileSystemWithObserverReadProxyProvider() throws IOExceptio
     return DistributedFileSystem.get(observerReadConf);
   }
 
+  public FileSystem getFileSystemWithConfiguredFailoverProxyProvider() throws IOException {
+    conf.set(DFS_NAMESERVICES,
+        conf.get(DFS_NAMESERVICES)+ ",router-service");
+    conf.set(DFS_HA_NAMENODES_KEY_PREFIX + ".router-service", "router1");
+    conf.set(DFS_NAMENODE_RPC_ADDRESS_KEY+ ".router-service.router1",
+        getFileSystemURI().toString());
+    conf.set(HdfsClientConfigKeys.Failover.PROXY_PROVIDER_KEY_PREFIX
+        + "." + "router-service", ConfiguredFailoverProxyProvider.class.getName());
+    DistributedFileSystem.setDefaultUri(conf, "hdfs://router-service");
+
+    return DistributedFileSystem.get(conf);
+  }

Review Comment:
   `getFileSystemWithConfiguredFailoverProxyProvider()` and `getFileSystemWithObserverReadProxyProvider()` share the same configuration apart from the proxy provider. You could move the shared configuration into a separate helper function, so that each method only sets its proxy provider.


> RBF: Fix NoNamenodesAvailableException for a long time, when use observer
> -------------------------------------------------------------------------
>
>                 Key: HDFS-17232
>                 URL: https://issues.apache.org/jira/browse/HDFS-17232
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Jian Zhang
>            Assignee: Jian Zhang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HDFS-17232.001.patch
>
>
> *Describe*
> I fixed the case where NoNamenodesAvailableException persists for a long
> time during failover without observers, but when observers are in use
> there are still several problems:
> # When an observer fails and there is no active namenode at that moment,
> rotating the cache does not help: because observer reads are in use, the
> next request shuffles the observer namenode back to the front of the cache,
> so the retry is still sent to the failed observer.
> # If there are multiple observers, an exception occurs when accessing one of
> them, and there is no active namenode at that moment, a
> NoNamenodesAvailableException is raised and the request is retried.
> However, since observer reads put the observer nodes at the front of the
> cache, the retry may still fail.
> # When there are multiple observers, one of which is unavailable, and there
> is no active namenode at that moment, we should continue with the next
> observer. That way the unavailable observer can be marked as unavailable
> and subsequent requests can avoid it.
> # An illegal operation fails even when it is sent to the active namenode,
> and the failure surfaces as NoNamenodesAvailableException. If the cache is
> rotated at that point, the next legal request is sent to a namenode that
> really is a standby and fails as well, so illegal operations should not
> rotate the cache.
>
> Detailed bug description: HDFS-17166
>
> - *case 1:*
> * router's cache: [ observer-1(problematic), standby-2, standby-3(actually
> active) ]
> * client read -> observer-1 throws NoNamenodesAvailableException ->
> rotate the cache -> [ standby-2, standby-3(actually active),
> observer-1(problematic) ]
> * client retry read -> shuffleObserverNN -> [ observer-1(problematic),
> standby-2, standby-3(actually active) ] -> observer-1 throws
> NoNamenodesAvailableException -> rotate the cache -> [ standby-2,
> standby-3(actually active), observer-1(problematic) ]
> * *.....*
> * client (retries > max.attempts) -> read failed
>
> - *case 2:*
> * router's cache: [ observer-1(problematic), observer-2, standby-3,
> standby-4(actually active) ]
> * client read -> observer-1 throws NoNamenodesAvailableException ->
> rotate the cache -> [ observer-2, standby-3, standby-4(actually active),
> observer-1(problematic) ]
> * client retry read -> shuffleObserverNN -> [ observer-1(problematic),
> observer-2, standby-3, standby-4(actually active) ] (may happen) ->
> observer-1 throws NoNamenodesAvailableException -> rotate the cache -> [
> observer-2, standby-3, standby-4(actually active), observer-1(problematic) ]
> * *.....*
> * client may (retries > max.attempts) -> read failed
>
> - *case 3:*
> * router's cache: [ standby-1, standby-2(actually active) ]
> * client request -> standby-1 throws NoNamenodesAvailableException ->
> rotate the cache -> [ standby-2(actually active), standby-1 ]
> * client retry request -> standby-2(actually active) succeeds
> * client illegal request -> standby-2(actually active) throws
> NoNamenodesAvailableException -> rotate the cache -> [ standby-1,
> standby-2(actually active) ]
> * client legal request -> standby-1 throws NoNamenodesAvailableException ->
> failed
>
> *How to reproduce*
> I have provided unit tests: TestNoNamenodesAvailableLongTime
> You can run my new unit tests against the original code to reproduce the
> above problems.
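
The retry loop in case 1 can be illustrated with a small stand-alone simulation. This is only a sketch under simplifying assumptions, not the router's code: the class `CacheRotationSketch`, the `Node` type, the `canServe` flag and the `observersFirst` helper are hypothetical names invented here, each attempt only contacts the head of the cache, and the observer shuffle is modelled as a deterministic "observers first" reorder rather than a random shuffle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the cache behaviour described in case 1; no Hadoop classes are used.
final class CacheRotationSketch {

    /** Stand-in for one cached namenode entry. */
    static final class Node {
        final String name;
        final boolean observer;  // state as recorded in the cache
        final boolean canServe;  // whether this node would actually answer a read
        Node(String name, boolean observer, boolean canServe) {
            this.name = name;
            this.observer = observer;
            this.canServe = canServe;
        }
    }

    /** Cache from case 1: [ observer-1(problematic), standby-2, standby-3(actually active) ]. */
    static List<Node> case1Cache() {
        List<Node> cache = new ArrayList<>();
        cache.add(new Node("observer-1", true, false));
        cache.add(new Node("standby-2", false, false));
        cache.add(new Node("standby-3", false, true));
        return cache;
    }

    /** "Rotate the cache": move the head entry to the tail. */
    static void rotate(List<Node> cache) {
        if (cache.size() > 1) {
            cache.add(cache.remove(0));
        }
    }

    /** Simplified stand-in for shuffleObserverNN: observers go to the front, order otherwise kept. */
    static void observersFirst(List<Node> cache) {
        List<Node> observers = new ArrayList<>();
        List<Node> others = new ArrayList<>();
        for (Node n : cache) {
            (n.observer ? observers : others).add(n);
        }
        cache.clear();
        cache.addAll(observers);
        cache.addAll(others);
    }

    /** Returns the 1-based attempt on which the read succeeds, or -1 if every attempt fails. */
    static int readWithRetries(List<Node> cache, int maxAttempts, boolean useObserver) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (useObserver) {
                observersFirst(cache);  // undoes the previous rotation
            }
            if (cache.get(0).canServe) {
                return attempt;
            }
            rotate(cache);              // failure: rotate, then the client retries
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(readWithRetries(case1Cache(), 10, true));   // prints -1
        System.out.println(readWithRetries(case1Cache(), 10, false));  // prints 3
    }
}
```

With the observer reorder enabled, every attempt lands on observer-1 again and the read never succeeds. Without it, plain rotation reaches standby-3 on the third attempt.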
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org