[jira] [Commented] (HDFS-17232) RBF: Fix NoNamenodesAvailableException for a long time, when use observer

ASF GitHub Bot (Jira) Wed, 15 Nov 2023 12:53:06 -0800


    [ https://issues.apache.org/jira/browse/HDFS-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786498#comment-17786498 ]


ASF GitHub Bot commented on HDFS-17232:
---------------------------------------

simbadzina commented on code in PR #6208:
URL: https://github.com/apache/hadoop/pull/6208#discussion_r1394785317


##########
hadoop-hdfs-project/hadoop-hdfs-rbf/src/test/java/org/apache/hadoop/hdfs/server/federation/MiniRouterDFSCluster.java:
##########
@@ -253,6 +253,19 @@ public FileSystem getFileSystemWithObserverReadProxyProvider() throws IOExceptio
       return DistributedFileSystem.get(observerReadConf);
     }
 
+    public FileSystem getFileSystemWithConfiguredFailoverProxyProvider() throws IOException {
+      conf.set(DFS_NAMESERVICES,
+          conf.get(DFS_NAMESERVICES)+ ",router-service");
+      conf.set(DFS_HA_NAMENODES_KEY_PREFIX + ".router-service", "router1");
+      conf.set(DFS_NAMENODE_RPC_ADDRESS_KEY+ ".router-service.router1",
+          getFileSystemURI().toString());
+      conf.set(HdfsClientConfigKeys.Failover.PROXY_PROVIDER_KEY_PREFIX
+          + "." + "router-service", ConfiguredFailoverProxyProvider.class.getName());
+      DistributedFileSystem.setDefaultUri(conf, "hdfs://router-service");
+
+      return DistributedFileSystem.get(conf);
+    }

Review Comment:
   `getFileSystemWithConfiguredFailoverProxyProvider()` and 
`getFileSystemWithObserverReadProxyProvider()` share the same configuration 
apart from the proxy provider. You can factor the shared configuration into a 
separate function, so that each method only sets its proxy provider.
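
A minimal sketch of that refactoring, under stated assumptions: a plain `Map<String, String>` stands in for Hadoop's `Configuration` so the example is self-contained, the key strings are the values behind the constants used in the diff, and the class and helper names (`RouterClientConfSketch`, `routerClientConf`, `withProxyProvider`) are hypothetical. Unlike the diff, the sketch copies the base configuration instead of mutating it, which is a design choice of the sketch and not something the PR does.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a Map stands in for org.apache.hadoop.conf.Configuration.
class RouterClientConfSketch {
  static final String DFS_NAMESERVICES = "dfs.nameservices";
  static final String DFS_HA_NAMENODES_KEY_PREFIX = "dfs.ha.namenodes";
  static final String DFS_NAMENODE_RPC_ADDRESS_KEY = "dfs.namenode.rpc-address";
  static final String PROXY_PROVIDER_KEY_PREFIX = "dfs.client.failover.proxy.provider";
  static final String FS_DEFAULT_FS = "fs.defaultFS";
  static final String ROUTER_NS = "router-service";

  // Hypothetical shared helper: every setting except the proxy provider.
  static Map<String, String> routerClientConf(Map<String, String> base, String routerUri) {
    // Copy first, so repeated calls do not keep appending to the base config.
    Map<String, String> conf = new HashMap<>(base);
    // Append the router nameservice to any nameservices already configured.
    conf.merge(DFS_NAMESERVICES, ROUTER_NS, (old, added) -> old + "," + added);
    conf.put(DFS_HA_NAMENODES_KEY_PREFIX + "." + ROUTER_NS, "router1");
    conf.put(DFS_NAMENODE_RPC_ADDRESS_KEY + "." + ROUTER_NS + ".router1", routerUri);
    conf.put(FS_DEFAULT_FS, "hdfs://" + ROUTER_NS);
    return conf;
  }

  // Each public getter would then differ only in the provider class it passes here.
  static Map<String, String> withProxyProvider(
      Map<String, String> base, String routerUri, String providerClassName) {
    Map<String, String> conf = routerClientConf(base, routerUri);
    conf.put(PROXY_PROVIDER_KEY_PREFIX + "." + ROUTER_NS, providerClassName);
    return conf;
  }
}
```

With a helper like this, the observer-read and configured-failover variants would each reduce to one call that passes the corresponding provider class name.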





> RBF: Fix NoNamenodesAvailableException for a long time, when use observer
> -------------------------------------------------------------------------
>
>                 Key: HDFS-17232
>                 URL: https://issues.apache.org/jira/browse/HDFS-17232
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Jian Zhang
>            Assignee: Jian Zhang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HDFS-17232.001.patch
>
>
> *Describe*
> I fixed the long-lasting NoNamenodesAvailableException that occurs on 
> failover when observers are not used, but several problems remain when 
> observers are used.
>  #  When the observer fails and there is no active namenode at this time, 
> even if we can rotate the cache, the next request will shuffle the observer 
> namenode to the front of the cache due to the use of the observer, so retry 
> will still send the request to the failed observer node.
>  # If there are multiple observers, and an exception occurs when accessing an 
> observer and there is no active namenode at this time, a 
> NoNamenodesAvailableException will be caused and the server will try again. 
> However, since using the observer will put the observer node at the front of 
> the cache, it may still fail.
>  # When there are multiple observers, one of which is unavailable and there 
> is no active namenode at this time, we should continue to try the next 
> observer, so that the currently unavailable observer can be marked as 
> unavailable, and subsequent requests can avoid the unavailable observer.
>  # If the failure is caused by an illegal operation, that is, one that raises 
> an exception even when sent to the active namenode, the result is also a 
> NoNamenodesAvailableException. If the cache is rotated at this point, the next 
> legal request will be sent to a namenode that really is a standby and will 
> fail. Illegal operations should therefore not rotate the cache.
>  
> Detailed bug description: HDFS-17166
>  
> - *case  1:*
> * router's cache : [ observer-1(problematic), standby-2, standby-3(actually 
> active) ]
> * client read  -> observer-1   throw   NoNamenodesAvailableException  -> 
> rotate the cache -> [ standby-2, standby-3(actually 
> active),observer-1(problematic) ]
> * client retry read ->  shuffleObserverNN ->   [ observer-1(problematic), 
> standby-2, standby-3(actually active) ] -> observer-1   throw   
> NoNamenodesAvailableException  -> rotate the cache -> [ standby-2, 
> standby-3(actually active),observer-1(problematic) ]
> * *.....*
> * client  (retries > max.attempts)   ->    Read failed
>  
> - *case 2:*
> * router's cache :   [ observer-1(problematic), observer-2, standby-3, 
> standby-4(actually active) ]  
> * client read  -> observer-1   throw   NoNamenodesAvailableException  -> 
> rotate the cache -> [ observer-2, standby-3, standby-4(actually 
> active),observer-1(problematic) ]
> * client retry read ->  shuffleObserverNN ->  [ observer-1(problematic), 
> observer-2, standby-3, standby-4(actually active) ] (may happen) -> 
> observer-1   throw   NoNamenodesAvailableException  -> rotate the cache -> [ 
> observer-2, standby-3, standby-4(actually active),observer-1(problematic) ]
> * *.....*
> * client  may (retries > max.attempts)   ->    Read failed
> - *case 3:*
> * router's cache :   [ standby-1, standby-2(actually active) ]  
> * client request  -> standby-1   throw   NoNamenodesAvailableException  -> 
> rotate the cache -> [ standby-2(actually active),standby-1 ]
> * client retry request ->  standby-2(actually active) success
> * client illegal request -> standby-2(actually active)  throw   
> NoNamenodesAvailableException -> rotate the cache -> [ standby-1, 
> standby-2(actually active) ]
> * client legal request -> standby-1 throw   NoNamenodesAvailableException  
> failed
> *How to reproduce*
> I have provided unit tests: TestNoNamenodesAvailableLongTime
> You can use the original code and run my new unit tests to reproduce the 
> above problems.
>  
>  
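
The retry loop in case 1 of the quoted description can be illustrated with a toy model. This is a sketch under stated assumptions, not the Router's code: the cache is a list of names, `observersFirst` is a deterministic, hypothetical stand-in for `shuffleObserverNN` (which the description says moves observers to the front), each attempt contacts only the head of the list, and a failure rotates the head to the tail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of case 1; all names here are hypothetical.
class ObserverRetrySketch {
  // Stand-in for shuffleObserverNN: observers go first, relative order is kept.
  static List<String> observersFirst(List<String> cache) {
    List<String> ordered = new ArrayList<>();
    for (String nn : cache) {
      if (nn.startsWith("observer")) ordered.add(nn);
    }
    for (String nn : cache) {
      if (!nn.startsWith("observer")) ordered.add(nn);
    }
    return ordered;
  }

  // Counts how many attempts land on the failed node before a different node is reached.
  static int failedAttempts(List<String> cache, String failedNode, int maxAttempts) {
    List<String> current = new ArrayList<>(cache);
    int failures = 0;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      current = observersFirst(current);      // reorder before every request
      if (!current.get(0).equals(failedNode)) {
        break;                                // the request reaches a different node
      }
      failures++;
      Collections.rotate(current, -1);        // rotate the cache: head moves to the tail
    }
    return failures;
  }
}
```

In this model, the cache `[observer-1, standby-2, standby-3]` with `observer-1` failing uses up every attempt, because the reordering undoes each rotation. A cache with no observers moves past the failed head after a single failure.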


