[jira] [Updated] (HDFS-17232) RBF: Fix NoNamenodesAvailableException for a long time, when use observer

Jian Zhang (Jira) Sun, 22 Oct 2023 23:41:05 -0700


     [ 
https://issues.apache.org/jira/browse/HDFS-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jian Zhang updated HDFS-17232:
------------------------------
    Description: 
*Describe*

I fixed the long-lasting NoNamenodesAvailableException that occurs on failover 
when observers are not used, but several problems remain when observers are used.
 # When the observer fails and there is no active namenode at that time, 
rotating the cache does not help: because observer reads are in use, the next 
request shuffles the observer namenode back to the front of the cache, so the 
retry is still sent to the failed observer node.
 # If there are multiple observers, and an exception occurs when accessing one 
of them while there is no active namenode, a NoNamenodesAvailableException is 
thrown and the request is retried. However, because observer reads put the 
observer nodes at the front of the cache, the retry may still fail.
 # When there are multiple observers, one of which is unavailable, and there is 
no active namenode at that time, we should continue to try the next observer. 
The failed observer can then be marked as unavailable, and subsequent requests 
can avoid it.
 # If the failure is caused by an illegal operation, that is, an operation that 
raises an exception even when it is sent to the active namenode, the result is 
also a NoNamenodesAvailableException. If the cache is rotated at this point, 
the next legal request is sent to a namenode that really is a standby and 
fails. Illegal operations should therefore not rotate the cache.

 

Detailed bug description: HDFS-17166

 

- *case 1:*
* router's cache: [ observer-1(problematic), standby-2, standby-3(actually 
active) ]

* client read -> observer-1 throws NoNamenodesAvailableException -> rotate 
the cache -> [ standby-2, standby-3(actually active), observer-1(problematic) ]

* client retry read -> shuffleObserverNN -> [ observer-1(problematic), 
standby-2, standby-3(actually active) ] -> observer-1 throws 
NoNamenodesAvailableException -> rotate the cache -> [ standby-2, 
standby-3(actually active), observer-1(problematic) ]

* *.....*

* client (retries > max.attempts) -> Read failed
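
The loop in case 1 can be reproduced with a small toy model. This is only a sketch in Python, not the router code: the helper names (shuffle_observers_first, rotate, read_with_retries) are hypothetical, and it assumes each attempt contacts only the entry at the head of the cache and that a failure rotates the cache by one position.

```python
from typing import List, Optional, Set


def shuffle_observers_first(cache: List[str]) -> List[str]:
    """Toy stand-in for the observer shuffle: observers move to the front."""
    observers = [nn for nn in cache if nn.startswith("observer")]
    others = [nn for nn in cache if not nn.startswith("observer")]
    return observers + others


def rotate(cache: List[str]) -> List[str]:
    """Move the head of the cache to the end."""
    return cache[1:] + cache[:1]


def read_with_retries(cache: List[str], broken: Set[str],
                      max_attempts: int) -> Optional[str]:
    """Return the namenode that served the read, or None if every attempt failed."""
    for _ in range(max_attempts):
        # Assumption: every observer read re-shuffles observers to the front.
        cache = shuffle_observers_first(cache)
        target = cache[0]
        if target not in broken:
            return target
        # Failure: rotate the cache, as in the trace above.
        cache = rotate(cache)
    return None


initial_cache = ["observer-1", "standby-2", "standby-3"]
served_by = read_with_retries(initial_cache, broken={"observer-1"}, max_attempts=10)
print(served_by)
```

In this model the rotation is undone by the shuffle on every retry, so the read never gets past observer-1, no matter how large max_attempts is.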
 
- *case 2:*
* router's cache: [ observer-1(problematic), observer-2, standby-3, 
standby-4(actually active) ]

* client read -> observer-1 throws NoNamenodesAvailableException -> rotate 
the cache -> [ observer-2, standby-3, standby-4(actually active), 
observer-1(problematic) ]

* client retry read -> shuffleObserverNN -> [ observer-1(problematic), 
observer-2, standby-3, standby-4(actually active) ] (may happen) -> observer-1 
throws NoNamenodesAvailableException -> rotate the cache -> [ observer-2, 
standby-3, standby-4(actually active), observer-1(problematic) ]

* *.....*

* client may (retries > max.attempts) -> Read failed
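
For the multi-observer case, the behaviour proposed in item 3 of the description (try the next observer, mark the failed one as unavailable) might look roughly like the following. This is a hedged Python sketch and not the actual patch: read_from_observers and the unavailable set are hypothetical names, and how the router really tracks unavailable namenodes is not modelled here.

```python
from typing import List, Optional, Set, Tuple


def read_from_observers(
    observers: List[str], broken: Set[str], unavailable: Set[str]
) -> Tuple[Optional[str], List[str]]:
    """Try observers in order; return (serving namenode, namenodes contacted)."""
    contacted: List[str] = []
    for nn in observers:
        if nn in unavailable:
            # Already known to be bad: skip it without contacting it.
            continue
        contacted.append(nn)
        if nn in broken:
            # Remember the failure and move on to the next observer.
            unavailable.add(nn)
            continue
        return nn, contacted
    return None, contacted


observers = ["observer-1", "observer-2"]
unavailable: Set[str] = set()
first = read_from_observers(observers, {"observer-1"}, unavailable)
second = read_from_observers(observers, {"observer-1"}, unavailable)
print(first, second)
```

Under these assumptions the first read pays for one failed call to observer-1 and is then served by observer-2, and the second read goes to observer-2 directly.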



- *case 3:*
* router's cache: [ standby-1, standby-2(actually active) ]

* client request -> standby-1 throws NoNamenodesAvailableException -> rotate 
the cache -> [ standby-2(actually active), standby-1 ]

* client retry request -> standby-2(actually active) succeeds

* client illegal request -> standby-2(actually active) throws 
NoNamenodesAvailableException -> rotate the cache -> [ standby-1, 
standby-2(actually active) ]

* client legal request -> standby-1 throws NoNamenodesAvailableException -> 
failed
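
Case 3 can be modelled the same way. The sketch below is a toy Python model, not router code: send, run_case_3 and the rotate_on_illegal flag are hypothetical, and it assumes a failure caused by an illegal operation can be told apart from a failure caused by hitting a standby.

```python
from typing import List, Tuple

ACTIVE = "standby-2"  # cached as standby, but actually the active namenode


def send(cache: List[str], legal: bool,
         rotate_on_illegal: bool) -> Tuple[bool, List[str]]:
    """Send one request to the head of the cache; return (success, new cache)."""
    target = cache[0]
    rotated = cache[1:] + cache[:1]
    if target != ACTIVE:
        # Reached a real standby: fail and rotate.
        return False, rotated
    if legal:
        return True, cache
    # Illegal operation on the active namenode: it fails either way.
    # The flag decides whether this failure also rotates the cache.
    return False, (rotated if rotate_on_illegal else cache)


def run_case_3(rotate_on_illegal: bool) -> bool:
    """Replay the trace above; return whether the final legal request succeeds."""
    cache = ["standby-1", "standby-2"]
    _, cache = send(cache, True, rotate_on_illegal)   # hits standby-1, rotates
    ok, cache = send(cache, True, rotate_on_illegal)  # retry reaches the active
    assert ok
    _, cache = send(cache, False, rotate_on_illegal)  # illegal request
    ok, _ = send(cache, True, rotate_on_illegal)      # next legal request
    return ok


print(run_case_3(rotate_on_illegal=True), run_case_3(rotate_on_illegal=False))
```

With rotation on illegal requests the final legal request lands on standby-1 and fails. Without it, the cache still points at the active namenode and the request succeeds.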


*How to reproduce*

I have provided unit tests: TestNoNamenodesAvailableLongTime

You can run the new unit tests against the original code to reproduce the 
problems above.

 

 

  was:
*Describe*

I fixed the long-lasting NoNamenodesAvailableException that occurs on failover 
when observers are not used, but several problems remain when observers are used.
 # When the observer fails and there is no active namenode at that time, 
rotating the cache does not help: because observer reads are in use, the next 
request shuffles the observer namenode back to the front of the cache, so the 
retry is still sent to the failed observer node.
 # If there are multiple observers, and an exception occurs when accessing one 
of them while there is no active namenode, a NoNamenodesAvailableException is 
thrown and the request is retried. However, because observer reads put the 
observer nodes at the front of the cache, the retry may still fail.
 # When there are multiple observers, one of which is unavailable, and there is 
no active namenode at that time, we should continue to try the next observer. 
The failed observer can then be marked as unavailable, and subsequent requests 
can avoid it.
 # If the failure is caused by an illegal operation, that is, an operation that 
raises an exception even when it is sent to the active namenode, the result is 
also a NoNamenodesAvailableException. If the cache is rotated at this point, 
the next legal request is sent to a namenode that really is a standby and 
fails. Illegal operations should therefore not rotate the cache.

 

Detailed bug description: HDFS-17166

 

{*}How to reproduce{*}

I have provided unit tests:TestNoNamenodesAvailableLongTime

You can use the original code and run my new unit tests to reproduce the above 
problems.

 

 


> RBF: Fix NoNamenodesAvailableException for a long time, when use observer
> -------------------------------------------------------------------------
>
>                 Key: HDFS-17232
>                 URL: https://issues.apache.org/jira/browse/HDFS-17232
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Jian Zhang
>            Assignee: Jian Zhang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HDFS-17232.001.patch
>
>
> *Describe*
> I fixed the long-lasting NoNamenodesAvailableException that occurs on 
> failover when observers are not used, but several problems remain when 
> observers are used.
>  # When the observer fails and there is no active namenode at that time, 
> rotating the cache does not help: because observer reads are in use, the next 
> request shuffles the observer namenode back to the front of the cache, so the 
> retry is still sent to the failed observer node.
>  # If there are multiple observers, and an exception occurs when accessing 
> one of them while there is no active namenode, a 
> NoNamenodesAvailableException is thrown and the request is retried. However, 
> because observer reads put the observer nodes at the front of the cache, the 
> retry may still fail.
>  # When there are multiple observers, one of which is unavailable, and there 
> is no active namenode at that time, we should continue to try the next 
> observer. The failed observer can then be marked as unavailable, and 
> subsequent requests can avoid it.
>  # If the failure is caused by an illegal operation, that is, an operation 
> that raises an exception even when it is sent to the active namenode, the 
> result is also a NoNamenodesAvailableException. If the cache is rotated at 
> this point, the next legal request is sent to a namenode that really is a 
> standby and fails. Illegal operations should therefore not rotate the cache.
>  
> Detailed bug description: HDFS-17166
>  
> - *case 1:*
> * router's cache: [ observer-1(problematic), standby-2, standby-3(actually 
> active) ]
> * client read -> observer-1 throws NoNamenodesAvailableException -> rotate 
> the cache -> [ standby-2, standby-3(actually active), 
> observer-1(problematic) ]
> * client retry read -> shuffleObserverNN -> [ observer-1(problematic), 
> standby-2, standby-3(actually active) ] -> observer-1 throws 
> NoNamenodesAvailableException -> rotate the cache -> [ standby-2, 
> standby-3(actually active), observer-1(problematic) ]
> * *.....*
> * client (retries > max.attempts) -> Read failed
>  
> - *case 2:*
> * router's cache: [ observer-1(problematic), observer-2, standby-3, 
> standby-4(actually active) ]
> * client read -> observer-1 throws NoNamenodesAvailableException -> rotate 
> the cache -> [ observer-2, standby-3, standby-4(actually active), 
> observer-1(problematic) ]
> * client retry read -> shuffleObserverNN -> [ observer-1(problematic), 
> observer-2, standby-3, standby-4(actually active) ] (may happen) -> 
> observer-1 throws NoNamenodesAvailableException -> rotate the cache -> [ 
> observer-2, standby-3, standby-4(actually active), observer-1(problematic) ]
> * *.....*
> * client may (retries > max.attempts) -> Read failed
> - *case 3:*
> * router's cache: [ standby-1, standby-2(actually active) ]
> * client request -> standby-1 throws NoNamenodesAvailableException -> rotate 
> the cache -> [ standby-2(actually active), standby-1 ]
> * client retry request -> standby-2(actually active) succeeds
> * client illegal request -> standby-2(actually active) throws 
> NoNamenodesAvailableException -> rotate the cache -> [ standby-1, 
> standby-2(actually active) ]
> * client legal request -> standby-1 throws NoNamenodesAvailableException -> 
> failed
> *How to reproduce*
> I have provided unit tests: TestNoNamenodesAvailableLongTime
> You can run the new unit tests against the original code to reproduce the 
> problems above.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
