[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2024-01-27 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated HDFS-17166:
--
  Component/s: rbf
 Target Version/s: 3.4.0
Affects Version/s: 3.4.0

> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch, 
> HDFS-17166.003.patch, HDFS-17166.004.patch, HDFS-17166.005.patch, 
> HDFS-17166.patch, image-2023-08-26-11-48-22-131.png, 
> image-2023-08-26-11-56-50-181.png, image-2023-08-26-11-59-25-153.png, 
> image-2023-08-26-12-01-39-968.png, image-2023-08-26-12-06-01-275.png, 
> image-2023-08-26-12-07-47-010.png, image-2023-08-26-22-45-46-814.png, 
> image-2023-08-26-22-47-22-276.png, image-2023-08-26-22-47-41-988.png, 
> image-2023-08-26-22-48-02-086.png, image-2023-08-26-22-48-12-352.png
>
>

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-09-05 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: HDFS-17166.patch

> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch, 
> HDFS-17166.003.patch, HDFS-17166.004.patch, HDFS-17166.005.patch, 
> HDFS-17166.patch, image-2023-08-26-11-48-22-131.png, 
> image-2023-08-26-11-56-50-181.png, image-2023-08-26-11-59-25-153.png, 
> image-2023-08-26-12-01-39-968.png, image-2023-08-26-12-06-01-275.png, 
> image-2023-08-26-12-07-47-010.png, image-2023-08-26-22-45-46-814.png, 
> image-2023-08-26-22-47-22-276.png, image-2023-08-26-22-47-41-988.png, 
> image-2023-08-26-22-48-02-086.png, image-2023-08-26-22-48-12-352.png
>
>

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-09-05 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-17166:
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch, 
> HDFS-17166.003.patch, HDFS-17166.004.patch, HDFS-17166.005.patch, 
> image-2023-08-26-11-48-22-131.png, image-2023-08-26-11-56-50-181.png, 
> image-2023-08-26-11-59-25-153.png, image-2023-08-26-12-01-39-968.png, 
> image-2023-08-26-12-06-01-275.png, image-2023-08-26-12-07-47-010.png, 
> image-2023-08-26-22-45-46-814.png, image-2023-08-26-22-47-22-276.png, 
> image-2023-08-26-22-47-41-988.png, image-2023-08-26-22-48-02-086.png, 
> image-2023-08-26-22-48-12-352.png
>
>

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-29 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: HDFS-17166.005.patch

> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch, 
> HDFS-17166.003.patch, HDFS-17166.004.patch, HDFS-17166.005.patch, 
> image-2023-08-26-11-48-22-131.png, image-2023-08-26-11-56-50-181.png, 
> image-2023-08-26-11-59-25-153.png, image-2023-08-26-12-01-39-968.png, 
> image-2023-08-26-12-06-01-275.png, image-2023-08-26-12-07-47-010.png, 
> image-2023-08-26-22-45-46-814.png, image-2023-08-26-22-47-22-276.png, 
> image-2023-08-26-22-47-41-988.png, image-2023-08-26-22-48-02-086.png, 
> image-2023-08-26-22-48-12-352.png
>
>

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-29 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: HDFS-17166.004.patch

> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch, 
> HDFS-17166.003.patch, HDFS-17166.004.patch, 
> image-2023-08-26-11-48-22-131.png, image-2023-08-26-11-56-50-181.png, 
> image-2023-08-26-11-59-25-153.png, image-2023-08-26-12-01-39-968.png, 
> image-2023-08-26-12-06-01-275.png, image-2023-08-26-12-07-47-010.png, 
> image-2023-08-26-22-45-46-814.png, image-2023-08-26-22-47-22-276.png, 
> image-2023-08-26-22-47-41-988.png, image-2023-08-26-22-48-02-086.png, 
> image-2023-08-26-22-48-12-352.png
>
>

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-28 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: HDFS-17166.003.patch

> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch, 
> HDFS-17166.003.patch, image-2023-08-26-11-48-22-131.png, 
> image-2023-08-26-11-56-50-181.png, image-2023-08-26-11-59-25-153.png, 
> image-2023-08-26-12-01-39-968.png, image-2023-08-26-12-06-01-275.png, 
> image-2023-08-26-12-07-47-010.png, image-2023-08-26-22-45-46-814.png, 
> image-2023-08-26-22-47-22-276.png, image-2023-08-26-22-47-41-988.png, 
> image-2023-08-26-22-48-02-086.png, image-2023-08-26-22-48-12-352.png
>
>

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-26 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: (was: screenshot-1.png)

> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch, 
> image-2023-08-26-11-48-22-131.png, image-2023-08-26-11-56-50-181.png, 
> image-2023-08-26-11-59-25-153.png, image-2023-08-26-12-01-39-968.png, 
> image-2023-08-26-12-06-01-275.png, image-2023-08-26-12-07-47-010.png, 
> image-2023-08-26-22-45-46-814.png, image-2023-08-26-22-47-22-276.png, 
> image-2023-08-26-22-47-41-988.png, image-2023-08-26-22-48-02-086.png, 
> image-2023-08-26-22-48-12-352.png
>
>

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-26 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Description: 
When a nameservice (ns) fails over, the router may record that the ns has no active namenode (nn) and then be unable to find the active nn in that ns for about one minute. The client reports an error after exhausting its retries, and the router cannot serve requests for that ns for a long time.

11:52:44: error reports start
!image-2023-08-26-12-06-01-275.png|width=800,height=100!
11:53:46: error reports end
!image-2023-08-26-12-07-47-010.png|width=800,height=20!

 

At this point the failover has already completed successfully in the ns, and a client that connects directly to the active namenode succeeds, but a client going through the router cannot access the ns for up to a minute.

 

*There is a bug in this logic:*
 * A certain ns starts to fail over.
 * There is a window during which the ns has no active nn, and the router reports that status (no active nn) to the state store.
 * After a period of time, the router pulls the state store data to refresh its cache, and the cache now records that the ns has no active nn.
 * The failover then completes successfully, so the ns actually has an active nn again.
 * Assume it is not yet time for the router to refresh its cache.
 * A client sends a request for the ns to the router, and the router tries the first nn of the ns in its cache (which lists no active nn).
 * Unfortunately, that nn really is standby, so the request fails and enters the exception-handling logic. The router finds no active nn for the ns in its cache and throws NoNamenodesAvailableException.
 * The NoNamenodesAvailableException is wrapped in a RetriableException, which makes the client retry. Since every retry hits the same real standby nn in the cache (it is always first in the cache and has the highest priority), a NoNamenodesAvailableException is thrown on every attempt until the router refreshes its cache from the state store (see the sketch below).
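The following is a minimal, self-contained sketch of this failure loop. It is illustrative only: the class and field names are made up for this example, and it is not the actual RouterRpcClient/resolver code. The point it shows is that the cached order never changes between retries, so the real standby nn stays first until the next cache refresh.

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Toy model of the stale-cache retry loop; NOT the real Hadoop RBF code. */
public class StaleCacheDemo {

  enum State { ACTIVE, STANDBY }

  /** Cached view of one namenode: what the router thinks vs. what it really is. */
  static class CachedNamenode {
    final String id;
    final State cachedState; // state recorded in the router's cache (possibly stale)
    final State realState;   // actual state of the namenode
    CachedNamenode(String id, State cachedState, State realState) {
      this.id = id;
      this.cachedState = cachedState;
      this.realState = realState;
    }
  }

  public static void main(String[] args) {
    // Cache snapshot taken mid-failover: both nns recorded as STANDBY,
    // although nn6001 has already become ACTIVE in reality.
    List<CachedNamenode> cache = new ArrayList<>();
    cache.add(new CachedNamenode("nn6002", State.STANDBY, State.STANDBY)); // higher priority, listed first
    cache.add(new CachedNamenode("nn6001", State.STANDBY, State.ACTIVE));

    // Each client retry walks the same stale order, hits the real standby nn first,
    // sees no ACTIVE entry in the cache, and gives up with the same exception.
    for (int attempt = 1; attempt <= 3; attempt++) {
      CachedNamenode first = cache.get(0);
      boolean cacheHasActive = cache.stream().anyMatch(n -> n.cachedState == State.ACTIVE);
      if (first.realState != State.ACTIVE && !cacheHasActive) {
        System.out.println("attempt " + attempt + ": tried " + first.id
            + " (really standby), cache lists no active nn -> NoNamenodesAvailableException");
      }
    }
    // The same failure repeats on every attempt until the router refreshes its
    // cache from the state store, which by default can take up to about a minute.
  }
}
{code}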

 

*How to reproduce*
 # Suppose we have a ns ns60 with 2 nns: nn6001 is active and nn6002 is standby.
 # When nn6001 and nn6002 are both in standby state, nn6002 has a higher priority than nn6001.
 # Use the default configuration.
 # Shut down the ZKFC on both nns ({*}hadoop-daemon.sh stop zkfc{*}) so failover has to be performed manually.
 # Manually switch nn6001 active->standby: *hdfs haadmin -ns ns60 -transitionToStandby --forcemanual nn6001*
 # Make sure that the NamenodeHeartbeatService reports that nn6001 is standby  !image-2023-08-26-11-48-22-131.png|width=800,height=20!
 # Manually switch nn6001 standby->active: *hdfs haadmin -ns ns60 -transitionToActive --forcemanual nn6001*
 # The client accesses ns60 through the router  !image-2023-08-26-11-56-50-181.png|width=800,height=50!
 # After about one minute, request ns60 again through the router  !image-2023-08-26-11-59-25-153.png|width=800,height=50!
 # Both requests fail with exceptions; check the router log  !image-2023-08-26-12-01-39-968.png|width=800,height=20!
 # The router cannot respond to the client's requests for ns60 for about a minute.

 

 

*Fix the bug*

When the router's cache says an ns has no active nn but the ns actually does have one, and a client request ends in a NoNamenodesAvailableException, that proves the nn that was just tried is a real standby nn. Lower that nn's priority so that the next request reaches the real active nn, instead of repeatedly hitting the real standby nn and leaving the router unable to serve the ns until the next cache refresh.
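Below is a minimal sketch of that idea, reusing the toy cache model above. The class and method names are illustrative and do not correspond to the router's actual resolver API: when a request proves that the first cached nn is really standby, it is demoted so the next retry tries a different nn.

{code:java}
import java.util.List;

/** Toy sketch of the fix idea; the real change lives in the router's namenode resolution. */
final class PriorityDemotionSketch {

  /**
   * Called when a request to a namenode failed with a standby error while the
   * cache claimed the ns had no active nn: move that nn to the tail of the
   * cached order so the next attempt tries a different nn.
   */
  static <T> void demote(List<T> cachedOrder, T provenStandby) {
    if (cachedOrder.remove(provenStandby)) {
      cachedOrder.add(provenStandby); // lowest priority until the next cache refresh
    }
  }
}
{code}

With such a demotion, the second attempt in the loop sketched earlier would reach nn6001 (the real active nn), so only one NoNamenodesAvailableException is seen instead of a minute's worth of retries.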

 

*Test my patch*

*1. Unit testing*

*2. Comparison test*
 * Suppose we have 2 clients [c1, c2], 2 routers [r1, r2] and one ns [ns60]; the ns has 2 nns [nn6001, nn6002].
 * When both nn6001 and nn6002 are in standby state, nn6002 has a higher priority than nn6001.
 * r1 runs the build containing the fix; r2 runs the original build with the bug.
 * c1 sends requests to r1 in a loop, and c2 sends requests to r2 in a loop; all requests target ns60.
 * Put both nn6001 and nn6002 into standby state.
 * After the routers report that both nns are standby, switch nn6001 to active.
*14:15:24* nn6001 becomes active
!image-2023-08-26-22-45-46-814.png|width=800,height=120!
 * Check the log of router r1: after nn6001 switches to active, NoNamenodesAvailableException is printed only once.
!image-2023-08-26-22-47-22-276.png|width=800,height=30!
 * Check the log of router r2: NoNamenodesAvailableException is printed for more than one minute after nn6001 switches to active.
!image-2023-08-26-22-47-41-988.png|width=800,height=150!
 * At 14:16:25, client c2 (accessing the router with the bug) could not get data, while client c1 (accessing the fixed router) got data normally:

c2's 

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-26 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-22-48-12-352.png

> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch, 
> image-2023-08-26-11-48-22-131.png, image-2023-08-26-11-56-50-181.png, 
> image-2023-08-26-11-59-25-153.png, image-2023-08-26-12-01-39-968.png, 
> image-2023-08-26-12-06-01-275.png, image-2023-08-26-12-07-47-010.png, 
> image-2023-08-26-22-45-46-814.png, image-2023-08-26-22-47-22-276.png, 
> image-2023-08-26-22-47-41-988.png, image-2023-08-26-22-48-02-086.png, 
> image-2023-08-26-22-48-12-352.png, screenshot-1.png
>
>

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-26 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-22-48-02-086.png

> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch, 
> image-2023-08-26-11-48-22-131.png, image-2023-08-26-11-56-50-181.png, 
> image-2023-08-26-11-59-25-153.png, image-2023-08-26-12-01-39-968.png, 
> image-2023-08-26-12-06-01-275.png, image-2023-08-26-12-07-47-010.png, 
> image-2023-08-26-22-45-46-814.png, image-2023-08-26-22-47-22-276.png, 
> image-2023-08-26-22-47-41-988.png, image-2023-08-26-22-48-02-086.png, 
> image-2023-08-26-22-48-12-352.png, screenshot-1.png
>
>

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-26 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-22-47-41-988.png

> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch, 
> image-2023-08-26-11-48-22-131.png, image-2023-08-26-11-56-50-181.png, 
> image-2023-08-26-11-59-25-153.png, image-2023-08-26-12-01-39-968.png, 
> image-2023-08-26-12-06-01-275.png, image-2023-08-26-12-07-47-010.png, 
> image-2023-08-26-22-45-46-814.png, image-2023-08-26-22-47-22-276.png, 
> image-2023-08-26-22-47-41-988.png, screenshot-1.png
>
>

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-26 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-22-47-22-276.png

> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch, 
> image-2023-08-26-11-48-22-131.png, image-2023-08-26-11-56-50-181.png, 
> image-2023-08-26-11-59-25-153.png, image-2023-08-26-12-01-39-968.png, 
> image-2023-08-26-12-06-01-275.png, image-2023-08-26-12-07-47-010.png, 
> image-2023-08-26-22-45-46-814.png, image-2023-08-26-22-47-22-276.png, 
> image-2023-08-26-22-47-41-988.png, screenshot-1.png
>
>
> When ns failover, the router may record that the ns have no active namenode, 
> the router cannot find the active nn in the ns for about 1 minute. The client 
> will report an error after consuming the number of retries, and the router 
> will be unable to provide services for the ns for a long time.
>  11:52:44 Start reporting
> !image-2023-08-26-12-06-01-275.png|width=800,height=100!
> 11:53:46 end reporting
> !image-2023-08-26-12-07-47-010.png|width=800,height=20!
>  
> At this point, the failover has been successfully completed in the ns, and 
> the client can directly connect to the active namenode to access it 
> successfully, but the client cannot access the ns through router for up to a 
> minute
>  
> *There is a bug in this logic:*
>  * A certain ns starts to fail over.
>  * While the ns briefly has no active nn, the router reports that state (no 
> active nn) to the state store.
>  * After a while, the router pulls the state store data to refresh its 
> cache, so the cache now records that the ns has no active nn.
>  * The failover completes successfully, so the ns actually has an active nn 
> again.
>  * Assume the router has not yet refreshed its cache.
>  * A client sends a request for the ns to the router, and the router tries 
> the first nn of the ns in its cache (which still records no active nn).
>  * Unfortunately, that nn really is standby, so the request fails and enters 
> the exception-handling logic. The router sees no active nn for the ns in its 
> cache and throws NoNamenodesAvailableException.
>  * The NoNamenodesAvailableException is wrapped as a RetriableException, 
> which makes the client retry. Because every retry again hits the true 
> standby nn in the cache (it is always the first one in the cache and has the 
> highest priority), a NoNamenodesAvailableException is thrown every time 
> until the router refreshes its cache from the state store (see the sketch 
> after this list).
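> 
> Below is a minimal sketch of that failing path. The class and field names 
> are hypothetical (the real logic lives in the router's RPC client); it only 
> illustrates why every retry lands on the same standby nn until the cache is 
> refreshed:
> {code:java}
> import java.io.IOException;
> import java.util.List;
> 
> /**
>  * Hypothetical sketch, not the real RouterRpcClient: shows why each retry
>  * fails while the membership cache still says "no active nn".
>  */
> public class StaleCacheSketch {
> 
>   static class CachedNN {
>     final String id;
>     final boolean isReallyActive;   // true cluster state, unknown to the cache
>     CachedNN(String id, boolean active) { this.id = id; this.isReallyActive = active; }
>   }
> 
>   /** cachedNns is priority-ordered; cacheHasActive reflects the stale cache. */
>   static String invoke(List<CachedNN> cachedNns, boolean cacheHasActive)
>       throws IOException {
>     CachedNN first = cachedNns.get(0);          // highest priority, tried first
>     if (!first.isReallyActive && !cacheHasActive) {
>       // The nn is really standby and the cache still records "no active nn",
>       // so the router gives up for this ns instead of probing the other nn.
>       throw new IOException("NoNamenodesAvailableException for ns60");
>     }
>     return "response from " + first.id;
>   }
> 
>   public static void main(String[] args) {
>     List<CachedNN> cached = List.of(
>         new CachedNN("nn6002", false),          // true standby, but listed first
>         new CachedNN("nn6001", true));          // the real active nn
>     for (int retry = 0; retry < 3; retry++) {   // every retry behaves the same
>       try {
>         System.out.println(invoke(cached, false));
>       } catch (IOException e) {
>         System.out.println("retry " + retry + ": " + e.getMessage());
>       }
>     }
>   }
> }
> {code}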
>  
> *How to reproduce*
>  # Suppose we have an ns ns60 that contains 2 nn: nn6001 is active and 
> nn6002 is standby
>  # Assume that when nn6001 and nn6002 are both recorded as standby, nn6002 
> has a higher priority than nn6001 in the router's cache
>  # Use the default configuration
>  # Shut down the ZKFC of both nn ({*}hadoop-daemon.sh stop zkfc{*}) so that 
> failover has to be performed manually
>  # Manually switch nn6001 active->standby: *hdfs haadmin -ns ns60 
> -transitionToStandby --forcemanual nn6001* 
>  # Make sure that the NamenodeHeartbeatService reports that nn6001 is standby 
>  !image-2023-08-26-11-48-22-131.png|width=800,height=20!
>  # Manually switch nn6001 standby->active: *hdfs haadmin -ns ns60 
> -transitionToActive --forcemanual nn6001* 
>  # The client accesses ns60 through the router  
> !image-2023-08-26-11-56-50-181.png|width=800,height=50!
>  # After about one minute, request ns60 again through the router  
> !image-2023-08-26-11-59-25-153.png|width=800,height=50!
>  # Both requests fail with exceptions; check the router log  
> !image-2023-08-26-12-01-39-968.png|width=800,height=20!
>  # The router cannot respond to client requests for ns60 for about a minute 
> (see the note after this list)
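> 
> A note on the one-minute window: it matches the router's state store cache 
> refresh period. Assuming default settings, that period is controlled by 
> dfs.federation.router.cache.ttl (1 minute by default); the property name and 
> default here are quoted from memory and worth double-checking against 
> hdfs-rbf-default.xml:
> {code:java}
> import java.util.concurrent.TimeUnit;
> import org.apache.hadoop.conf.Configuration;
> 
> // Sketch only: the stale "no active nn" record survives until the router's
> // next state store cache refresh, which is why the outage lasts ~1 minute.
> public class CacheTtlSketch {
>   public static void main(String[] args) {
>     Configuration conf = new Configuration();
>     long ttlMs = conf.getTimeDuration("dfs.federation.router.cache.ttl",
>         TimeUnit.MINUTES.toMillis(1), TimeUnit.MILLISECONDS);
>     System.out.println("router cache refresh interval: " + ttlMs + " ms");
>   }
> }
> {code}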
>  
>  
> *Fix the bug*
> When the router's cache says an ns has no active nn but the ns actually has 
> one, and a client request is about to fail with a 
> NoNamenodesAvailableException, the nn that was just tried is proven to be a 
> real standby. Lower that nn's priority in the cache so that the next request 
> reaches the real active nn, instead of repeatedly hitting the same standby 
> until the cache is refreshed; during that window the router cannot serve the 
> ns to its clients.
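> 
> A minimal sketch of this idea follows. The class and method names are 
> hypothetical and do not claim to match the actual HDFS-17166 patch; they 
> only illustrate demoting a namenode that just proved to be a real standby so 
> the next retry starts from the other nn:
> {code:java}
> import java.util.LinkedList;
> import java.util.List;
> 
> /** Hypothetical sketch of the proposed handling, not the actual patch. */
> public class DemoteOnStandbySketch {
> 
>   /** Cached namenodes for one ns, ordered by priority (highest first). */
>   private final LinkedList<String> cachedOrder = new LinkedList<>();
> 
>   public DemoteOnStandbySketch(List<String> initialOrder) {
>     cachedOrder.addAll(initialOrder);
>   }
> 
>   /**
>    * Called when a request to nnId failed with a standby error while the
>    * cache claims the ns has no active nn: before throwing
>    * NoNamenodesAvailableException, push the proven-standby nn to the lowest
>    * priority so the next retry tries the other nn.
>    */
>   public synchronized void demote(String nnId) {
>     if (cachedOrder.remove(nnId)) {
>       cachedOrder.addLast(nnId);
>     }
>   }
> 
>   public synchronized List<String> currentOrder() {
>     return List.copyOf(cachedOrder);
>   }
> 
>   public static void main(String[] args) {
>     DemoteOnStandbySketch cache =
>         new DemoteOnStandbySketch(List.of("nn6002", "nn6001"));
>     cache.demote("nn6002");                    // nn6002 just proved standby
>     System.out.println(cache.currentOrder());  // [nn6001, nn6002]
>   }
> }
> {code}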



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-26 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-22-45-46-814.png




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-26 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: screenshot-1.png




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Description: 

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: (was: image-2023-08-26-11-48-07-378.png)




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: (was: image-2023-08-26-00-24-02-016.png)




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: (was: image-2023-08-26-00-25-42-086.png)




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Description: 

[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-12-07-47-010.png




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-12-06-01-275.png




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-12-01-39-968.png




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-11-59-25-153.png




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-11-56-50-181.png




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Description: 
When a nameservice (ns) fails over, the router may record that the ns has no 
active namenode and then be unable to find the active nn in that ns for about 1 
minute. The client reports an error after exhausting its retries, and the router 
is unable to serve requests for that ns for a long time.

 11:52:44 errors start being reported
!image-2023-08-26-00-24-02-016.png|width=800,height=100!
11:53:46 errors stop being reported

!image-2023-08-26-00-25-42-086.png|width=800,height=50!

 

At this point the failover has already completed successfully in the ns, and a 
client that connects to the active namenode directly can access it, but the 
client cannot access the ns through the router for up to a minute.

 

*There is a bug in this logic:*
 * A nameservice starts to fail over.
 * While there is momentarily no active nn in the ns, the router reports that 
status (no active nn) to the state store.
 * After a while, the router pulls the state store data to refresh its cache, 
and the cache now records that the ns has no active nn.
 * The failover then completes successfully, so the ns actually has an active nn 
again.
 * Assume the router is not yet due to refresh its cache.
 * A client sends a request for the ns to the router, and the router tries the 
first nn of the ns in its cache (which, according to the cache, has no active 
nn).
 * Unfortunately, that nn really is standby, so the request fails and enters the 
exception-handling logic. The router finds no active nn for the ns in its cache 
and throws NoNamenodesAvailableException.
 * The NoNamenodesAvailableException is wrapped as a RetriableException, which 
makes the client retry. Because every retry reaches the real standby nn in the 
cache (it is always the first entry and has the highest priority), a 
NoNamenodesAvailableException is thrown every time until the router refreshes 
the cache from the state store (see the sketch below).
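The loop in the last bullet can be shown with a small, self-contained sketch. 
This is not the actual RouterRpcClient code; the names (StaleCacheRetryDemo, 
CachedNamenode, HAState) are invented for illustration, and the sketch only 
models why every retry lands on the same real-standby namenode until the cache 
is refreshed from the state store.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Toy model of the failure mode: the cache was refreshed while the
// nameservice had no active namenode, failover has since completed,
// but the router keeps trying the first cached entry.
public class StaleCacheRetryDemo {

  enum HAState { ACTIVE, STANDBY }

  // A namenode as the router remembers it, next to what it really is.
  static final class CachedNamenode {
    final String id;
    final HAState cachedState; // what the router's cache says
    final HAState realState;   // what the namenode actually is
    CachedNamenode(String id, HAState cachedState, HAState realState) {
      this.id = id;
      this.cachedState = cachedState;
      this.realState = realState;
    }
  }

  public static void main(String[] args) {
    List<CachedNamenode> cache = new ArrayList<>();
    // Both entries were cached as STANDBY during the failover window;
    // nn6002 is really ACTIVE again, but the cache does not know that yet.
    cache.add(new CachedNamenode("nn6001", HAState.STANDBY, HAState.STANDBY));
    cache.add(new CachedNamenode("nn6002", HAState.STANDBY, HAState.ACTIVE));

    for (int attempt = 1; attempt <= 3; attempt++) {
      // The router always picks the first entry of the cached order.
      CachedNamenode nn = cache.get(0);
      if (nn.realState != HAState.ACTIVE) {
        // The request fails and, since the cache shows no active nn for the
        // nameservice, the router reports NoNamenodesAvailableException and
        // the client retries -- hitting the same namenode again.
        System.out.println("attempt " + attempt + ": " + nn.id
            + " is really standby -> NoNamenodesAvailableException");
      }
    }
    // Only a cache refresh from the state store, or reordering the cached
    // entries as proposed in the fix, breaks this loop.
  }
}
{code}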

 

*How to reproduce*
 # Suppose we have a nameservice ns60 with 2 nn: nn6001 is active and nn6002 is 
standby.
 # Use the default configuration.
 # Shut down the zkfc of both nn ({*}hadoop-daemon.sh stop zkfc{*}) so that 
failover has to be performed manually.
 # Manually switch nn6001 active->standby: *hdfs haadmin -ns ns60 
-transitionToStandby --forcemanual nn6001* (steps 3 and 4 are also scripted in 
the sketch after this list)
 # Make sure that the NamenodeHeartbeatService reports that nn6001 is standby 
!image-2023-08-26-11-48-22-131.png|width=800,height=20!
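If you want to script steps 3 and 4 instead of typing them, the tiny wrapper 
below just shells out to the same commands listed above. It assumes the Hadoop 
scripts are on the PATH of the host it runs on and that ns60/nn6001 exist as 
described; it is only a convenience for the manual steps, not part of the 
reproduction itself.

{code:java}
import java.io.IOException;

// Convenience wrapper around the CLI calls from steps 3 and 4.
public class ManualFailoverSteps {

  private static void run(String... command)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder(command).inheritIO().start();
    int rc = p.waitFor();
    System.out.println(String.join(" ", command) + " -> exit " + rc);
  }

  public static void main(String[] args) throws Exception {
    // Step 3: stop the ZKFC so no automatic failover happens
    // (run this on every namenode host).
    run("hadoop-daemon.sh", "stop", "zkfc");
    // Step 4: force nn6001 from active to standby.
    run("hdfs", "haadmin", "-ns", "ns60",
        "-transitionToStandby", "--forcemanual", "nn6001");
  }
}
{code}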

*Fix the bug*

When the router's cache says an ns has no active nn but the ns actually does 
have one, and a client request ends up throwing NoNamenodesAvailableException, 
the nn that was just tried is proven to be a real standby nn. The priority of 
that nn should be lowered so that the next request reaches the real active nn, 
instead of repeatedly hitting the real standby nn and leaving the router unable 
to serve the ns until the next cache refresh.
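A minimal sketch of that priority demotion, assuming the router keeps an 
ordered list of namenodes per nameservice. The class and method names 
(NamenodePriorityQueue, demote) are hypothetical and this is not the actual 
HDFS-17166 patch; it only shows the intended effect of moving the namenode that 
just proved to be a real standby to the back of the order, so the retry reaches 
the real active namenode instead of failing until the next state-store refresh.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical per-nameservice ordering of namenodes, used to illustrate
// "lower the priority of the namenode that turned out to be a real standby".
public class NamenodePriorityQueue {

  private final Deque<String> order = new ArrayDeque<>();

  public NamenodePriorityQueue(String... namenodeIds) {
    for (String id : namenodeIds) {
      order.addLast(id);
    }
  }

  // The namenode the router would try first for this nameservice.
  public synchronized String first() {
    return order.peekFirst();
  }

  // Called when a request to namenodeId ended in NoNamenodesAvailableException
  // although the nameservice should have an active namenode: the namenode we
  // just tried is a real standby, so move it to the end of the order instead
  // of trying it again on the next request.
  public synchronized void demote(String namenodeId) {
    if (order.remove(namenodeId)) {
      order.addLast(namenodeId);
    }
  }

  public static void main(String[] args) {
    NamenodePriorityQueue ns60 = new NamenodePriorityQueue("nn6001", "nn6002");
    System.out.println("first try:    " + ns60.first()); // nn6001, really standby
    // The call failed with NoNamenodesAvailableException, so demote nn6001.
    ns60.demote("nn6001");
    System.out.println("after demote: " + ns60.first()); // nn6002, the real active
  }
}
{code}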


[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Description: 
When a nameservice (ns) fails over, the router may record that the ns has no 
active namenode and then be unable to find the active nn in that ns for about 1 
minute. The client reports an error after exhausting its retries, and the router 
is unable to serve requests for that ns for a long time.

 11:52:44 errors start being reported
!image-2023-08-26-00-24-02-016.png|width=800,height=100!
11:53:46 errors stop being reported

!image-2023-08-26-00-25-42-086.png|width=800,height=50!

 

At this point the failover has already completed successfully in the ns, and a 
client that connects to the active namenode directly can access it, but the 
client cannot access the ns through the router for up to a minute.

 

*There is a bug in this logic:*
 * A nameservice starts to fail over.
 * While there is momentarily no active nn in the ns, the router reports that 
status (no active nn) to the state store.
 * After a while, the router pulls the state store data to refresh its cache, 
and the cache now records that the ns has no active nn.
 * The failover then completes successfully, so the ns actually has an active nn 
again.
 * Assume the router is not yet due to refresh its cache.
 * A client sends a request for the ns to the router, and the router tries the 
first nn of the ns in its cache (which, according to the cache, has no active 
nn).
 * Unfortunately, that nn really is standby, so the request fails and enters the 
exception-handling logic. The router finds no active nn for the ns in its cache 
and throws NoNamenodesAvailableException.
 * The NoNamenodesAvailableException is wrapped as a RetriableException, which 
makes the client retry. Because every retry reaches the real standby nn in the 
cache (it is always the first entry and has the highest priority), a 
NoNamenodesAvailableException is thrown every time until the router refreshes 
the cache from the state store.

 

*How to reproduce*
 # Suppose we have a nameservice ns60 with 2 nn: nn6001 is active and nn6002 is 
standby.
 # Use the default configuration.
 # Shut down the zkfc of both nn ({*}hadoop-daemon.sh stop zkfc{*}) so that 
failover has to be performed manually.
 # Manually switch nn6001 active->standby: hdfs haadmin -ns ns60 
-transitionToStandby --forcemanual nn6001
 # Make sure that the NamenodeHeartbeatService reports that nn6001 is standby 
!image-2023-08-26-11-48-22-131.png!

 

*Fix the bug*

When the router's cache says an ns has no active nn but the ns actually does 
have one, and a client request ends up throwing NoNamenodesAvailableException, 
the nn that was just tried is proven to be a real standby nn. The priority of 
that nn should be lowered so that the next request reaches the real active nn, 
instead of repeatedly hitting the real standby nn and leaving the router unable 
to serve the ns until the next cache refresh.


[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-11-48-22-131.png




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: image-2023-08-26-11-48-07-378.png




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Description: 
When a nameservice (ns) fails over, the router may record that the ns has no 
active namenode and then be unable to find the active nn in that ns for about 1 
minute. The client reports an error after exhausting its retries, and the router 
is unable to serve requests for that ns for a long time.

 11:52:44 errors start being reported
!image-2023-08-26-00-24-02-016.png|width=800,height=100!
11:53:46 errors stop being reported

!image-2023-08-26-00-25-42-086.png|width=800,height=50!

 

At this point the failover has already completed successfully in the ns, and a 
client that connects to the active namenode directly can access it, but the 
client cannot access the ns through the router for up to a minute.

 

*There is a bug in this logic:*
 * A nameservice starts to fail over.
 * While there is momentarily no active nn in the ns, the router reports that 
status (no active nn) to the state store.
 * After a while, the router pulls the state store data to refresh its cache, 
and the cache now records that the ns has no active nn.
 * The failover then completes successfully, so the ns actually has an active nn 
again.
 * Assume the router is not yet due to refresh its cache.
 * A client sends a request for the ns to the router, and the router tries the 
first nn of the ns in its cache (which, according to the cache, has no active 
nn).
 * Unfortunately, that nn really is standby, so the request fails and enters the 
exception-handling logic. The router finds no active nn for the ns in its cache 
and throws NoNamenodesAvailableException.
 * The NoNamenodesAvailableException is wrapped as a RetriableException, which 
makes the client retry. Because every retry reaches the real standby nn in the 
cache (it is always the first entry and has the highest priority), a 
NoNamenodesAvailableException is thrown every time until the router refreshes 
the cache from the state store.

 

*How to reproduce*
 # Suppose we have a nameservice ns60 with 2 nn: nn6001 is active and nn6002 is 
standby.
 # Use the default configuration.
 # Shut down the zkfc of both nn ({*}hadoop-daemon.sh stop zkfc{*}) so that 
failover has to be performed manually.
 # Manually switch nn6001 active->standby: hdfs haadmin -ns ns60 
-transitionToStandby --forcemanual nn6001
 # Make sure that the NamenodeHeartbeatService reports that nn6001 is standby.

 

*Fix the bug*

When the router's cache says an ns has no active nn but the ns actually does 
have one, and a client request ends up throwing NoNamenodesAvailableException, 
the nn that was just tried is proven to be a real standby nn. The priority of 
that nn should be lowered so that the next request reaches the real active nn, 
instead of repeatedly hitting the real standby nn and leaving the router unable 
to serve the ns until the next cache refresh.


[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: HDFS-17166.001.patch




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: HDFS-17166.002.patch




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: (was: 
fix_NoNamenodesAvailableException_long_time_when_ns_failover.patch)




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Attachment: 
fix_NoNamenodesAvailableException_long_time_when_ns_failover.patch
Status: Patch Available  (was: Open)




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17166:
--
Labels: pull-request-available  (was: )




[jira] [Updated] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

2023-08-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17166:
--
Description: 
When a nameservice (ns) fails over, the router may record that the ns has no 
active namenode and then be unable to find the active nn in that ns for about 1 
minute. The client reports an error after exhausting its retries, and the router 
is unable to serve requests for that ns for a long time.

 11:52:44 errors start being reported
!image-2023-08-26-00-24-02-016.png|width=800,height=100!
11:53:46 errors stop being reported

!image-2023-08-26-00-25-42-086.png|width=800,height=50!

 

At this point the failover has already completed successfully in the ns, and a 
client that connects to the active namenode directly can access it, but the 
client cannot access the ns through the router for up to a minute.

 

*There is a bug in this logic:*

* A nameservice starts to fail over.
* While there is momentarily no active nn in the ns, the router reports that 
status (no active nn) to the state store.
* After a while, the router pulls the state store data to refresh its cache, 
and the cache now records that the ns has no active nn.
* The failover then completes successfully, so the ns actually has an active nn 
again.
* Assume the router is not yet due to refresh its cache.
* A client sends a request for the ns to the router, and the router tries the 
first nn of the ns in its cache (which, according to the cache, has no active 
nn).
* Unfortunately, that nn really is standby, so the request fails and enters the 
exception-handling logic. The router finds no active nn for the ns in its cache 
and throws NoNamenodesAvailableException.
* The NoNamenodesAvailableException is wrapped as a RetriableException, which 
makes the client retry. Because every retry reaches the real standby nn in the 
cache (it is always the first entry and has the highest priority), a 
NoNamenodesAvailableException is thrown every time until the router refreshes 
the cache from the state store.

 

*Fix the bug*

When the router's cache says an ns has no active nn but the ns actually does 
have one, and a client request ends up throwing NoNamenodesAvailableException, 
the nn that was just tried is proven to be a real standby nn. The priority of 
that nn should be lowered so that the next request reaches the real active nn, 
instead of repeatedly hitting the real standby nn and leaving the router unable 
to serve the ns until the next cache refresh.
