[ https://issues.apache.org/jira/browse/HDFS-17769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guo Wei updated HDFS-17769:
---------------------------
    Description: 
When the Router forwards read requests to the Observer NameNode, a cluster under a heavy write workload can leave the Observer unable to keep pace with edit log tailing; this can happen even when dfs.ha.tail-edits.in-progress is enabled.
The Observer then fails reads with RetriableException: Observer Node is too far behind. When the client's ipc.client.ping parameter is set to true, the client keeps waiting and retrying, so applications cannot obtain the data they need in time. In this situation we should consider letting the Active NameNode serve the read instead.
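
As a rough illustration only (not the patch itself), the sketch below shows the kind of client that hits this path: reads go through the Router using the RouterObserverReadProxyProvider class that appears in the stack trace further down, and ipc.client.ping keeps the call alive while the Observer lags. The nameservice name "router-fs" and the path are placeholders; the remaining RBF/HA address settings are omitted.
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative client only: reads go to the Router, which forwards them to the
 * Observer. "router-fs" and the file path are placeholders.
 */
public class ObserverReadClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Observer reads through RBF use the proxy provider seen in the stack
    // trace below; dfs.ha.namenodes.* / rpc-address settings for the Routers
    // are omitted here for brevity.
    conf.set("dfs.client.failover.proxy.provider.router-fs",
        "org.apache.hadoop.hdfs.server.namenode.ha.RouterObserverReadProxyProvider");
    // With ping enabled (the default) the client keeps the connection alive
    // with periodic pings rather than timing out, so the retry loop can wait
    // for a long time when the Observer is behind.
    conf.setBoolean("ipc.client.ping", true);

    try (FileSystem fs = FileSystem.get(new URI("hdfs://router-fs"), conf)) {
      // A metadata read such as getFileStatus issues getFileInfo, the call
      // that fails with "Observer Node is too far behind" in the trace below.
      System.out.println(fs.getFileStatus(new Path("/t.sh")));
    }
  }
}
{code}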

Below are some of the errors we observed and the verification of the fix:

The state id of the Observer falls too far behind the Active:
{code:java}
Tue Apr 15 11:22:41 CST 2025, Active latest txId: 5698245512, Observer latest txId:5695118653,Observer far behind: 3126859, time takes0s 
Tue Apr 15 11:22:43 CST 2025, Active latest txId: 5698253145, Observer latest txId:5695118653,Observer far behind: 3134492, time takes0s 
Tue Apr 15 11:22:45 CST 2025, Active latest txId: 5698260942, Observer latest txId:5695118653,Observer far behind: 3142289, time takes0s 
Tue Apr 15 11:22:47 CST 2025, Active latest txId: 5698268614, Observer latest txId:5695123653,Observer far behind: 3144961, time takes0s 
Tue Apr 15 11:22:49 CST 2025, Active latest txId: 5698276490, Observer latest txId:5695123653,Observer far behind: 3152837, time takes0s 
Tue Apr 15 11:22:51 CST 2025, Active latest txId: 5698284361, Observer latest txId:5695128653,Observer far behind: 3155708, time takes0s 
Tue Apr 15 11:22:54 CST 2025, Active latest txId: 5698292641, Observer latest txId:5695128653,Observer far behind: 3163988, time takes0s
{code}
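
How the comparison above was produced is not shown in the issue; one hypothetical way to generate similar output is to poll each NameNode's /jmx servlet and diff the last applied/written transaction ids. In the sketch below the host names, HTTP ports and the JournalTransactionInfo/LastAppliedOrWrittenTxId attribute are assumptions that may differ across Hadoop versions and deployments:
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical lag probe, not part of this issue: polls the /jmx servlet of
 * the Active and Observer NameNodes and diffs their last applied/written txId.
 */
public class ObserverLagProbe {
  // Assumed NameNode HTTP addresses; replace with the real hosts/ports.
  private static final String ACTIVE_HTTP = "http://activeNN:9870";
  private static final String OBSERVER_HTTP = "http://observerNN:9870";
  // Lenient match for the (assumed) LastAppliedOrWrittenTxId attribute.
  private static final Pattern TXID =
      Pattern.compile("LastAppliedOrWrittenTxId\\D*(\\d+)");

  public static void main(String[] args) throws Exception {
    HttpClient http = HttpClient.newHttpClient();
    while (true) {
      long start = System.currentTimeMillis();
      long activeTx = lastTxId(http, ACTIVE_HTTP);
      long observerTx = lastTxId(http, OBSERVER_HTTP);
      System.out.printf("%s, Active latest txId: %d, Observer latest txId:%d,"
              + "Observer far behind: %d, time takes%ds%n",
          new Date(start), activeTx, observerTx, activeTx - observerTx,
          (System.currentTimeMillis() - start) / 1000);
      Thread.sleep(2000);
    }
  }

  /** Scrapes the NameNodeInfo bean for the last applied/written transaction id. */
  private static long lastTxId(HttpClient http, String base) throws Exception {
    HttpRequest req = HttpRequest.newBuilder(URI.create(
        base + "/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo")).build();
    String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
    Matcher m = TXID.matcher(body);
    if (!m.find()) {
      throw new IllegalStateException("txId not found in " + base);
    }
    return Long.parseLong(m.group(1));
  }
}
{code}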
 

RetriableException:

The client throws a RetriableException and cannot complete the read through the Router:
{code:java}
10:16:53.744 [IPC Client (24555242) connection to routerIp:8888 from hdfs] DEBUG org.apache.hadoop.ipc.Client - IPC Client (24555242) connection to routerIp:8888 from hdfs: stopped, remaining connections 0 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RetriableException): Observer Node is too far behind: serverStateId = 5695128653 clientStateId = 5698292641 
 at sun.reflect.GeneratedConstructorAccessor49.newInstance(Unknown Source) 
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 
 at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
 at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121) 
 at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:110) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:505) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeSequential(RouterRpcClient.java:972) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterClientProtocol.getFileInfo(RouterClientProtocol.java:981) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcServer.getFileInfo(RouterRpcServer.java:883) 
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:1044) 
 at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) 
 at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621) 
 at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589) 
 at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573) 
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227) 
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1106) 
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029) 
 at java.security.AccessController.doPrivileged(Native Method) 
 at javax.security.auth.Subject.doAs(Subject.java:422) 
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) 
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3063) 
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RetriableException): Observer Node is too far behind: serverStateId = 5632963133 clientStateId = 5635526176 
 at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567) 
 at org.apache.hadoop.ipc.Client.call(Client.java:1513) 
 at org.apache.hadoop.ipc.Client.call(Client.java:1410) 
 at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258) 
 at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139) 
 at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source) 
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:966) 
 at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source) 
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
 at java.lang.reflect.Method.invoke(Method.java:498) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:637) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invoke(RouterRpcClient.java:654) 
 at org.apache.hadoop.hdfs.server.federation.router.RouterRpcClient.invokeMethod(RouterRpcClient.java:467) 
 ... 15 more 

 at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1584) 
 at org.apache.hadoop.ipc.Client.call(Client.java:1529) 
 at org.apache.hadoop.ipc.Client.call(Client.java:1426) 
 at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258) 
 at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139) 
 at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) 
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$getFileInfo$41(ClientNamenodeProtocolTranslatorPB.java:820) 
 at org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160) 
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:820) 
 at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) 
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
 at java.lang.reflect.Method.invoke(Method.java:498) 
 at org.apache.hadoop.hdfs.server.namenode.ha.RouterObserverReadProxyProvider$RouterObserverReadInvocationHandler.invoke(RouterObserverReadProxyProvider.java:216) 
 at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) 
 at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) 
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
 at java.lang.reflect.Method.invoke(Method.java:498) 
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:437) 
 at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:170) 
 at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:162) 
 at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:100) 
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:366) 
 at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) 
 at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1770) 
 at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1828) 
 at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1825) 
 at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) 
 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1840) 
 at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:611) 
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:468) 
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:432) 
 at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2592) 
 at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2558) 
 at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2520) 
 at hadoop.write_then_observer_read2.main(write_then_observer_read2.java:64) 
{code}
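
The trace shows the client spinning inside RetryInvocationHandler while the Router keeps returning the RetriableException. For context only, these are the standard client-side IPC settings that govern how long such a call can hang; tuning them merely bounds the symptom, which is why this issue proposes letting the Active serve the read instead. The values below are illustrative, not recommendations:
{code:java}
import org.apache.hadoop.conf.Configuration;

/**
 * Client-side IPC settings that shape the wait/retry behaviour described
 * above. Defaults differ between Hadoop releases; values are examples only.
 */
public class ClientRetryTuningSketch {
  public static Configuration tunedClientConf() {
    Configuration conf = new Configuration();
    // With ping enabled and no RPC timeout, the client keeps the connection
    // alive with periodic pings instead of timing the call out.
    conf.setBoolean("ipc.client.ping", true);
    conf.setInt("ipc.ping.interval", 60000);          // ping period in ms
    // An RPC timeout bounds how long a single call can wait, but the retries
    // through the Router still happen while the Observer lags.
    conf.setInt("ipc.client.rpc-timeout.ms", 120000);
    return conf;
  }
}
{code}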
 

Repair verification:
{code:java}
(1) View the status of the cluster NameNodes:
[root@20w ~]# hdfs haadmin -ns hh-rbf-test5 -getAllServiceState 
20w:8020                            active     
21w:8020                            standby    
22w:8020                            observer  

(2) Enable the dfs.namenode.observer.too.stale.retry.active.enable parameter and execute a read command on the 21w machine:
[root@21w ~]# hdfs dfs -cat /t.sh 
/bin/ssh $1

(3) The read RPC request shows up in hdfs-audit.log on the active namenode, so the request was forwarded to the active namenode:
[root@20w ~]# tail -f /data/disk02/var/log/hadoop/hdfs/hdfs-audit.log|grep t.sh 
2025-04-15 11:24:31,148 INFO FSNamesystem.audit: allowed=true   ugi=root (auth:SIMPLE)  ip=/xx cmd=getfileinfo src=/t.sh       dst=null        perm=null       proto=rpc 
2025-04-15 11:24:31,461 INFO FSNamesystem.audit: allowed=true   ugi=root (auth:SIMPLE)  ip=/xx cmd=open        src=/t.sh       dst=null        perm=null       proto=rpc

(4) The observer log contains entries for the retry to the active:
2025-04-15 11:24:30,148 WARN  namenode.FSNamesystem (GlobalStateIdContext.java:receiveRequestState(163)) - Retrying to Active NameNode, Observer Node is too far behind: serverStateId = 5695393653 clientStateId = 5699337672
{code}
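
For completeness, a rough reconstruction of the write-then-read check used in this verification (the write_then_observer_read2 class in the stack trace presumably does something similar; the local path and the defaultFS picked up from the site configuration are placeholders):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/**
 * Hypothetical write-then-read check: copy a local file in, then read it back
 * through the Router. If the Observer is too far behind and the retry-to-active
 * switch is off, the read below is where the client gets stuck.
 */
public class WriteThenObserverRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf)) {
      // Write path (copyFromLocalFile appears in the stack trace above).
      fs.copyFromLocalFile(new Path("/tmp/t.sh"), new Path("/t.sh"));
      // Read path; with RouterObserverReadProxyProvider this goes to the
      // Observer, and with the new flag enabled it should fall back to the
      // Active instead of retrying "Observer Node is too far behind".
      IOUtils.copyBytes(fs.open(new Path("/t.sh")), System.out, conf, false);
    }
  }
}
{code}
With dfs.namenode.observer.too.stale.retry.active.enable set on the NameNodes, the getfileinfo/open calls for /t.sh should then appear in the Active's audit log, as in step (3) above.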
 

  was:
When the Router forwards read requests to the Observer NameNode, a cluster under a heavy write workload can leave the Observer unable to keep pace with edit log tailing; this can happen even when dfs.ha.tail-edits.in-progress is enabled.
The Observer then fails reads with RetriableException: Observer Node is too far behind. When the client's ipc.client.ping parameter is set to true, the client keeps waiting and retrying, so applications cannot obtain the data they need in time. In this situation we should consider letting the Active NameNode serve the read instead.

Below are some of the errors we observed and the verification of the fix:

The state id of the Observer falls too far behind the Active: 1.png

RetriableException: 2.png

Repair verification: 3.png

 


> Allows client to actively retry to Active NameNode when the Observer NameNode 
> is too far behind client state id.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17769
>                 URL: https://issues.apache.org/jira/browse/HDFS-17769
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.4, 3.3.6, 3.4.1
>            Reporter: Guo Wei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.2
>
>         Attachments: 1.png, 2.png, 3.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
