[ https://issues.apache.org/jira/browse/HBASE-27955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Jasani updated HBASE-27955:
---------------------------------
    Affects Version/s: 2.4.17

> RefreshPeerProcedure should be resilient to replication endpoint failures
> -------------------------------------------------------------------------
>
>                 Key: HBASE-27955
>                 URL: https://issues.apache.org/jira/browse/HBASE-27955
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.4.17
>            Reporter: Viraj Jasani
>            Priority: Major
>
> UpdatePeerConfigProcedure gets stuck when we see some failures in
> RefreshPeerProcedure. The only way to move forward is either by restarting
> active master or bypassing the stuck procedure.
>
> For instance,
> {code:java}
> 2023-06-26 17:22:08,375 WARN [,queue=24,port=61000] replication.RefreshPeerProcedure - Refresh peer peer0 for UPDATE_CONFIG on {host},{port},1687053857180 failed
> java.lang.NullPointerException via {host},{port},1687053857180:java.lang.NullPointerException:
>     at org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
>     at org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
>     at org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
>     at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
> Caused by: java.lang.NullPointerException:
>     at xyz(Abc.java:89) <========= replication endpoint failure example
>     at xyz(Abc.java:79) <========= replication endpoint failure example
>     at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
>     at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
>     at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
>     at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
>     at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {code}
> RefreshPeerProcedure should support reporting this failure and rollback of
> the parent procedure.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)