Viraj Jasani created HBASE-27955:
------------------------------------

             Summary: RefreshPeerProcedure should be resilient to replication endpoint failures
                 Key: HBASE-27955
                 URL: https://issues.apache.org/jira/browse/HBASE-27955
             Project: HBase
          Issue Type: Improvement
            Reporter: Viraj Jasani

UpdatePeerConfigProcedure gets stuck when RefreshPeerProcedure encounters a failure. The only ways to move forward are restarting the active master or bypassing the stuck procedure.

For instance,
{code:java}
2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] replication.RefreshPeerProcedure - Refresh peer core1.hbase1a_aws.prod5.uswest2 for UPDATE_CONFIG on {host},{port},1687053857180 failed
java.lang.NullPointerException via {host},{port},1687053857180:java.lang.NullPointerException: 
    at org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
    at org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
    at org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
Caused by: java.lang.NullPointerException: 
    at xyz(Abc.java:89)    <========= replication endpoint failure example
    at xyz(Abc.java:79)    <========= replication endpoint failure example
    at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
    at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
    at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
    at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
    at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}

RefreshPeerProcedure should support reporting this failure and rolling back the parent procedure.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
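
Illustration of the requested behavior: in the trace above, the NullPointerException is thrown from inside the forEach in ReplicationPeerImpl.setPeerConfig, so one failing replication endpoint fails the whole refresh. The sketch below is not HBase code and does not use any HBase API. Every name in it (RefreshPeerSketch, RefreshResult, refresh, updatePeerConfig) is hypothetical, and it only assumes that listeners are invoked in a forEach as the trace suggests. It shows one possible shape for capturing the endpoint failure as a result that the parent step can act on by keeping the old config, instead of being left stuck.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, self-contained sketch; none of these types exist in HBase.
class RefreshPeerSketch {

    /** Outcome of one refresh attempt: success, or the captured failure. */
    static final class RefreshResult {
        final boolean succeeded;
        final Throwable error; // null on success

        RefreshResult(boolean succeeded, Throwable error) {
            this.succeeded = succeeded;
            this.error = error;
        }
    }

    /**
     * Notifies every listener of the new config. A runtime failure in any
     * listener is captured and returned instead of propagating to the caller.
     */
    static RefreshResult refresh(List<Consumer<String>> listeners, String newConfig) {
        try {
            listeners.forEach(listener -> listener.accept(newConfig));
            return new RefreshResult(true, null);
        } catch (RuntimeException e) {
            return new RefreshResult(false, e);
        }
    }

    /**
     * Stand-in for the parent step: returns the config that is in effect
     * afterwards, i.e. the new one on success and the old one on failure.
     */
    static String updatePeerConfig(List<Consumer<String>> listeners,
                                   String oldConfig, String newConfig) {
        RefreshResult result = refresh(listeners, newConfig);
        return result.succeeded ? newConfig : oldConfig;
    }

    public static void main(String[] args) {
        Consumer<String> healthy = config -> { };
        Consumer<String> failing = config -> {
            // Simulates the replication endpoint failure from the trace.
            throw new NullPointerException("simulated endpoint failure");
        };
        System.out.println(updatePeerConfig(Arrays.asList(healthy), "old", "new"));          // prints "new"
        System.out.println(updatePeerConfig(Arrays.asList(healthy, failing), "old", "new")); // prints "old"
    }
}
```

In this sketch the "rollback" is reduced to returning the old config value; a real fix would need to decide what state on the region servers and the master has to be reverted, which the issue does not specify.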