Viraj Jasani created HBASE-27955:
------------------------------------

             Summary: RefreshPeerProcedure should be resilient to replication endpoint failures
                 Key: HBASE-27955
                 URL: https://issues.apache.org/jira/browse/HBASE-27955
             Project: HBase
          Issue Type: Improvement
            Reporter: Viraj Jasani


UpdatePeerConfigProcedure gets stuck when a RefreshPeerProcedure fails, for 
example because a replication endpoint throws. The only ways to move forward 
are to restart the active master or to bypass the stuck procedure.

 

For instance,
{code:java}
2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] replication.RefreshPeerProcedure - Refresh peer core1.hbase1a_aws.prod5.uswest2 for UPDATE_CONFIG on {host},{port},1687053857180 failed
java.lang.NullPointerException via {host},{port},1687053857180:java.lang.NullPointerException:
    at org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
    at org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
    at org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
Caused by: java.lang.NullPointerException:
    at xyz(Abc.java:89)     <========= replication endpoint failure example
    at xyz(Abc.java:79)     <========= replication endpoint failure example
    at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
    at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
    at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
    at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
    at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
RefreshPeerProcedure should support reporting this failure and rolling back 
the parent procedure.
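
One way to read this proposal, as a minimal sketch and not the actual HBase fix: catch the failure thrown by each config listener, report it as part of the refresh result, and let the parent fall back to the previous config instead of hanging. Every name below (PeerRefreshSketch, ConfigListener, RefreshResult, refresh, updateConfig) is hypothetical; none of them is an HBase class or method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-ins. None of these types are HBase classes.
final class PeerRefreshSketch {

  /** Stand-in for a callback that a replication endpoint registers for config changes. */
  interface ConfigListener {
    void peerConfigUpdated(String newConfig);
  }

  /** Outcome of one refresh: the failures collected, empty on success. */
  static final class RefreshResult {
    final List<Throwable> failures;

    RefreshResult(List<Throwable> failures) {
      this.failures = Collections.unmodifiableList(failures);
    }

    boolean succeeded() {
      return failures.isEmpty();
    }
  }

  /** Notifies every listener and records failures instead of letting them propagate. */
  static RefreshResult refresh(List<ConfigListener> listeners, String newConfig) {
    List<Throwable> failures = new ArrayList<>();
    for (ConfigListener listener : listeners) {
      try {
        listener.peerConfigUpdated(newConfig);
      } catch (RuntimeException e) {
        // For example, a NullPointerException thrown by a replication endpoint.
        failures.add(e);
      }
    }
    return new RefreshResult(failures);
  }

  /** Stand-in for the parent: returns the config in effect after the attempt. */
  static String updateConfig(String oldConfig, String newConfig,
      List<ConfigListener> listeners) {
    RefreshResult result = refresh(listeners, newConfig);
    if (result.succeeded()) {
      return newConfig;
    }
    // Roll back: the reported failure ends the update instead of leaving it stuck.
    return oldConfig;
  }
}
```

In this sketch, a listener that throws makes updateConfig("old", "new", listeners) return "old", while well-behaved listeners make it return "new". A real change would also have to carry the failure from the region server back to the master, which this sketch leaves out.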



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
