[ 
https://issues.apache.org/jira/browse/HBASE-27955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738226#comment-17738226
 ] 

Viraj Jasani commented on HBASE-27955:
--------------------------------------

That is correct: the NPE is a code bug in the custom replication endpoint. 
However, the point I am trying to make is that as soon as this NPE gets 
reported, RefreshPeerProcedure completes but is not rolled back (rollback is 
not supported). The next step in the parent procedure, i.e. 
POST_PEER_MODIFICATION, then stays stuck and never gets executed. The only 
clue I have is that the previous step of the procedure reported the above NPE 
and then completed (with the succ flag set to false):

{code:java}
@Override
protected void complete(MasterProcedureEnv env, Throwable error) {
  if (error != null) {
    LOG.warn("Refresh peer {} for {} on {} failed", peerId, type, targetServer, error);
    this.succ = false;
  } else {
    LOG.info("Refresh peer {} for {} on {} suceeded", peerId, type, targetServer);
    this.succ = true;
  }
}
{code}

The thread dumps showed nothing that would indicate why 
POST_PEER_MODIFICATION was stuck.

If we could introduce rollback in RefreshPeerProcedure, that would at least 
let the parent procedure complete with a rollback rather than staying stuck at 
the next step (POST_PEER_MODIFICATION).
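
To make the idea concrete, here is a minimal, self-contained sketch of the control flow I have in mind. It is not HBase code: RefreshStep, UpdateConfigParent, Outcome and the APPLY_NEW_CONFIG / UNDO_NEW_CONFIG step names are hypothetical stand-ins, and only the succ flag and POST_PEER_MODIFICATION come from the discussion above. The assumption is that the parent inspects the child's succ flag and undoes its earlier work instead of waiting at the next step.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative only: none of these types are HBase classes. They model a
 * parent procedure whose child step reports failure through a succ flag.
 */
public class RefreshRollbackSketch {

  /** Hypothetical final state of the parent procedure. */
  enum Outcome { SUCCESS, ROLLED_BACK }

  /** Hypothetical child step; complete() mirrors the succ handling quoted above. */
  static final class RefreshStep {
    private boolean succ;
    private Throwable failure;

    void complete(Throwable error) {
      this.succ = (error == null);
      this.failure = error;
    }

    boolean succeeded() {
      return succ;
    }

    Throwable failure() {
      return failure;
    }
  }

  /** Hypothetical parent: apply config, refresh the peer, then post-modification. */
  static final class UpdateConfigParent {
    final List<String> trace = new ArrayList<>();

    Outcome run(Throwable remoteError) {
      trace.add("APPLY_NEW_CONFIG"); // hypothetical step name

      RefreshStep refresh = new RefreshStep();
      refresh.complete(remoteError); // what the remote side reported back
      trace.add("REFRESH_PEER succ=" + refresh.succeeded());

      if (!refresh.succeeded()) {
        // Assumed behaviour: undo the earlier step and finish the procedure,
        // rather than never reaching the next step.
        trace.add("UNDO_NEW_CONFIG"); // hypothetical step name
        return Outcome.ROLLED_BACK;
      }

      trace.add("POST_PEER_MODIFICATION");
      return Outcome.SUCCESS;
    }
  }

  public static void main(String[] args) {
    UpdateConfigParent failing = new UpdateConfigParent();
    Outcome failed = failing.run(new NullPointerException("endpoint bug"));
    System.out.println(failed + " " + failing.trace);

    UpdateConfigParent healthy = new UpdateConfigParent();
    Outcome ok = healthy.run(null);
    System.out.println(ok + " " + healthy.trace);
  }
}
```

In the failing run the parent finishes as ROLLED_BACK and never enters POST_PEER_MODIFICATION, which is the behaviour I would prefer over a procedure that has to be bypassed manually.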

> RefreshPeerProcedure should be resilient to replication endpoint failures
> -------------------------------------------------------------------------
>
>                 Key: HBASE-27955
>                 URL: https://issues.apache.org/jira/browse/HBASE-27955
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Viraj Jasani
>            Priority: Major
>
> UpdatePeerConfigProcedure gets stuck when we see some failures in 
> RefreshPeerProcedure. The only way to move forward is either restarting the 
> active master or bypassing the stuck procedure.
>  
> For instance,
> {code:java}
> 2023-06-26 17:22:08,375 WARN  [,queue=24,port=61000] replication.RefreshPeerProcedure - Refresh peer peer0 for UPDATE_CONFIG on {host},{port},1687053857180 failed
> java.lang.NullPointerException via {host},{port},1687053857180:java.lang.NullPointerException: 
>     at org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
>     at org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2406)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082)
>     at org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2401)
>     at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16296)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:385)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:369)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:349)
> Caused by: java.lang.NullPointerException: 
>     at xyz(Abc.java:89)     <========= replication endpoint failure example
>     at xyz(Abc.java:79)     <========= replication endpoint failure example
>     at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.lambda$setPeerConfig$0(ReplicationPeerImpl.java:63)
>     at java.util.ArrayList.forEach(ArrayList.java:1259)
>     at org.apache.hadoop.hbase.replication.ReplicationPeerImpl.setPeerConfig(ReplicationPeerImpl.java:63)
>     at org.apache.hadoop.hbase.replication.regionserver.PeerProcedureHandlerImpl.updatePeerConfig(PeerProcedureHandlerImpl.java:131)
>     at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:70)
>     at org.apache.hadoop.hbase.replication.regionserver.RefreshPeerCallable.call(RefreshPeerCallable.java:35)
>     at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:49)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:98)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:750)
> {code}
> RefreshPeerProcedure should support reporting this failure and rollback of 
> the parent procedure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
