[
https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viraj Jasani resolved HBASE-28048.
----------------------------------
Assignee: Viraj Jasani
Resolution: Resolved
Resolved through sub-tasks.
> RSProcedureDispatcher to abort executing request after configurable retries
> ---------------------------------------------------------------------------
>
> Key: HBASE-28048
> URL: https://issues.apache.org/jira/browse/HBASE-28048
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.5
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
>
> In a recent incident, we observed that RSProcedureDispatcher continues
> executing region open/close procedures with unbounded retries, even in the
> presence of known failures such as a GSS initiate failure:
>
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=0, retrying...
> {code}
>
>
> If the remote execution results in an IOException, the dispatcher schedules the
> procedure for further retries:
>
> {code:java}
> private boolean scheduleForRetry(IOException e) {
>   LOG.debug("Request to {} failed, try={}", serverName, numberOfAttemptsSoFar, e);
>   // Should we wait a little before retrying? If the server is starting it's yes.
>   ...
>   ...
>   ...
>   numberOfAttemptsSoFar++;
>   // Add some backoff here as the attempts rise otherwise if a stuck condition, will fill logs
>   // with failed attempts. None of our backoff classes -- RetryCounter or ClientBackoffPolicy
>   // -- fit here nicely so just do something simple; increment by rsRpcRetryInterval millis *
>   // retry^2 on each try, up to max of 10 seconds (don't want to back off too much in case of
>   // situation change).
>   submitTask(this,
>     Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
>       10 * 1000),
>     TimeUnit.MILLISECONDS);
>   return true;
> }
> {code}
>
>
> Even though we apply some backoff between retries, the max wait time is capped at 10s:
>
> {code:java}
> submitTask(this,
>   Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
>     10 * 1000),
>   TimeUnit.MILLISECONDS);
> {code}
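>
> To make the ceiling concrete, here is a small standalone illustration (not
> HBase code) of the formula above, assuming a hypothetical rsRpcRetryInterval of
> 100 ms: the quadratic delay reaches the 10s cap by attempt 10, after which every
> further retry is re-submitted at a flat ~10s pace, indefinitely.
> {code:java}
> // Standalone illustration of the capped quadratic backoff shown above.
> // The 100 ms interval is an assumption for this example only.
> public class RetryBackoffDemo {
>   public static void main(String[] args) {
>     final long rsRpcRetryInterval = 100L; // assumed retry interval, in millis
>     for (int attempt : new int[] { 1, 2, 5, 10, 100, 266 }) {
>       long delayMs = Math.min(rsRpcRetryInterval * attempt * attempt, 10 * 1000L);
>       System.out.println("try=" + attempt + " -> delay=" + delayMs + " ms");
>     }
>     // Prints 100, 400, 2500, 10000, 10000, 10000 -- the delay never grows past 10s.
>   }
> }
> {code}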
>
>
> This results in an endless loop of retries, until either the underlying issue is
> fixed (e.g. the krb issue in this case) or the regionserver is killed and the
> ongoing open/close region procedure (and perhaps the entire SCP) for the affected
> regionserver is sidelined manually.
> {code:java}
> 2023-08-25 03:04:18,918 WARN [ispatcher-pool-41274] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=217, retrying...
> 2023-08-25 03:04:18,916 WARN [ispatcher-pool-41280] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=193, retrying...
> 2023-08-25 03:04:28,968 WARN [ispatcher-pool-41315] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=266, retrying...
> 2023-08-25 03:04:28,969 WARN [ispatcher-pool-41240] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=266, retrying...
> {code}
>
> While external issues like krb ticket expiry require operator intervention, it
> is not prudent to fill up the active handlers with endless retries while
> attempting to execute RPCs against a single affected regionserver. This
> eventually degrades the overall cluster state, especially in the event of
> multiple regionserver restarts resulting from planned activities.
> One possible resolution (see the sketch after this list) would be:
> # Configure the max retries as part of the ExecuteProceduresRequest (or it
> could be part of RemoteProcedureRequest)
> # This retry count should be used by RSProcedureDispatcher while scheduling
> failed requests for further retries
> # After exhausting the retries, mark the remote call as failed and bubble the
> failure up to the parent procedure.
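>
> A minimal sketch of the retry bound, assuming a maxAttempts value is made
> available to the dispatcher (from configuration or from the remote request
> itself); the names here are illustrative, not the actual implementation from
> the sub-tasks:
> {code:java}
> // Illustrative only: bound the retries and let the caller propagate the failure.
> private boolean scheduleForRetry(IOException e) {
>   LOG.debug("Request to {} failed, try={}", serverName, numberOfAttemptsSoFar, e);
>   // maxAttempts is assumed to come from configuration or from the remote request.
>   if (numberOfAttemptsSoFar >= maxAttempts) {
>     LOG.error("Request to {} failed after {} attempts, giving up", serverName,
>       numberOfAttemptsSoFar, e);
>     // Returning false signals the caller to mark the remote call as failed, so the
>     // failure bubbles up to the parent procedure instead of retrying forever.
>     return false;
>   }
>   numberOfAttemptsSoFar++;
>   submitTask(this,
>     Math.min(rsRpcRetryInterval * (numberOfAttemptsSoFar * numberOfAttemptsSoFar), 10 * 1000),
>     TimeUnit.MILLISECONDS);
>   return true;
> }
> {code}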
> If the above series of calls results in aborting the active master, we should
> clearly log a FATAL/ERROR message with the underlying root cause (e.g. the GSS
> initiate failure in this case). This can help the operator either fix the krb
> ticket expiry or abort the regionserver, which would lead to the SCP performing
> the heavy task of WAL splitting and recovery; however, that alone would not
> prevent other procedures as well as active handlers from getting stuck
> executing remote calls without any conditional termination.
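>
> As a small illustration of the logging point above (not actual HBase code), the
> wrapped exception chain can be unwound so the FATAL/ERROR message surfaces the
> SaslException root cause directly:
> {code:java}
> // Illustrative helper: walk the cause chain to surface the innermost exception
> // (e.g. the SaslException behind the wrapped IOException/DecoderException).
> private static Throwable findRootCause(Throwable t) {
>   Throwable cur = t;
>   while (cur.getCause() != null && cur.getCause() != cur) {
>     cur = cur.getCause();
>   }
>   return cur;
> }
>
> // Hypothetical usage once retries are exhausted:
> // LOG.error("Request to {} failed permanently, root cause: {}", serverName,
> //   findRootCause(e), e);
> {code}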
--
This message was sent by Atlassian Jira
(v8.20.10#820010)