[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759850#comment-17759850 ]
Duo Zhang commented on HBASE-28048:
-----------------------------------

Aborting the master does not help here: the new master will still try to send the procedure to the same region server. We could add a log message saying that we have retried for a long time and cannot succeed while the region server is still alive, and ask the operator to check manually.

> RSProcedureDispatcher to abort executing request after configurable retries
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-28048
>                 URL: https://issues.apache.org/jira/browse/HBASE-28048
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.5
>            Reporter: Viraj Jasani
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
> In a recent incident, we observed that RSProcedureDispatcher continues executing region open/close procedures with unbounded retries, even in the presence of known failures like a GSS initiate failure:
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=0, retrying...
> {code}
> If the remote execution results in an IOException, the dispatcher attempts to schedule the procedure for further retries:
> {code:java}
>   private boolean scheduleForRetry(IOException e) {
>     LOG.debug("Request to {} failed, try={}", serverName, numberOfAttemptsSoFar, e);
>     // Should we wait a little before retrying? If the server is starting it's yes.
>     ...
>     ...
>     ...
>     numberOfAttemptsSoFar++;
>     // Add some backoff here as the attempts rise otherwise if a stuck condition, will fill logs
>     // with failed attempts. None of our backoff classes -- RetryCounter or ClientBackoffPolicy
>     // -- fit here nicely so just do something simple; increment by rsRpcRetryInterval millis *
>     // retry^2 on each try up to max of 10 seconds (don't want to back off too much in case of
>     // situation change).
>     submitTask(this,
>       Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
>         10 * 1000),
>       TimeUnit.MILLISECONDS);
>     return true;
>   }
> {code}
> Even though we apply backoff between retries, the maximum wait time is 10 seconds:
> {code:java}
>     submitTask(this,
>       Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
>         10 * 1000),
>       TimeUnit.MILLISECONDS);
> {code}
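For a sense of scale, here is a minimal standalone sketch (not HBase code) of that backoff formula; the class name is invented, and the 100 ms retry interval is an assumed value, not a confirmed default:

{code:java}
// Standalone illustration only (not HBase code): how the dispatcher's
// quadratic backoff behaves, assuming a 100 ms retry interval.
public class BackoffIllustration {
  public static void main(String[] args) {
    final long rsRpcRetryInterval = 100; // assumed value, in milliseconds
    for (int attempt = 1; attempt <= 12; attempt++) {
      long delayMs = Math.min(rsRpcRetryInterval * attempt * attempt, 10 * 1000);
      System.out.printf("try=%d -> backoff %d ms%n", attempt, delayMs);
    }
    // The delay grows 100, 400, 900, ... and is pinned at 10000 ms from
    // roughly try=10 onward, i.e. a permanently failing server keeps being
    // retried about every 10 seconds, forever.
  }
}
{code}

Because the delay plateaus so quickly, the retry rate never drops below roughly one attempt per 10 seconds, which is consistent with the logs below: about 40 minutes after try=0, the attempt counters are already in the hundreds.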
> This results in an endless loop of retries until either the underlying issue is fixed (e.g. the krb issue in this case) or the regionserver is killed and the ongoing open/close region procedure (and perhaps the entire SCP) for the affected regionserver is sidelined manually.
> {code:java}
> 2023-08-25 03:04:18,918 WARN [ispatcher-pool-41274] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=217, retrying...
> 2023-08-25 03:04:18,916 WARN [ispatcher-pool-41280] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=193, retrying...
> 2023-08-25 03:04:28,968 WARN [ispatcher-pool-41315] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=266, retrying...
> 2023-08-25 03:04:28,969 WARN [ispatcher-pool-41240] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=266, retrying...
> {code}
> While external issues like krb ticket expiry require operator intervention, it is not prudent to fill up the active handlers with endless retries while attempting RPCs against only a single affected regionserver. This eventually degrades the overall cluster state, especially when multiple regionservers are restarted as part of any planned activity.
> One possible resolution here would be:
> # Configure max retries as part of the ExecuteProceduresRequest (or it could be part of RemoteProcedureRequest).
> # RSProcedureDispatcher should use this retry count when scheduling failed requests for further retries.
> # After exhausting the retries, mark the remote call as failed and bubble the failure up to the parent procedure (a rough sketch follows at the end of this description).
> If the series of calls mentioned above results in aborting the active master, we should clearly log a FATAL/ERROR message with the underlying root cause (e.g. the GSS initiate failure in this case). That helps the operator either fix the krb ticket expiry or abort the regionserver, with the latter leading to an SCP performing the heavy task of WAL-splitting recoveries. However, this would not prevent other procedures, as well as the active handlers, from getting stuck executing remote calls without any conditional termination.
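To make the numbered proposal above concrete, here is a rough, hypothetical sketch of a bounded scheduleForRetry. The maxAttempts field, the configuration key named in the comment, and the notifyProcedureFailure() helper are invented for illustration; they are not existing HBase APIs and this is not the committed fix:

{code:java}
// Hypothetical sketch only, not the committed change. Assumes maxAttempts is
// read from a new (invented) configuration key, e.g.
// "hbase.regionserver.remote.procedure.max.retries", and that
// notifyProcedureFailure() is a placeholder for whatever mechanism bubbles
// the failure up to the parent procedure.
private boolean scheduleForRetry(IOException e) {
  LOG.debug("Request to {} failed, try={}", serverName, numberOfAttemptsSoFar, e);
  if (numberOfAttemptsSoFar >= maxAttempts) {
    // Loud, operator-facing message carrying the root cause (e.g. the GSS
    // initiate failure), along the lines suggested in the comment above.
    LOG.error("Request to {} failed {} times and will not be retried further, "
      + "please check the region server and the root cause manually",
      serverName, numberOfAttemptsSoFar, e);
    notifyProcedureFailure(e); // placeholder: surface the failure to the parent procedure
    return false;
  }
  numberOfAttemptsSoFar++;
  // Same quadratic backoff as today, still capped at 10 seconds.
  submitTask(this,
    Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
      10 * 1000),
    TimeUnit.MILLISECONDS);
  return true;
}
{code}

Whether the cap travels with ExecuteProceduresRequest/RemoteProcedureRequest or is read from the master's configuration is an open design choice; the essential change is that the dispatcher eventually stops retrying, logs the root cause loudly, and surfaces the failure to the parent procedure instead of looping every 10 seconds indefinitely.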