[ https://issues.apache.org/jira/browse/HBASE-26955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553649#comment-17553649 ]
Duo Zhang commented on HBASE-26955:
-----------------------------------

In RemoteProcedureResultReporter, we reset the tries because we have successfully done an rpc call. And I do not fully understand the problem here: do you want to reset the sleep time between retries when we get some special types of exceptions?

> Improvement of the pause time between retries in Rpc caller
> -----------------------------------------------------------
>
>                 Key: HBASE-26955
>                 URL: https://issues.apache.org/jira/browse/HBASE-26955
>             Project: HBase
>          Issue Type: Improvement
>          Components: Client
>    Affects Versions: 2.4.2
>            Reporter: Haoze Wu
>            Priority: Major
>
> We were using HBase version 2.4.2 and had clients submitting requests to an HBase cluster. We observed an unreasonable delay of 10 seconds in some table creation requests. We analyzed the logs and jstack results and reproduced the issue. A jstack of the client shows that this weird long delay comes from [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L168]:
> {code:java}
> "main" #1 prio=5 os_prio=0 tid=0x00007f027c00b800 nid=0x530d in Object.wait() [0x00007f02853b1000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000003d52674f0> (a java.util.concurrent.atomic.AtomicBoolean)
>         at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:168)
>         - locked <0x00000003d52674f0> (a java.util.concurrent.atomic.AtomicBoolean)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3018)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3010)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:648)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:4326)
>         at org.apache.hadoop.hbase.client.Admin.createTable(Admin.java:317)
>         at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClient.createTable(HBaseGrayClient.java:63)
>         at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.create(HBaseGrayClientMain.java:70)
>         at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.run(HBaseGrayClientMain.java:157)
>         at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.main(HBaseGrayClientMain.java:183)
> {code}
> This wait comes from the calculation within [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L96-L177], which can be simplified as follows:
> {code:java}
> public T callWithRetries(RetryingCallable<T> callable, int callTimeout)
>     throws IOException, RuntimeException {
>   // ...
>   for (int tries = 0;; tries++) {
>     long expectedSleep;
>     try {
>       // do the call ...
>     } catch (PreemptiveFastFailException e) {
>       throw e;
>     } catch (Throwable t) {
>       ExceptionUtil.rethrowIfInterrupt(t);
>       Throwable cause = t.getCause();
>       if (cause instanceof DoNotRetryIOException) { // line 113
>         // Fail fast
>         throw (DoNotRetryIOException) cause;
>       }
>       // translateException throws exception when should not retry: i.e. when request is bad.
>       interceptor.handleFailure(context, t);
>       t = translateException(t);
>       if (tries > startLogErrorsCnt) {
>         // log ...
>       }
>       // ...
>       // If the server is dead, we need to wait a little before retrying, to give
>       // a chance to the regions to be moved
>       // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
>       // special when encountering CallQueueTooBigException, see #HBASE-17114
>       long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause; // line 151
>       expectedSleep = callable.sleep(pauseBase, tries); // line 152
>       // If, after the planned sleep, there won't be enough time left, we stop now.
>       long duration = singleCallDuration(expectedSleep);
>       if (duration > callTimeout) {
>         // throw exception ...
>       }
>     } finally {
>       interceptor.updateFailureInfo(context);
>     }
>     try {
>       if (expectedSleep > 0) {
>         synchronized (cancelled) {
>           if (cancelled.get()) return null;
>           cancelled.wait(expectedSleep); // line 168
>         }
>       }
>       if (cancelled.get()) return null;
>     } catch (InterruptedException e) {
>       throw new InterruptedIOException("Interrupted after " + tries
>           + " tries while maxAttempts=" + maxAttempts);
>     }
>   }
> }
> {code}
> Within the retry loop, the `cause` variable represents the reason for the current failed request, and it is checked to determine what we should do. For example, the case of DoNotRetryIOException is checked at line 113, and the case of CallQueueTooBigException is checked at line 151. Finally we determine `expectedSleep`, the time we should wait before the next retry, and then we enforce this wait at line 168, unless it is canceled.
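> For reference, the `expectedSleep` computed at line 152 comes from a fixed backoff table. Below is a minimal runnable sketch of that computation (mirroring HConstants.RETRY_BACKOFF and ConnectionUtils.getPauseTime in 2.4.x; the real code also adds up to 1% random jitter, and the class name here is ours, for illustration):
> {code:java}
> // Sketch only, not the actual HBase source: backoff table and pause computation
> // as in 2.4.x (HConstants.RETRY_BACKOFF / ConnectionUtils.getPauseTime), jitter omitted.
> public class RetryBackoffDemo {
>   static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};
>
>   static long getPauseTime(long pauseMs, int tries) {
>     // Clamp the index into the table, then scale the base pause.
>     int idx = Math.max(0, Math.min(tries, RETRY_BACKOFF.length - 1));
>     return pauseMs * RETRY_BACKOFF[idx];
>   }
>
>   public static void main(String[] args) {
>     // With the default hbase.client.pause of 100 ms:
>     for (int tries = 0; tries <= 7; tries++) {
>       System.out.println("tries=" + tries + " -> sleep " + getPauseTime(100, tries) + " ms");
>     }
>     // tries=6 -> 4000 ms, tries=7 -> 10000 ms: the 10-second wait observed in the jstack.
>   }
> }
> {code}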
> The comments seem to imply that this retry is due to a dead server:
> {code:java}
>       // If the server is dead, we need to wait a little before retrying, to give
>       // a chance to the regions to be moved
>       // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
>       // special when encountering CallQueueTooBigException, see #HBASE-17114
>       long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause; // line 151
>       expectedSleep = callable.sleep(pauseBase, tries); // line 152
> {code}
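> As context for the pauseBase selection above, the two base pauses are configured independently; a minimal sketch, assuming the 2.4.x key names and defaults from HConstants (the fallback of pauseForCQTBE to pause is our reading of the client configuration code):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> // Assumption: key names/defaults as in HBase 2.4.x (HBASE-17114 added the cqtbe key).
> public class PauseConfigDemo {
>   public static void main(String[] args) {
>     Configuration conf = HBaseConfiguration.create();
>     long pause = conf.getLong("hbase.client.pause", 100);                 // normal base pause, ms
>     long pauseForCQTBE = conf.getLong("hbase.client.pause.cqtbe", pause); // CallQueueTooBigException pause
>     System.out.println("pause=" + pause + " ms, pauseForCQTBE=" + pauseForCQTBE + " ms");
>   }
> }
> {code}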
> However, the client log shows that some retries are not due to a dead server:
> {code:java}
> 2022-04-10 15:44:44,146 INFO [main] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=16, started=4858 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
>         at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:2729)
>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1985)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> , details=, see https://s.apache.org/timeout
> 2022-04-10 15:44:48,181 INFO [main] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=16, started=8893 ms ago, cancelled=false, msg=java.io.IOException: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:463)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> Caused by: java.io.UncheckedIOException: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:377)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.runWithoutRpcCall(RegionProcedureStore.java:313)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.update(RegionProcedureStore.java:364)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.insert(RegionProcedureStore.java:323)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.submitProcedure(ProcedureExecutor.java:1047)
>         at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil$NonceProcedureRunnable.submitProcedure(MasterProcedureUtil.java:107)
>         at org.apache.hadoop.hbase.master.HMaster$4.run(HMaster.java:2010)
>         at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.submitProcedure(MasterProcedureUtil.java:134)
>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1997)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
>         ... 3 more
> Caused by: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:370)
>         ... 14 more
> , details=, see https://s.apache.org/timeout
> 2022-04-10 15:44:58,835 INFO [main] client.HBaseAdmin: Operation: CREATE, Table Name: default:t4, procId: 22 completed
> {code}
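> The gaps in this log line up with the backoff table sketched earlier (assuming the default 100 ms base pause):
> {code:java}
> // tries=6 logged at 15:44:44 -> next sleep ~ 100 ms * RETRY_BACKOFF[6] = 100 * 40  =  4,000 ms
> // tries=7 logged at 15:44:48 -> next sleep ~ 100 ms * RETRY_BACKOFF[7] = 100 * 100 = 10,000 ms
> // CREATE completes at 15:44:58 -- the final ~10 s gap matches the tries=7 backoff slot.
> {code}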
> The client has some retries while waiting for the server to come up (PleaseHoldException: Master is initializing), then gets an internal error when creating the table, and finally gets the table created successfully. We understand that the waiting time for each retry increases exponentially as the retry number grows, according to the retry backoff mechanism. However, since the comments before line 151 do not specify how the retry pause time should be determined, we suspect that retries after a successful server startup should be backed off separately, without counting the retries spent waiting for the server to start. To confirm this hypothesis, we examined other RPC retry implementations in the same version (2.4.2) of HBase: [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RemoteProcedureResultReporter.java#L95-L111] and [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2453-L2475].
> For example, in RemoteProcedureResultReporter, exponential backoff is applied only while the server is still starting; for any other exception the pause is reset to the initial value:
> {code:java}
> while (!server.isStopped()) {
>   // ...
>   try {
>     server.reportProcedureDone(builder.build());
>     builder.clear();
>     tries = 0;
>   } catch (IOException e) {
>     boolean pause =
>       e instanceof ServerNotRunningYetException || e instanceof PleaseHoldException;
>     long pauseTime;
>     if (pause) {
>       // Do backoff else we flood the Master with requests.
>       pauseTime = ConnectionUtils.getPauseTime(INIT_PAUSE_TIME_MS, tries);
>     } else {
>       pauseTime = INIT_PAUSE_TIME_MS; // Reset.
>     }
>     LOG.info("Failed procedure report " + TextFormat.shortDebugString(request) + "; retry (#" +
>       tries + ")" + (pause ? " after " + pauseTime + "ms delay (Master is coming online...)."
>         : " immediately."),
>       e);
>     Threads.sleep(pauseTime);
>     tries++;
>   }
> }
> {code}
> Similarly, we suspect RpcRetryingCallerImpl should treat ServerNotRunningYetException and PleaseHoldException in this way. We followed this thought to write a fix and found that the symptom no longer occurs. We tested the issue and the fix on versions 2.4.2, 2.4.11, and 3.0.0-alpha-2; we believe the issue exists in all of them and our fix is applicable to all of them.
> P.S. In the latest version (3.0.0-alpha-2), RpcRetryingCallerImpl no longer exists; the issue and the fix apply to AsyncRpcRetryingCaller instead.
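> To make the intended behavior concrete, here is a runnable sketch (not the actual patch; the class name and failure counts are illustrative, taken from the log above) that simulates the total sleep with and without restarting the backoff index once the startup-type exceptions stop:
> {code:java}
> // Simulation only: seven startup failures (PleaseHoldException) followed by one
> // unrelated IOException, using the 2.4.x backoff table and the 100 ms default pause.
> public class BackoffResetSketch {
>   static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};
>   static final long PAUSE_MS = 100;
>
>   static long pauseFor(int tries) {
>     return PAUSE_MS * RETRY_BACKOFF[Math.min(tries, RETRY_BACKOFF.length - 1)];
>   }
>
>   public static void main(String[] args) {
>     int startupFailures = 7;
>     long current = 0, proposed = 0;
>     int tries = 0;
>     for (int i = 0; i < startupFailures; i++, tries++) {
>       current += pauseFor(tries);
>       proposed += pauseFor(tries); // both behaviors back off while the master starts
>     }
>     // One post-startup failure (the IOException in the log above):
>     current += pauseFor(tries);  // inherits the deep slot: RETRY_BACKOFF[7] -> 10,000 ms
>     proposed += pauseFor(0);     // proposed: restart at RETRY_BACKOFF[0] -> 100 ms
>     System.out.println("current total sleep  = " + current + " ms");  // 18100 ms
>     System.out.println("proposed total sleep = " + proposed + " ms"); //  8200 ms
>   }
> }
> {code}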