[ https://issues.apache.org/jira/browse/HBASE-26955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553649#comment-17553649 ]
Duo Zhang commented on HBASE-26955:
-----------------------------------

In RemoteProcedureResultReporter, we reset the tries because we have successfully done an rpc call. And I do not fully understand the problem here: do you want to reset the sleep time between retries when we get some special types of exceptions?

> Improvement of the pause time between retries in Rpc caller
> -----------------------------------------------------------
>
>                 Key: HBASE-26955
>                 URL: https://issues.apache.org/jira/browse/HBASE-26955
>             Project: HBase
>          Issue Type: Improvement
>          Components: Client
>    Affects Versions: 2.4.2
>            Reporter: Haoze Wu
>            Priority: Major
>
> We were using HBase version 2.4.2 and had clients submitting requests to an HBase cluster. We observed an unreasonable delay of 10 seconds in some table creation requests. We analyzed the logs and jstack results and reproduced the issue. A jstack of the client shows that this weird long delay comes from [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L168]:
> {code:java}
> "main" #1 prio=5 os_prio=0 tid=0x00007f027c00b800 nid=0x530d in Object.wait() [0x00007f02853b1000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000003d52674f0> (a java.util.concurrent.atomic.AtomicBoolean)
>         at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:168)
>         - locked <0x00000003d52674f0> (a java.util.concurrent.atomic.AtomicBoolean)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3018)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3010)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:648)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:4326)
>         at org.apache.hadoop.hbase.client.Admin.createTable(Admin.java:317)
>         at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClient.createTable(HBaseGrayClient.java:63)
>         at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.create(HBaseGrayClientMain.java:70)
>         at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.run(HBaseGrayClientMain.java:157)
>         at edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.main(HBaseGrayClientMain.java:183)
> {code}
> This wait comes from the calculation within [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L96-L177], which can be simplified as follows:
> {code:java}
> public T callWithRetries(RetryingCallable<T> callable, int callTimeout)
>     throws IOException, RuntimeException {
>   // ...
>   for (int tries = 0;; tries++) {
>     long expectedSleep;
>     try {
>       // do the call ...
>     } catch (PreemptiveFastFailException e) {
>       throw e;
>     } catch (Throwable t) {
>       ExceptionUtil.rethrowIfInterrupt(t);
>       Throwable cause = t.getCause();
>       if (cause instanceof DoNotRetryIOException) { // line 113
>         // Fail fast
>         throw (DoNotRetryIOException) cause;
>       }
>       // translateException throws exception when should not retry: i.e. when request is bad.
>       interceptor.handleFailure(context, t);
>       t = translateException(t);
>       if (tries > startLogErrorsCnt) {
>         // log ...
>       }
>       // ...
>       // If the server is dead, we need to wait a little before retrying, to give
>       // a chance to the regions to be moved
>       // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
>       // special when encountering CallQueueTooBigException, see #HBASE-17114
>       long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause; // line 151
>       expectedSleep = callable.sleep(pauseBase, tries); // line 152
>       // If, after the planned sleep, there won't be enough time left, we stop now.
>       long duration = singleCallDuration(expectedSleep);
>       if (duration > callTimeout) {
>         // throw exception ...
>       }
>     } finally {
>       interceptor.updateFailureInfo(context);
>     }
>     try {
>       if (expectedSleep > 0) {
>         synchronized (cancelled) {
>           if (cancelled.get()) return null;
>           cancelled.wait(expectedSleep); // line 168
>         }
>       }
>       if (cancelled.get()) return null;
>     } catch (InterruptedException e) {
>       throw new InterruptedIOException("Interrupted after " + tries
>           + " tries while maxAttempts=" + maxAttempts);
>     }
>   }
> }
> {code}
> Within the retry loop, the `cause` variable represents the reason for the current failed request, and it is checked to determine what we should do. For example, the case of DoNotRetryIOException is checked at line 113, and the case of CallQueueTooBigException is checked at line 151. Finally we determine `expectedSleep`, the time we should wait before the next retry, and then we enforce this wait at line 168, unless it is canceled.
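> For reference, the `expectedSleep` computed at line 152 comes from a fixed backoff table. Below is a minimal runnable sketch of that computation (mirroring HConstants.RETRY_BACKOFF and ConnectionUtils.getPauseTime in 2.4.x; the real code also adds up to 1% random jitter, and the class name here is ours, for illustration):
> {code:java}
> // Sketch only, not the actual HBase source: backoff table and pause computation
> // as in 2.4.x (HConstants.RETRY_BACKOFF / ConnectionUtils.getPauseTime), jitter omitted.
> public class RetryBackoffDemo {
>   static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};
>
>   static long getPauseTime(long pauseMs, int tries) {
>     // Clamp the index into the table, then scale the base pause.
>     int idx = Math.max(0, Math.min(tries, RETRY_BACKOFF.length - 1));
>     return pauseMs * RETRY_BACKOFF[idx];
>   }
>
>   public static void main(String[] args) {
>     // With the default hbase.client.pause of 100 ms:
>     for (int tries = 0; tries <= 7; tries++) {
>       System.out.println("tries=" + tries + " -> sleep " + getPauseTime(100, tries) + " ms");
>     }
>     // tries=6 -> 4000 ms, tries=7 -> 10000 ms: the 10-second wait observed in the jstack.
>   }
> }
> {code}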
> The comments seem to imply that this retry is due to a dead server:
> {code:java}
>       // If the server is dead, we need to wait a little before retrying, to give
>       // a chance to the regions to be moved
>       // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
>       // special when encountering CallQueueTooBigException, see #HBASE-17114
>       long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause; // line 151
>       expectedSleep = callable.sleep(pauseBase, tries); // line 152
> {code}
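> As context for the pauseBase selection above, the two base pauses are configured independently; a minimal sketch, assuming the 2.4.x key names and defaults from HConstants (the fallback of pauseForCQTBE to pause is our reading of the client configuration code):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> // Assumption: key names/defaults as in HBase 2.4.x (HBASE-17114 added the cqtbe key).
> public class PauseConfigDemo {
>   public static void main(String[] args) {
>     Configuration conf = HBaseConfiguration.create();
>     long pause = conf.getLong("hbase.client.pause", 100);                 // normal base pause, ms
>     long pauseForCQTBE = conf.getLong("hbase.client.pause.cqtbe", pause); // CallQueueTooBigException pause
>     System.out.println("pause=" + pause + " ms, pauseForCQTBE=" + pauseForCQTBE + " ms");
>   }
> }
> {code}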
> However, the client log shows that some retries are not due to a dead server:
> {code:java}
> 2022-04-10 15:44:44,146 INFO [main] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=16, started=4858 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
>         at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:2729)
>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1985)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> , details=, see https://s.apache.org/timeout
> 2022-04-10 15:44:48,181 INFO [main] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=16, started=8893 ms ago, cancelled=false, msg=java.io.IOException: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:463)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> Caused by: java.io.UncheckedIOException: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:377)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.runWithoutRpcCall(RegionProcedureStore.java:313)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.update(RegionProcedureStore.java:364)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.insert(RegionProcedureStore.java:323)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.submitProcedure(ProcedureExecutor.java:1047)
>         at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil$NonceProcedureRunnable.submitProcedure(MasterProcedureUtil.java:107)
>         at org.apache.hadoop.hbase.master.HMaster$4.run(HMaster.java:2010)
>         at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.submitProcedure(MasterProcedureUtil.java:134)
>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1997)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
>         ... 3 more
> Caused by: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:370)
>         ... 14 more
> , details=, see https://s.apache.org/timeout
> 2022-04-10 15:44:58,835 INFO [main] client.HBaseAdmin: Operation: CREATE, Table Name: default:t4, procId: 22 completed
> {code}
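> The gaps in this log line up with the backoff table sketched earlier (assuming the default 100 ms base pause):
> {code:java}
> // tries=6 logged at 15:44:44 -> next sleep ~ 100 ms * RETRY_BACKOFF[6] = 100 * 40  =  4,000 ms
> // tries=7 logged at 15:44:48 -> next sleep ~ 100 ms * RETRY_BACKOFF[7] = 100 * 100 = 10,000 ms
> // CREATE completes at 15:44:58 -- the final ~10 s gap matches the tries=7 backoff slot.
> {code}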
> The client has some retries while waiting for the server to come up (PleaseHoldException: Master is initializing), then gets an internal error when creating the table, and finally gets the table created successfully. We understand that the waiting time for each retry increases exponentially as the retry number grows, according to the retry backoff mechanism. However, since the comments before line 151 do not specify how the retry pause time should be determined, we suspect that retries after a successful server startup should be backed off separately, without counting the retries spent waiting for the server to start. To confirm this hypothesis, we examined other RPC retry implementations in the same version (2.4.2) of HBase: [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RemoteProcedureResultReporter.java#L95-L111] and [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2453-L2475].
> For example, in RemoteProcedureResultReporter, exponential backoff is applied only while the server is still starting; for any other exception the pause is reset to the initial value:
> {code:java}
> while (!server.isStopped()) {
>   // ...
>   try {
>     server.reportProcedureDone(builder.build());
>     builder.clear();
>     tries = 0;
>   } catch (IOException e) {
>     boolean pause =
>       e instanceof ServerNotRunningYetException || e instanceof PleaseHoldException;
>     long pauseTime;
>     if (pause) {
>       // Do backoff else we flood the Master with requests.
>       pauseTime = ConnectionUtils.getPauseTime(INIT_PAUSE_TIME_MS, tries);
>     } else {
>       pauseTime = INIT_PAUSE_TIME_MS; // Reset.
>     }
>     LOG.info("Failed procedure report " + TextFormat.shortDebugString(request) + "; retry (#" +
>       tries + ")" + (pause ? " after " + pauseTime + "ms delay (Master is coming online...)."
>         : " immediately."),
>       e);
>     Threads.sleep(pauseTime);
>     tries++;
>   }
> }
> {code}
> Similarly, we suspect RpcRetryingCallerImpl should treat ServerNotRunningYetException and PleaseHoldException in this way. We followed this thought to write a fix and found that the symptom no longer occurs. We tested the issue and the fix on versions 2.4.2, 2.4.11, and 3.0.0-alpha-2; we believe the issue exists in all of them and our fix is applicable to all of them.
> P.S. In the latest version (3.0.0-alpha-2), RpcRetryingCallerImpl no longer exists; the issue and the fix apply to AsyncRpcRetryingCaller instead.
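> To make the intended behavior concrete, here is a runnable sketch (not the actual patch; the class name and failure counts are illustrative, taken from the log above) that simulates the total sleep with and without restarting the backoff index once the startup-type exceptions stop:
> {code:java}
> // Simulation only: seven startup failures (PleaseHoldException) followed by one
> // unrelated IOException, using the 2.4.x backoff table and the 100 ms default pause.
> public class BackoffResetSketch {
>   static final int[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};
>   static final long PAUSE_MS = 100;
>
>   static long pauseFor(int tries) {
>     return PAUSE_MS * RETRY_BACKOFF[Math.min(tries, RETRY_BACKOFF.length - 1)];
>   }
>
>   public static void main(String[] args) {
>     int startupFailures = 7;
>     long current = 0, proposed = 0;
>     int tries = 0;
>     for (int i = 0; i < startupFailures; i++, tries++) {
>       current += pauseFor(tries);
>       proposed += pauseFor(tries); // both behaviors back off while the master starts
>     }
>     // One post-startup failure (the IOException in the log above):
>     current += pauseFor(tries);  // inherits the deep slot: RETRY_BACKOFF[7] -> 10,000 ms
>     proposed += pauseFor(0);     // proposed: restart at RETRY_BACKOFF[0] -> 100 ms
>     System.out.println("current total sleep  = " + current + " ms");  // 18100 ms
>     System.out.println("proposed total sleep = " + proposed + " ms"); //  8200 ms
>   }
> }
> {code}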