[ https://issues.apache.org/jira/browse/HBASE-26955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555282#comment-17555282 ]

Haoze Wu commented on HBASE-26955:
----------------------------------

[~zhangduo] 

In short, what I'm proposing is fine-grained exponential backoff that is tracked 
separately per error type. For example, we would maintain a separate retries 
counter for each type of exception, so that essentially each exception type has 
its own backoff state machine.
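A rough sketch of the idea (purely illustrative, not the actual change from the 
pull request; the class, enum and method names below are made up):
{code:java}
// Illustrative only: keep one retry counter per "kind" of failure instead of a single
// shared `tries` counter, so a long Master-startup wait does not inflate the pause
// applied to later, unrelated failures. All names here are hypothetical.
import java.util.EnumMap;
import java.util.Map;

public class PerCauseBackoff {
  enum FailureKind { SERVER_STARTING, CALL_QUEUE_TOO_BIG, OTHER }

  // Mirrors the shape of HConstants.RETRY_BACKOFF.
  private static final int[] RETRY_BACKOFF = { 1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200 };

  private final Map<FailureKind, Integer> triesPerKind = new EnumMap<>(FailureKind.class);

  // Classified by class name only so this sketch compiles without HBase on the classpath.
  static FailureKind classify(Throwable t) {
    String name = t.getClass().getSimpleName();
    if (name.equals("ServerNotRunningYetException") || name.equals("PleaseHoldException")) {
      return FailureKind.SERVER_STARTING;
    }
    if (name.equals("CallQueueTooBigException")) {
      return FailureKind.CALL_QUEUE_TOO_BIG;
    }
    return FailureKind.OTHER;
  }

  /** Each failure kind advances its own backoff "state machine". */
  long nextPauseMs(Throwable t, long pauseBaseMs) {
    FailureKind kind = classify(t);
    int tries = triesPerKind.merge(kind, 1, Integer::sum) - 1; // per-kind counter
    return pauseBaseMs * RETRY_BACKOFF[Math.min(tries, RETRY_BACKOFF.length - 1)];
  }

  /** On a successful call the per-kind counters would simply be cleared. */
  void onSuccess() {
    triesPerKind.clear();
  }
}
{code}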

We previously discussed this in pull request 
[https://github.com/apache/hbase/pull/4349].

I'm looking for community feedback and comments on my thoughts.

> Improvement of the pause time between retries in Rpc caller
> -----------------------------------------------------------
>
>                 Key: HBASE-26955
>                 URL: https://issues.apache.org/jira/browse/HBASE-26955
>             Project: HBase
>          Issue Type: Improvement
>          Components: Client
>    Affects Versions: 2.4.2
>            Reporter: Haoze Wu
>            Priority: Major
>
> We were using HBase version 2.4.2 and had clients submitting requests to an 
> HBase cluster. We observed an unreasonably long delay of 10 seconds in some 
> table-create requests. We analyzed the logs and jstack output and reproduced 
> the issue. The client's jstack shows that this weird long delay comes from 
> [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L168]:
> {code:java}
> "main" #1 prio=5 os_prio=0 tid=0x00007f027c00b800 nid=0x530d in Object.wait() 
> [0x00007f02853b1000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000003d52674f0> (a 
> java.util.concurrent.atomic.AtomicBoolean)
>         at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:168)
>         - locked <0x00000003d52674f0> (a 
> java.util.concurrent.atomic.AtomicBoolean)
>         at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3018)
>         at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3010)
>         at 
> org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:648)
>         at 
> org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:4326)
>         at org.apache.hadoop.hbase.client.Admin.createTable(Admin.java:317)
>         at 
> edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClient.createTable(HBaseGrayClient.java:63)
>         at 
> edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.create(HBaseGrayClientMain.java:70)
>         at 
> edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.run(HBaseGrayClientMain.java:157)
>         at 
> edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.main(HBaseGrayClientMain.java:183)
> {code}
> This wait comes from the calculation in 
> [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L96-L177], 
> which can be simplified as follows:
> {code:java}
>   public T callWithRetries(RetryingCallable<T> callable, int callTimeout)
>   throws IOException, RuntimeException {
>     // ...
>     for (int tries = 0;; tries++) {
>       long expectedSleep;
>       try {
>         // do the call ...
>       } catch (PreemptiveFastFailException e) {
>         throw e;
>       } catch (Throwable t) {
>         ExceptionUtil.rethrowIfInterrupt(t);
>         Throwable cause = t.getCause();
>         if (cause instanceof DoNotRetryIOException) {  // line 113
>           // Fail fast
>           throw (DoNotRetryIOException) cause;
>         }
>         // translateException throws exception when should not retry: i.e. when request is bad.
>         interceptor.handleFailure(context, t);
>         t = translateException(t);
>         if (tries > startLogErrorsCnt) {
>           // log ...
>         }
>         // ...
>         // If the server is dead, we need to wait a little before retrying, to give
>         // a chance to the regions to be moved
>         // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
>         // special when encountering CallQueueTooBigException, see #HBASE-17114
>         long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause;  // line 151
>         expectedSleep = callable.sleep(pauseBase, tries); //  line 152
>         // If, after the planned sleep, there won't be enough time left, we stop now.
>         long duration = singleCallDuration(expectedSleep);
>         if (duration > callTimeout) {
>           // throw exception ...
>         }
>       } finally {
>         interceptor.updateFailureInfo(context);
>       }
>       try {
>         if (expectedSleep > 0) {
>           synchronized (cancelled) {
>             if (cancelled.get()) return null;
>             cancelled.wait(expectedSleep);  // line 168
>           }
>         }
>         if (cancelled.get()) return null;
>       } catch (InterruptedException e) {
>         throw new InterruptedIOException("Interrupted after " + tries
>             + " tries while maxAttempts=" + maxAttempts);
>       }
>     }
>   }
> {code}
> Within the retry loop, the caught exception (and its `cause`) is examined to 
> decide what to do next: for example, DoNotRetryIOException is handled at line 
> 113, and CallQueueTooBigException at line 151. Finally, we determine 
> `expectedSleep`, the time to wait before the next retry, and enforce this wait 
> at line 168 unless the call has been cancelled.
> The comments seem to imply that this retry pause is meant for the dead-server case:
> {code:java}
>         // If the server is dead, we need to wait a little before retrying, to give
>         // a chance to the regions to be moved
>         // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
>         // special when encountering CallQueueTooBigException, see #HBASE-17114
>         long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause;  // line 151
>         expectedSleep = callable.sleep(pauseBase, tries); //  line 152
> {code}
> However, the client log shows that some of the retries are not due to a dead server:
> {code:java}
> 2022-04-10 15:44:44,146 INFO  [main] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=16, started=4858 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
>         at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:2729)
>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1985)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> , details=, see https://s.apache.org/timeout
> 2022-04-10 15:44:48,181 INFO  [main] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=16, started=8893 ms ago, cancelled=false, msg=java.io.IOException: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:463)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> Caused by: java.io.UncheckedIOException: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:377)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.runWithoutRpcCall(RegionProcedureStore.java:313)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.update(RegionProcedureStore.java:364)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.insert(RegionProcedureStore.java:323)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.submitProcedure(ProcedureExecutor.java:1047)
>         at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil$NonceProcedureRunnable.submitProcedure(MasterProcedureUtil.java:107)
>         at org.apache.hadoop.hbase.master.HMaster$4.run(HMaster.java:2010)
>         at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.submitProcedure(MasterProcedureUtil.java:134)
>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1997)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
>         ... 3 more
> Caused by: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:370)
>         ... 14 more
> , details=, see https://s.apache.org/timeout
> 2022-04-10 15:44:58,835 INFO  [main] client.HBaseAdmin: Operation: CREATE, Table Name: default:t4, procId: 22 completed
> {code}
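> To put rough numbers on those log lines: as far as we can tell, 
> callable.sleep(pauseBase, tries) boils down to ConnectionUtils.getPauseTime(), 
> which multiplies pauseBase by an entry of the HConstants.RETRY_BACKOFF table 
> plus a small random jitter. A minimal sketch of that arithmetic, assuming the 
> default hbase.client.pause of 100 ms and ignoring the jitter:
> {code:java}
> // Sketch of the pause-time arithmetic (jitter omitted); the table mirrors HConstants.RETRY_BACKOFF.
> public class PauseTimeSketch {
>   static final int[] RETRY_BACKOFF = { 1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200 };
>
>   public static void main(String[] args) {
>     long pauseBase = 100; // default hbase.client.pause, in milliseconds
>     for (int tries = 0; tries <= 7; tries++) {
>       long sleepMs = pauseBase * RETRY_BACKOFF[Math.min(tries, RETRY_BACKOFF.length - 1)];
>       System.out.println("tries=" + tries + " -> sleep " + sleepMs + " ms");
>     }
>     // Prints 100, 200, 300, 500, 1000, 2000, 4000, 10000 ms. The 4000 ms value for tries=6
>     // matches the gap between the 15:44:44 and 15:44:48 log lines above, and the 10000 ms
>     // value for tries=7 matches the roughly 10-second wait before the table is finally created.
>   }
> }
> {code}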
> The client has some retries due to a dead server (PleaseHoldException: Master 
> is initializing), then hits an internal error while creating the table, and 
> finally gets the table created successfully. We understand that the waiting 
> time for each retry grows exponentially with the retry count, according to the 
> retry backoff mechanism. However, since the comments before line 151 do not 
> specify how the retry pause time should be determined, we suspect that the 
> retries issued after the server has successfully started should be backed off 
> separately, independent of the retries spent waiting for server startup. To 
> confirm this hypothesis, we looked at other RPC retry implementations in the 
> same version (2.4.2) of HBase: 
> [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RemoteProcedureResultReporter.java#L95-L111]
>  and 
> [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2453-L2475].
> For example, in RemoteProcedureResultReporter, the pause time is chosen based 
> on the failure: it backs off exponentially only while the Master is still 
> coming online (ServerNotRunningYetException or PleaseHoldException), and 
> otherwise falls back to the flat initial pause:
> {code:java}
>    while (!server.isStopped()) {
>       // ...
>       try {
>         server.reportProcedureDone(builder.build());
>         builder.clear();
>         tries = 0;
>       } catch (IOException e) {
>         boolean pause =
>           e instanceof ServerNotRunningYetException || e instanceof PleaseHoldException;
>         long pauseTime;
>         if (pause) {
>           // Do backoff else we flood the Master with requests.
>           pauseTime = ConnectionUtils.getPauseTime(INIT_PAUSE_TIME_MS, tries);
>         } else {
>           pauseTime = INIT_PAUSE_TIME_MS; // Reset.
>         }
>         LOG.info("Failed procedure report " + 
> TextFormat.shortDebugString(request) + "; retry (#" +
>           tries + ")" + (pause ? " after " + pauseTime + "ms delay (Master is 
> coming online...)."
>             : " immediately."),
>           e);
>         Threads.sleep(pauseTime);
>         tries++;
>       }
>     } {code}
> Similarly, we suspect RpcRetryingCallerImpl should treat 
> ServerNotRunningYetException and PleaseHoldException in the same way. We 
> followed this thought, wrote a fix, and found that the symptom no longer 
> appears. We tested both the issue and the fix on versions 2.4.2, 2.4.11, and 
> 3.0.0-alpha-2; we believe the issue exists in all of these versions and that 
> our fix is applicable to all of them.
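> To illustrate the direction only (this is a sketch, not the actual patch from 
> the pull request; `serverIsStarting` and `otherTries` are hypothetical names), 
> the catch block of callWithRetries around line 151 could pick the backoff 
> counter based on the translated exception, mirroring RemoteProcedureResultReporter:
> {code:java}
>         // Sketch only, not the real patch: decide which retry counter drives the backoff.
>         boolean serverIsStarting =
>           t instanceof ServerNotRunningYetException || t instanceof PleaseHoldException;
>         long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause;
>         if (serverIsStarting) {
>           // Master still coming online: keep backing off on the shared counter, as today.
>           expectedSleep = callable.sleep(pauseBase, tries);
>         } else {
>           // Other retriable failures: back off from a separate, hypothetical counter so the
>           // long startup wait does not inflate this pause.
>           expectedSleep = callable.sleep(pauseBase, otherTries++);
>         }
> {code}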
> P.S. In the latest version (3.0.0-alpha-2), RpcRetryingCallerImpl no longer 
> exists; the issue and the fix apply to AsyncRpcRetryingCaller instead.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
