Haoze Wu created HBASE-26955:
--------------------------------

             Summary: Improvement of the pause time between retries in Rpc caller
                 Key: HBASE-26955
                 URL: https://issues.apache.org/jira/browse/HBASE-26955
             Project: HBase
          Issue Type: Improvement
          Components: Client
    Affects Versions: 2.4.2
            Reporter: Haoze Wu


We were using HBase version 2.4.2, with clients submitting requests to an HBase cluster. We observed an unreasonable delay of 10 seconds in some table-creation requests. We analyzed the logs and jstack output and reproduced the issue. The client's jstack shows that this weird long delay comes from [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L168]:

{code:java}
"main" #1 prio=5 os_prio=0 tid=0x00007f027c00b800 nid=0x530d in Object.wait() 
[0x00007f02853b1000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000003d52674f0> (a 
java.util.concurrent.atomic.AtomicBoolean)
        at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:168)
        - locked <0x00000003d52674f0> (a 
java.util.concurrent.atomic.AtomicBoolean)
        at 
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3018)
        at 
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3010)
        at 
org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:648)
        at 
org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:4326)
        at org.apache.hadoop.hbase.client.Admin.createTable(Admin.java:317)
        at 
edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClient.createTable(HBaseGrayClient.java:63)
        at 
edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.create(HBaseGrayClientMain.java:70)
        at 
edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.run(HBaseGrayClientMain.java:157)
        at 
edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.main(HBaseGrayClientMain.java:183)
{code}
This wait comes from the pause calculation in [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L96-L177], which can be simplified as follows:
{code:java}
  public T callWithRetries(RetryingCallable<T> callable, int callTimeout)
  throws IOException, RuntimeException {
    // ...
    for (int tries = 0;; tries++) {
      long expectedSleep;
      try {
        // do the call ...
      } catch (PreemptiveFastFailException e) {
        throw e;
      } catch (Throwable t) {
        ExceptionUtil.rethrowIfInterrupt(t);
        Throwable cause = t.getCause();
        if (cause instanceof DoNotRetryIOException) {  // line 113
          // Fail fast
          throw (DoNotRetryIOException) cause;
        }
        // translateException throws exception when should not retry: i.e. when request is bad.
        interceptor.handleFailure(context, t);
        t = translateException(t);
        if (tries > startLogErrorsCnt) {
          // log ...
        }
        // ...
        // If the server is dead, we need to wait a little before retrying, to give
        // a chance to the regions to be moved
        // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
        // special when encountering CallQueueTooBigException, see #HBASE-17114
        long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause;  // line 151
        expectedSleep = callable.sleep(pauseBase, tries);  // line 152
        // If, after the planned sleep, there won't be enough time left, we stop now.
        long duration = singleCallDuration(expectedSleep);
        if (duration > callTimeout) {
          // throw exception ...
        }
      } finally {
        interceptor.updateFailureInfo(context);
      }
      try {
        if (expectedSleep > 0) {
          synchronized (cancelled) {
            if (cancelled.get()) return null;
            cancelled.wait(expectedSleep);  // line 168
          }
        }
        if (cancelled.get()) return null;
      } catch (InterruptedException e) {
        throw new InterruptedIOException("Interrupted after " + tries
            + " tries while maxAttempts=" + maxAttempts);
      }
    }
  }
{code}
Within the retry loop, the `cause` variable represents the reason for the current failed request, and it is checked to determine what to do next. For example, the case of DoNotRetryIOException is checked at line 113, and the case of CallQueueTooBigException is checked at line 151. Finally, `expectedSleep`, the time to wait before the next retry, is determined at line 152, and the wait is enforced at line 168 unless the call is cancelled.
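
For context, this is our reading of the 2.4.2 code, so treat the exact constants as an assumption: `callable.sleep(pauseBase, tries)` boils down to ConnectionUtils.getPauseTime, which scales the base pause by an entry of HConstants.RETRY_BACKOFF. A minimal sketch of that calculation (ignoring the ~1% jitter) reproduces the 10-second delay we observed:
{code:java}
// Sketch of the client backoff calculation (our reading of 2.4.2, not the exact
// HBase code). Assumes the default hbase.client.pause of 100 ms and the
// RETRY_BACKOFF table from HConstants.
public class BackoffSketch {
  static final long[] RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

  static long getPauseTime(long pause, int tries) {
    // Clamp to the last entry once tries runs past the end of the table.
    int ntries = Math.min(tries, RETRY_BACKOFF.length - 1);
    return pause * RETRY_BACKOFF[ntries];
  }

  public static void main(String[] args) {
    long pauseBase = 100; // ms
    System.out.println(getPauseTime(pauseBase, 6)); // 4000 ms
    System.out.println(getPauseTime(pauseBase, 7)); // 10000 ms -- the 10-second wait in the jstack above
  }
}
{code}
Note how a single extra retry jumps the pause from 4,000 ms to 10,000 ms, even when the later failure has a cause unrelated to the earlier ones.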

The comments seem to imply that this retry pause is intended for the dead-server case:
{code:java}
        // If the server is dead, we need to wait a little before retrying, to give
        // a chance to the regions to be moved
        // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
        // special when encountering CallQueueTooBigException, see #HBASE-17114
        long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause;  // line 151
        expectedSleep = callable.sleep(pauseBase, tries);  // line 152
{code}
However, the client log shows that some retries are not due to a dead server:
{code:java}
2022-04-10 15:44:44,146 INFO  [main] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=16, started=4858 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
        at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:2729)
        at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1985)
        at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
, details=, see https://s.apache.org/timeout
2022-04-10 15:44:48,181 INFO  [main] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=16, started=8893 ms ago, cancelled=false, msg=java.io.IOException: java.io.IOException: IOException
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:463)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
Caused by: java.io.UncheckedIOException: java.io.IOException: IOException
        at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:377)
        at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.runWithoutRpcCall(RegionProcedureStore.java:313)
        at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.update(RegionProcedureStore.java:364)
        at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.insert(RegionProcedureStore.java:323)
        at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.submitProcedure(ProcedureExecutor.java:1047)
        at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil$NonceProcedureRunnable.submitProcedure(MasterProcedureUtil.java:107)
        at org.apache.hadoop.hbase.master.HMaster$4.run(HMaster.java:2010)
        at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.submitProcedure(MasterProcedureUtil.java:134)
        at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1997)
        at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
        ... 3 more
Caused by: java.io.IOException: IOException
        at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:370)
        ... 14 more
, details=, see https://s.apache.org/timeout
2022-04-10 15:44:58,835 INFO  [main] client.HBaseAdmin: Operation: CREATE, Table Name: default:t4, procId: 22 completed
{code}
The client has some retries due to the master still initializing (PleaseHoldException: Master is initializing), then hits an internal error while creating the table, and finally gets the table created successfully. We understand that the waiting time for each retry grows exponentially with the retry count, according to the retry backoff mechanism. However, since the comments before line 151 do not specify how the retry pause time should be determined, we suspect that retries after the server has successfully started should be backed off separately, without counting the retries spent waiting for server startup. To confirm this hypothesis, we looked at other RPC retry implementations in the same version (2.4.2) of HBase: https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RemoteProcedureResultReporter.java#L95-L111 and https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2453-L2475 .

For example, in RemoteProcedureResultReporter, the retry pause is handled separately: exponential backoff is applied only while the server is starting up, and the pause time is reset for other errors:
{code:java}
    while (!server.isStopped()) {
      // ...
      try {
        server.reportProcedureDone(builder.build());
        builder.clear();
        tries = 0;
      } catch (IOException e) {
        boolean pause =
          e instanceof ServerNotRunningYetException || e instanceof PleaseHoldException;
        long pauseTime;
        if (pause) {
          // Do backoff else we flood the Master with requests.
          pauseTime = ConnectionUtils.getPauseTime(INIT_PAUSE_TIME_MS, tries);
        } else {
          pauseTime = INIT_PAUSE_TIME_MS; // Reset.
        }
        LOG.info("Failed procedure report " + TextFormat.shortDebugString(request) + "; retry (#" +
          tries + ")" + (pause ? " after " + pauseTime + "ms delay (Master is coming online...)."
            : " immediately."),
          e);
        Threads.sleep(pauseTime);
        tries++;
      }
    }
{code}
Similarly, we suspect RpcRetryingCallerImpl should handle ServerNotRunningYetException and PleaseHoldException in the same way. We wrote a fix following this idea and found that the symptom no longer occurs. We tested both the issue and the fix in versions 2.4.2, 2.4.11, and 3.0.0-alpha-2; we believe the issue exists in all of these versions and that our fix applies to all of them.
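
One possible shape of such a fix, as a hypothetical sketch of the idea we tested rather than our exact patch (the `startupTries` counter and its placement are our own illustration): inside the catch block of callWithRetries, exclude the retries that were spent waiting for server startup from the backoff index of later, unrelated failures:
{code:java}
// Hypothetical sketch, not the actual patch: track retries consumed while the
// server was still starting, and exclude them from the backoff index once the
// failure is no longer a startup exception.
int startupTries = 0; // maintained across iterations of the retry loop

// ... inside catch (Throwable t) of RpcRetryingCallerImpl.callWithRetries:
boolean serverStarting =
  t instanceof ServerNotRunningYetException || t instanceof PleaseHoldException;
if (serverStarting) {
  startupTries++;
}
long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause;
// Back off on the full count only while the server is starting; afterwards,
// restart the backoff curve from the retries that happened after startup.
int backoffTries = serverStarting ? tries : Math.max(0, tries - startupTries);
expectedSleep = callable.sleep(pauseBase, backoffTries);
{code}
In the log above, the tries=7 failure (an ordinary IOException after the master came online) would then sleep according to a small backoff index instead of RETRY_BACKOFF[7], avoiding the 10-second wait.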


