Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12672 )
Change subject: IMPALA-8143: Enhance DoRpcWithRetry().
......................................................................

Patch Set 9:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.h
File be/src/rpc/rpc-mgr-test.h:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.h@134
PS9, Line 134: get_current_queue_size
Coding style suggests this should be "GetCurrentQueueSize()". That said, it may not be needed after all if you take the suggestion in rpc-mgr-test.cc.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.h@295
PS9, Line 295: kudu::Status::IOError(
: Substitute("ping failing, number of calls=$0.", number_of_calls_));
Instead of IOError, can we simulate the server-busy error here? Please see the comments in rpc-mgr-test.cc for the reason.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.cc
File be/src/rpc/rpc-mgr-test.cc:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.cc@269
PS9, Line 269: until
Typo.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.cc@285
PS9, Line 285: // Find the counter which tracks the number of times the service queue is too full.
: const string& overflow_count = Substitute(
:     ImpalaServicePool::RPC_QUEUE_OVERFLOW_METRIC_KEY, ping_service->service_name());
: IntCounter* overflow_metric =
:     ExecEnv::GetInstance()->rpc_metrics()->FindMetricForTesting<IntCounter>(
:         overflow_count);
: ASSERT_TRUE(overflow_metric != nullptr);
: // There has been no activity yet so this should be 0.
: EXPECT_EQ(overflow_metric->GetValue(), 0L);
:
: // Make a proxy for Ping rpc calls.
: unique_ptr<PingServiceProxy> proxy;
: ASSERT_OK(ping_service->GetProxy(krpc_address_, FLAGS_hostname, &proxy));
: TQueryCtx query_ctx;
:
: // Do an async call that will run in the service's single thread.
: RpcController async_controller;
: CountingBarrier async1_done(1);
: PingRequestPB request1;
: PingResponsePB response1;
: ResponseCallback async1_callback = [&]() { async1_done.Notify(); };
: proxy->PingAsync(request1, &response1, &async_controller, async1_callback);
:
: // Wait for the async call to be running.
: bool timed_out = false;
: call1_started.Wait(10 * MILLIS_PER_SEC, &timed_out);
: ASSERT_FALSE(timed_out);
: ASSERT_EQ(0, get_current_queue_size(rpc_mgr_));
: // Now the async call is running in the service.
:
: // Do a second async call; this will sit in the queue, which will be full.
: RpcController async_controller2;
: CountingBarrier async2_done(1);
: // The second call only has a short sleep.
: PingRequestPB request2;
: PingResponsePB response2;
: // Set the deadline to something reasonable otherwise the pings from DoRpcWithRetry
: // will replace this async call in the queue.
: MonoTime deadline = MonoTime::Now() + MonoDelta::FromSeconds(10);
: async_controller2.set_deadline(deadline);
: ResponseCallback async2_callback = [&]() { async2_done.Notify(); };
: proxy->PingAsync(request2, &response2, &async_controller2, async2_callback);
:
: // Wait for the second async call to get queued.
: int num_sleeps = 20;
: while (get_current_queue_size(rpc_mgr_) != 1) {
:   SleepForMs(300);
:   if (--num_sleeps < 0) {
:     FAIL() << "Waited too long for async message to be queued";
:   }
: }
: // Now the second async call is queued, and the queue is full.
:
: // Assert that there have been no overflows yet.
: EXPECT_EQ(overflow_metric->GetValue(), 0L);
:
: // Call DoRpcWithRetry with retry on our busy service.
: // This won't succeed, but we can observe that retries occurred.
: const int retries = 3;
: PingRequestPB request3;
: PingResponsePB response3;
: impala::Status rpc_status = RpcMgr::DoRpcWithRetry(proxy, &PingServiceProxy::Ping,
:     request3, &response3, query_ctx, "ping failed", retries, 100 * MILLIS_PER_SEC);
: EXPECT_ERROR(rpc_status, TErrorCode::GENERAL);
: EXPECT_STR_CONTAINS(rpc_status.msg().msg(),
:     "dropped due to backpressure. The service queue contains 1 items out of a maximum "
:     "of 1; memory consumption is ");
: ASSERT_EQ(overflow_metric->GetValue(), retries);
Stepping back a bit, it's unclear what the exact purposes of these tests are. They appear to aim to exercise (1) the retry path in the client code and (2) the "server busy" code in the Kudu server. For (2), we already have SlowCallback(), so I don't see the need for another test. It seems we only need a test which exercises the retry-loop logic in the client code, in which case please see the comments below on how the test there can be modified slightly to provide sufficient coverage. If you agree with the suggestion below, we can delete these tests and any auxiliary functions added to support them.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.cc@376
PS9, Line 376: ASSERT_FALSE(inject_status.ok());
: EXPECT_ERROR(inject_status, TErrorCode::INTERNAL_ERROR);
: ASSERT_EQ("Debug Action: DoRpcWithRetry:FAIL", inject_status.msg().msg());
Not entirely sure this test is needed, as the EE test should exercise it already.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.cc@380
PS9, Line 380: // Test how DoRpcWithRetry retries by using a proxy that always fails.
: unique_ptr<FailingPingServiceProxy> failing_proxy =
:     make_unique<FailingPingServiceProxy>();
: // A call that fails is not retried as the server is not busy.
: PingRequestPB request6;
: PingResponsePB response6;
: Status rpc_status_fail =
:     RpcMgr::DoRpcWithRetry(failing_proxy, &FailingPingServiceProxy::Ping, request6,
:         &response6, query_ctx, "ping failed", 3, 10 * MILLIS_PER_SEC);
: ASSERT_FALSE(rpc_status_fail.ok());
: // Assert that proxy was only called once.
: ASSERT_EQ(1, failing_proxy->getNumberOfCalls());
This test seems to be sufficient for the coverage we need if we make the following modifications:

- In FailingPingService (which could be renamed to CountingPingService), define an RPC which takes an integer as the RPC parameter. If the number of cumulative calls to the service is below that specified number, the call returns with the ServerBusy error status (instead of IOError).

- In the test, issue two different RPC calls to the CountingService: one with a parameter large enough that the call will fail after retrying n times, and a subsequent RPC call which will succeed on the (n+m-1)th cumulative call, where n and m are the numbers of retries specified for the two RPC calls respectively. This guarantees that the retry loop in the client code is exercised.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr.inline.h
File be/src/rpc/rpc-mgr.inline.h:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr.inline.h@75
PS9, Line 75: // If the call succeeded then return.
: if (rpc_status.ok()) break;
:
: // If server is not busy then give up and return result to caller.
: if (!RpcMgr::IsServerTooBusy(rpc_controller)) break;
These two checks can be combined into a single early return:

if (rpc_status.ok() || !RpcMgr::IsServerTooBusy(rpc_controller)) {
  return rpc_status;
}

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/runtime/coordinator-backend-state.cc@465
PS9, Line 465: 3, 10 * MILLIS_PER_SEC,
: 3 * MILLIS_PER_SEC,
Nit: it may make for better documentation if these parameters are defined as named variables, like the following:

const int num_retries = 3;
const int64_t timeout_ms = 10 * MILLIS_PER_SEC;
const int64_t backoff_time_ms = 3 * MILLIS_PER_SEC;

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/service/client-request-state.cc
File be/src/service/client-request-state.cc:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/service/client-request-state.cc@684
PS9, Line 684: 3, 10 * MILLIS_PER_SEC,
: 3 * MILLIS_PER_SEC
Nit: same as above, it may make for better documentation if these parameters are defined as named variables:

const int num_retries = 3;
const int64_t timeout_ms = 10 * MILLIS_PER_SEC;
const int64_t backoff_time_ms = 3 * MILLIS_PER_SEC;

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/testutil/fault-injection-util.h
File be/src/testutil/fault-injection-util.h:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/testutil/fault-injection-util.h@37
PS9, Line 37: RPC_TRANSMITDATA,
This can also be removed.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/testutil/fault-injection-util.h@40
PS9, Line 40: // If you add a new RpcCallType then bump the "--fault_injection_rpc_type=X" parameter
: // in the test test_random_rpc_timeout() in test_rpc_timeout.py.
Looks like we forgot to update the tests when adding fault injection for remote shutdowns. Thanks for the new comment.
--
To view, visit http://gerrit.cloudera.org:8080/12672
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia9693151c35e02235665b3c285a48c585973d390
Gerrit-Change-Number: 12672
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tmarsh...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Apr 2019 01:12:33 +0000
Gerrit-HasComments: Yes