Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12672 )
Change subject: IMPALA-8143: Enhance DoRpcWithRetry().
......................................................................

Patch Set 9:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.h
File be/src/rpc/rpc-mgr-test.h:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.h@134
PS9, Line 134: get_current_queue_size
Coding style suggests this should be "GetCurrentQueueSize()". That said, it may not be needed after all if you take the suggestion in rpc-mgr-test.cc.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.h@295
PS9, Line 295: kudu::Status::IOError(
: Substitute("ping failing, number of calls=$0.", number_of_calls_));
Instead of IOError, can we simulate the server-busy error here? Please see the comments in rpc-mgr-test.cc for the reason.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.cc
File be/src/rpc/rpc-mgr-test.cc:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.cc@269
PS9, Line 269: until
Typo.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.cc@285
PS9, Line 285: // Find the counter which tracks the number of times the service queue is too full.
: const string& overflow_count = Substitute(
:     ImpalaServicePool::RPC_QUEUE_OVERFLOW_METRIC_KEY, ping_service->service_name());
: IntCounter* overflow_metric =
:     ExecEnv::GetInstance()->rpc_metrics()->FindMetricForTesting<IntCounter>(
:         overflow_count);
: ASSERT_TRUE(overflow_metric != nullptr);
: // There has been no activity yet so this should be 0.
: EXPECT_EQ(overflow_metric->GetValue(), 0L);
:
: // Make a proxy for Ping rpc calls.
: unique_ptr<PingServiceProxy> proxy;
: ASSERT_OK(ping_service->GetProxy(krpc_address_, FLAGS_hostname, &proxy));
: TQueryCtx query_ctx;
:
: // Do an async call that will run in the service's single thread.
: RpcController async_controller;
: CountingBarrier async1_done(1);
: PingRequestPB request1;
: PingResponsePB response1;
: ResponseCallback async1_callback = [&]() { async1_done.Notify(); };
: proxy->PingAsync(request1, &response1, &async_controller, async1_callback);
:
: // Wait for the async call to be running.
: bool timed_out = false;
: call1_started.Wait(10 * MILLIS_PER_SEC, &timed_out);
: ASSERT_FALSE(timed_out);
: ASSERT_EQ(0, get_current_queue_size(rpc_mgr_));
: // Now the async call is running in the service.
:
: // Do a second async call; this will sit in the queue, which will be full.
: RpcController async_controller2;
: CountingBarrier async2_done(1);
: // The second call only has a short sleep.
: PingRequestPB request2;
: PingResponsePB response2;
: // Set the deadline to something reasonable otherwise the pings from DoRpcWithRetry
: // will replace this async call in the queue.
: MonoTime deadline = MonoTime::Now() + MonoDelta::FromSeconds(10);
: async_controller2.set_deadline(deadline);
: ResponseCallback async2_callback = [&]() { async2_done.Notify(); };
: proxy->PingAsync(request2, &response2, &async_controller2, async2_callback);
:
: // Wait for the second async call to get queued.
: int num_sleeps = 20;
: while (get_current_queue_size(rpc_mgr_) != 1) {
:   SleepForMs(300);
:   if (--num_sleeps < 0) {
:     FAIL() << "Waited too long for async message to be queued";
:   }
: }
: // Now the second async call is queued, and the queue is full.
:
: // Assert that there have been no overflows yet.
: EXPECT_EQ(overflow_metric->GetValue(), 0L);
:
: // Call DoRpcWithRetry with retry on our busy service.
: // This won't succeed, but we can observe that retries occurred.
: const int retries = 3;
: PingRequestPB request3;
: PingResponsePB response3;
: impala::Status rpc_status = RpcMgr::DoRpcWithRetry(proxy, &PingServiceProxy::Ping,
:     request3, &response3, query_ctx, "ping failed", retries, 100 * MILLIS_PER_SEC);
: EXPECT_ERROR(rpc_status, TErrorCode::GENERAL);
: EXPECT_STR_CONTAINS(rpc_status.msg().msg(),
:     "dropped due to backpressure. The service queue contains 1 items out of a maximum "
:     "of 1; memory consumption is ");
: ASSERT_EQ(overflow_metric->GetValue(), retries);
Stepping back a bit, it's unclear what the exact purposes of these tests are. They appear to aim to exercise (1) the retry path in the client code and (2) the "server busy" code in the Kudu server. For (2), we already have SlowCallback(), so I don't see the need for another test. It seems we only need a test which exercises the retry-loop logic in the client code, in which case please see the comments below on how the test there can be modified slightly to provide sufficient coverage. If you agree with the suggestion below, we can delete these tests and any auxiliary functions added to support them.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.cc@376
PS9, Line 376: ASSERT_FALSE(inject_status.ok());
: EXPECT_ERROR(inject_status, TErrorCode::INTERNAL_ERROR);
: ASSERT_EQ("Debug Action: DoRpcWithRetry:FAIL", inject_status.msg().msg());
Not entirely sure this test is needed, as the EE test should exercise it already.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr-test.cc@380
PS9, Line 380: // Test how DoRpcWithRetry retries by using a proxy that always fails.
: unique_ptr<FailingPingServiceProxy> failing_proxy =
:     make_unique<FailingPingServiceProxy>();
: // A call that fails is not retried as the server is not busy.
: PingRequestPB request6;
: PingResponsePB response6;
: Status rpc_status_fail =
:     RpcMgr::DoRpcWithRetry(failing_proxy, &FailingPingServiceProxy::Ping, request6,
:         &response6, query_ctx, "ping failed", 3, 10 * MILLIS_PER_SEC);
: ASSERT_FALSE(rpc_status_fail.ok());
: // Assert that proxy was only called once.
: ASSERT_EQ(1, failing_proxy->getNumberOfCalls());
This test seems to be sufficient for the coverage we need if we make the following modifications:

- In FailingPingService (which could be renamed to CountingPingService), define an RPC which takes an integer as the RPC parameter. If the number of cumulative calls to the service is below that specified number, the call returns with the ServerBusy error status (instead of IOError).

- In the test, issue two different RPC calls to the CountingService: one with a parameter large enough that the call will fail after retrying n times, and a subsequent RPC call which will succeed on the (n+m-1)th cumulative call, where n and m are the numbers of retries specified for the two RPC calls respectively. This guarantees that the retry loop in the client code is exercised.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr.inline.h
File be/src/rpc/rpc-mgr.inline.h:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/rpc/rpc-mgr.inline.h@75
PS9, Line 75: // If the call succeeded then return.
: if (rpc_status.ok()) break;
:
: // If server is not busy then give up and return result to caller.
: if (!RpcMgr::IsServerTooBusy(rpc_controller)) break;
These two checks can be combined into a single early return:

if (rpc_status.ok() || !RpcMgr::IsServerTooBusy(rpc_controller)) {
  return rpc_status;
}

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/runtime/coordinator-backend-state.cc@465
PS9, Line 465: 3, 10 * MILLIS_PER_SEC,
: 3 * MILLIS_PER_SEC,
Nit: it may make for better documentation if these parameters are defined as named variables, like the following:

const int num_retries = 3;
const int64_t timeout_ms = 10 * MILLIS_PER_SEC;
const int64_t backoff_time_ms = 3 * MILLIS_PER_SEC;

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/service/client-request-state.cc
File be/src/service/client-request-state.cc:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/service/client-request-state.cc@684
PS9, Line 684: 3, 10 * MILLIS_PER_SEC,
: 3 * MILLIS_PER_SEC
Nit: same as above, it may make for better documentation if these parameters are defined as named variables:

const int num_retries = 3;
const int64_t timeout_ms = 10 * MILLIS_PER_SEC;
const int64_t backoff_time_ms = 3 * MILLIS_PER_SEC;

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/testutil/fault-injection-util.h
File be/src/testutil/fault-injection-util.h:

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/testutil/fault-injection-util.h@37
PS9, Line 37: RPC_TRANSMITDATA,
This can also be removed.

http://gerrit.cloudera.org:8080/#/c/12672/9/be/src/testutil/fault-injection-util.h@40
PS9, Line 40: // If you add a new RpcCallType then bump the "--fault_injection_rpc_type=X" parameter
: // in the test test_random_rpc_timeout() in test_rpc_timeout.py.
Looks like we forgot to update the tests when adding fault injection for remote shutdowns. Thanks for the new comment.
--
To view, visit http://gerrit.cloudera.org:8080/12672
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia9693151c35e02235665b3c285a48c585973d390
Gerrit-Change-Number: 12672
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tmarsh...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Apr 2019 01:12:33 +0000
Gerrit-HasComments: Yes