Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/12672 )
Change subject: IMPALA-8143: Enhance DoRpcWithRetry(). ...................................................................... Patch Set 5: (20 comments) Thanks for the reviews. Please see patch set 7. http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.h File be/src/rpc/rpc-mgr-test.h: http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.h@296 PS5, Line 296: tha > typo Done http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc File be/src/rpc/rpc-mgr-test.cc: http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@214 PS5, Line 214: ASSERT_OK(static_cast<ScanMemServiceImpl*>(scan_mem_impl) : -> > Not sure if it's output of clang-tidy or something but I find it a bit hard Yes this is clang-format, but I see now how to avoid this getting "fixed". http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@263 PS5, Line 263: shoudl > typo Done http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@291 PS5, Line 291: NULL > nullptr; Done http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@298 PS5, Line 298: RpcController controller; > not used ? yes! http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@316 PS5, Line 316: sleep_ms = 100; > Instead of sharing "sleep_ms" between RPCs, it may be cleaner to pass the s Thanks, this is cleaner http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@322 PS5, Line 322: proxy.get()-> > proxy-> Done http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@333 PS5, Line 333: // Call DoRpcWithRetry with no backoff on our busy service. : const int retries = 3; : impala::Status rpc_status = RpcMgr::DoRpcWithRetry(proxy, &PingServiceProxy::Ping, : request, &response, query_ctx, "ping failed", retries, MonoDelta::FromSeconds(100)); : EXPECT_ERROR(rpc_status, TErrorCode::GENERAL); : EXPECT_STR_CONTAINS(rpc_status.msg().msg(), : "dropped due to backpressure. The service queue contains 1 items out of a maximum " : "of 1; memory consumption is "); : ASSERT_EQ(overflow_metric->GetValue(), retries); : : // Call DoRpcWithRetry, but this time we do backoff on a busy service, and this waiting : // allows the outstanding aysnc calls to complete, so that then this call will succeed. : impala::Status rpc_status_backoff = : RpcMgr::DoRpcWithRetry(proxy, &PingServiceProxy::Ping, request, &response, : query_ctx, "ping failed", 10, MonoDelta::FromSeconds(100), 2); : ASSERT_OK(rpc_status_backoff); : ASSERT_GT(overflow_metric->GetValue(), retries); : : // Check that async calls did complete. : ASSERT_FALSE(async1_done.pending()); : ASSERT_FALSE(async2_done.pending()); > These set of tests seem rather timing dependent. For instance, it may occur I reviewed the test and tried to make to more bulletproof. I think any test that really does fill up the queue and test that the retry works in this case will have some asynchrony. I don't like tests with sleep calls but I think this should be reliable. Let me know if you are not persuaded. http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr-test.cc@378 PS5, Line 378: // If we fail 2x and retry 2x, then the call fails. : unique_ptr<FailingPingServiceProxy> failing_proxy2 = : make_unique<FailingPingServiceProxy>(2); : : Status rpc_status_fail = : RpcMgr::DoRpcWithRetry(failing_proxy2, &FailingPingServiceProxy::Ping, request, : &response, boost::bind(&RpcMgrTest::log, this), query_ctx, "ping failed", 2, : MonoDelta::FromSeconds(10)); > Please see comments in rpc-mgr.inline.h that we probably should only retry Done http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h File be/src/rpc/rpc-mgr.h: http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@178 PS5, Line 178: // A no-op method that is used as part of overloading DoRpcWithRetry(). : static void logging_noop(){}; : : // The type of the log method passed to DoRpcWithRetry(). : typedef boost::function<void()> RpcLogFn; > Not needed as discussed in person. Removed http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@189 PS5, Line 189: class > typename Done http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@191 PS5, Line 191: rpc_call > It's important to highlight that this only works iff rpc_call is idempotent Thanks http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@193 PS5, Line 193: const kudu::MonoDelta& timeout, : const int server_busy_backoff_s = 0 > It seems inconsistent to pass the timeout in as MonoDelta while passing the Done http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@201 PS5, Line 201: const std::unique_ptr<Proxy>& proxy, : const ProxyMethod& rpc_call, const Request& request, Response* response, : const TQueryCtx& query_ctx, const char* error_msg, const int times_to_try, : const kudu::MonoDelta& timeout, const int server_busy_backoff_s = 0, : const char* debug_action = nullptr) > My debug action patch will eliminate the 'error_msg', 'debug_action', and ' OK I'll leave this alone and Thomas will clean up the mess? :-) http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.h@201 PS5, Line 201: const std::unique_ptr<Proxy>& proxy, : const ProxyMethod& rpc_call, const Request& request, Response* response, : const TQueryCtx& query_ctx, const char* error_msg, const int times_to_try, : const kudu::MonoDelta& timeout, const int server_busy_backoff_s = 0, : const char* debug_action = nullptr) > Although this function is inlined, It has grown over time to take too many see my answer to Thomas below http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h File be/src/rpc/rpc-mgr.inline.h: http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@57 PS5, Line 57: const char* debug_action > May make sense to group this argument next to query_ctx as they are related These arguments are at the end because the have default values. And Thomas is going to make query_ctx and debug_action go away so I'll leave these alone for now, http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@61 PS5, Line 61: i++ > ++i Done http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@72 PS5, Line 72: response > Do we need to reset response between retry ? My understanding is that any partial results in in the response will be overwritten during a successful call. What do you think? Should I call Clear() ? http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@76 PS5, Line 76: // Rpc failed, log it. In many cases this will be a no-op. : log(); > Based on our offline discussion, this is not needed. Please consider removi Done http://gerrit.cloudera.org:8080/#/c/12672/5/be/src/rpc/rpc-mgr.inline.h@79 PS5, Line 79: // Sleep if the server is busy and this is not the last time to try : if (server_busy_backoff_s != 0 && i != times_to_try - 1 : && RpcMgr::IsServerTooBusy(rpc_controller)) { : sleep(server_busy_backoff_s); : } > So, it's not clear to me whether we should retry for errors other than RpcM OK, well good question. On reflection I have changed this to only retry if the service is busy. Are we all happy with this? -- To view, visit http://gerrit.cloudera.org:8080/12672 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia9693151c35e02235665b3c285a48c585973d390 Gerrit-Change-Number: 12672 Gerrit-PatchSet: 5 Gerrit-Owner: Andrew Sherman <asher...@cloudera.com> Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Thomas Marshall <tmarsh...@cloudera.com> Gerrit-Comment-Date: Fri, 29 Mar 2019 21:32:41 +0000 Gerrit-HasComments: Yes