Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12260 )
Change subject: IMPALA-7985: Port RemoteShutdown() to KRPC. ...................................................................... Patch Set 2: (11 comments) http://gerrit.cloudera.org:8080/#/c/12260/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12260/1//COMMIT_MSG@13 PS1, Line 13: imapald typo http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/client-request-state.cc File be/src/service/client-request-state.cc: http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/client-request-state.cc@653 PS2, Line 653: // KRPC relies on resolved IP address, so convert hostname. : IpAddr ip_address; : Status ip_status = HostnameToIpAddr(request.backend.hostname, &ip_address); : if (!ip_status.ok()) { : VLOG(1) << "Could not convert hostname " << request.backend.hostname : << " to ip address, error: " << ip_status.GetDetail(); : return ip_status; : } : TNetworkAddress addr = MakeNetworkAddress(ip_address, port); Instead of resolving the IP for the backend, it seems more consistent to re-use some of the logic at the scheduler which already contains a map of all the backend state: https://github.com/apache/impala/blob/master/be/src/scheduling/scheduler.cc#L354-L357 May need some refactoring in the scheduler code to expose the necessary backend states. It's unfortunate that the existing 'executor_config_' uses 'hostname, be_port' as lookup key. May be it makes sense to implement a new function which looks up if there exists a known backend with 'hostname and krpc_address which exposes the port specified' and if so, returns the krpc_address_. If there isn't a known backend which maps to the specified 'host + krpc port', then we should just fail the shutdown request without making the actual RPC. Please feel free to come up with other ideas too. http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/client-request-state.cc@689 PS2, Line 689: if (msg.find("RemoteShutdown() RPC failed: Timed out: connection negotiation to") : != string::npos) { If we take the route of using existing backend states suggested above, the case of invalid port should be impossible so we shouldn't need to call this out like below here. http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/control-service.h File be/src/service/control-service.h: http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/control-service.h@79 PS2, Line 79: /// Retry the Rpc 'rpc_call' up to 3 times. : /// Pass 'debug_action' to DebugAction() to potentially inject errors. : template <typename F> : static Status DoRpcWithRetry(F&& rpc_call, const TQueryCtx& query_ctx, : const char* debug_action, const char* error_msg) { Not specifically related to this change but it may be useful to factor this retry function to generic RPC code in rpc-mgr.inline.h. Also, if the RPC fails due to being overloaded, it may make sense to have some sleep time between retries. FWIW, we had some retry logic in KrpcDataStreamSender too and I believe we also had similar retry logic for the status reporting code. It may be worth the effort to consider refactoring these code so we don't implement the same retry logic at multiple places: https://github.com/apache/impala/blob/master/be/src/runtime/krpc-data-stream-sender.cc#L374-L386 https://github.com/apache/impala/blob/master/be/src/runtime/query-state.cc#L332-L336 http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/control-service.h@87 PS2, Line 87: MonoDelta::FromSeconds(10) Not sure if it makes sense to hard code the timeout here as it seems to be utility function used by different RPC calls. Would it be better for the timeout to be one of the function parameter ? At the minimum, if we decide that 10 seconds is the timeout for all RPCs in ControlService, we should at least define it as some constant or #define to make it more explicit. I have a similar question about the number of retries too (which is hardcoded to 3). http://gerrit.cloudera.org:8080/#/c/12260/1/be/src/service/control-service.h File be/src/service/control-service.h: http://gerrit.cloudera.org:8080/#/c/12260/1/be/src/service/control-service.h@81 PS1, Line 81: name F> typo ? DoRpcWithRetry ? http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/control-service.cc File be/src/service/control-service.cc: http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/control-service.cc@182 PS2, Line 182: class Why is this needed ? http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/control-service.cc@183 PS2, Line 183: class Why is this needed ? http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/impala-server.cc@2435 PS2, Line 2435: "queries registered on coordinator: $2, queries executing: $3, " The original formatting seems fine to me. http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/impala-server.cc@2481 PS2, Line 2481: if (set_deadline) : shutdown_status->set_deadline_remaining_ms(relative_deadline_s * 1000L); Coding style nit: if (set_deadline) { .... } http://gerrit.cloudera.org:8080/#/c/12260/2/be/src/service/impala-server.cc@2489 PS2, Line 2489: ShutdownStatusPB const ShutdownStatusPB& -- To view, visit http://gerrit.cloudera.org:8080/12260 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4fd00ee4e638f5e71e27893162fd65501ef9e74e Gerrit-Change-Number: 12260 Gerrit-PatchSet: 2 Gerrit-Owner: Andrew Sherman <asher...@cloudera.com> Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Thomas Marshall <thomasmarsh...@cmu.edu> Gerrit-Comment-Date: Tue, 29 Jan 2019 07:56:21 +0000 Gerrit-HasComments: Yes