Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/16412 )
Change subject: IMPALA-9930 (part 2): Introduce new admission control rpc service ...................................................................... Patch Set 8: (4 comments) http://gerrit.cloudera.org:8080/#/c/16412/7/be/src/scheduling/remote-admission-control-client.cc File be/src/scheduling/remote-admission-control-client.cc: http://gerrit.cloudera.org:8080/#/c/16412/7/be/src/scheduling/remote-admission-control-client.cc@121 PS7, Line 121: admit_status = Status(get_status_resp.status()); > Is there a JIRA for this? Might be an obstacle for production-readiness for As discussed, switched to long polling which should mean this only ever comes into play for queries that are queued for a non-trivial amount of time, in which case its somewhat less important. Also made it configurable just for good measure. http://gerrit.cloudera.org:8080/#/c/16412/7/be/src/scheduling/remote-admission-control-client.cc@184 PS7, Line 184: ReleaseQueryBackendsResponsePB resp; > Just thinking, but I think something like retrying the RPC then giving up a Added retries here and for the other similar rpcs here. As mentioned elsewhere, I think the cleanup mechanism will be a KeepAlive rpc that will be added in followup work. http://gerrit.cloudera.org:8080/#/c/16412/7/common/protobuf/admission_control_service.proto File common/protobuf/admission_control_service.proto: http://gerrit.cloudera.org:8080/#/c/16412/7/common/protobuf/admission_control_service.proto@233 PS7, Line 233: /// admission and the query getting released. > Does this return immediately and require the client to throttle polling, or As discussed offline, decided to go with long-polling. For fault tolerance, I think the plan is to add a specific KeepAlive rpc which coordinators will periodically send to the admission controller with a list of active query ids. That should be pretty light weight and cover any issues from the coordinator failing or from small rpc errors that get the coordinator and admission controller out of sync. For failures of the admission controller itself, I think the plan is to store the info needed to reconstruct the admission controller's state in the statestored. http://gerrit.cloudera.org:8080/#/c/16412/7/tests/custom_cluster/test_admission_controller.py File tests/custom_cluster/test_admission_controller.py: http://gerrit.cloudera.org:8080/#/c/16412/7/tests/custom_cluster/test_admission_controller.py@1303 PS7, Line 1303: class TestAdmissionControllerWithACService(TestAdmissionController): > How much time does this add to exhaustive tests? Because of the reduction in dimensions for the stress test, this patch actually reduces the time to run this test file in exhaustive - from ~78 minutes to ~62 minutes (on my machine). -- To view, visit http://gerrit.cloudera.org:8080/16412 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I594fc593a27b24b6952e381a9bc1a9a5c6b757ae Gerrit-Change-Number: 16412 Gerrit-PatchSet: 8 Gerrit-Owner: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Wed, 18 Nov 2020 22:46:48 +0000 Gerrit-HasComments: Yes