Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16412 )

Change subject: IMPALA-9930 (part 2): Introduce new admission control rpc 
service
......................................................................


Patch Set 8:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16412/7/be/src/scheduling/remote-admission-control-client.cc
File be/src/scheduling/remote-admission-control-client.cc:

http://gerrit.cloudera.org:8080/#/c/16412/7/be/src/scheduling/remote-admission-control-client.cc@121
PS7, Line 121:     admit_status = Status(get_status_resp.status());
> Is there a JIRA for this? Might be an obstacle for production-readiness for
As discussed, switched to long polling which should mean this only ever comes 
into play for queries that are queued for a non-trivial amount of time, in 
which case its somewhat less important. Also made it configurable just for good 
measure.


http://gerrit.cloudera.org:8080/#/c/16412/7/be/src/scheduling/remote-admission-control-client.cc@184
PS7, Line 184:   ReleaseQueryBackendsResponsePB resp;
> Just thinking, but I think something like retrying the RPC then giving up a
Added retries here and for the other similar rpcs here.

As mentioned elsewhere, I think the cleanup mechanism will be a KeepAlive rpc 
that will be added in followup work.


http://gerrit.cloudera.org:8080/#/c/16412/7/common/protobuf/admission_control_service.proto
File common/protobuf/admission_control_service.proto:

http://gerrit.cloudera.org:8080/#/c/16412/7/common/protobuf/admission_control_service.proto@233
PS7, Line 233:   /// admission and the query getting released.
> Does this return immediately and require the client to throttle polling, or
As discussed offline, decided to go with long-polling.

For fault tolerance, I think the plan is to add a specific KeepAlive rpc which 
coordinators will periodically send to the admission controller with a list of 
active query ids. That should be pretty light weight and cover any issues from 
the coordinator failing or from small rpc errors that get the coordinator and 
admission controller out of sync.

For failures of the admission controller itself, I think the plan is to store 
the info needed to reconstruct the admission controller's state in the 
statestored.


http://gerrit.cloudera.org:8080/#/c/16412/7/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/16412/7/tests/custom_cluster/test_admission_controller.py@1303
PS7, Line 1303: class 
TestAdmissionControllerWithACService(TestAdmissionController):
> How much time does this add to exhaustive tests?
Because of the reduction in dimensions for the stress test, this patch actually 
reduces the time to run this test file in exhaustive - from ~78 minutes to ~62 
minutes (on my machine).



--
To view, visit http://gerrit.cloudera.org:8080/16412
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I594fc593a27b24b6952e381a9bc1a9a5c6b757ae
Gerrit-Change-Number: 16412
Gerrit-PatchSet: 8
Gerrit-Owner: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Wed, 18 Nov 2020 22:46:48 +0000
Gerrit-HasComments: Yes

Reply via email to