Andrew Sherman has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17332 )

Change subject: IMPALA-9155: Add recovery mechanism to admission service
......................................................................


Patch Set 1:

(10 comments)

Some more comments

http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG@9
PS1, Line 9: Major changes:
I wrote my own version of the commit message to see that I understand the 
change. Is this correct? Feel free to use this

All heartbeats from the coordinator already contain a unique id for the 
coordinator.
The admissiond uses this unique id to track the coordinators that have sent 
their full admission state. When the admissiond starts up (in particular when 
it restarts) it will not not service any requests from coordinators until it 
has received the full admission state of that coordinator. The trigger for 
coordinators to resend their full admission state is a new flag set in replies 
to the AdmissionHeartbeat messages that every coordinator periodically sends to 
the admissiond. If the new flag is set in the reply that the coordinator 
receives then the full admission state is sent to the admissiond.

Note that when the admissiond restarts it will service requests from any 
coordinator that has sent its full admission state. This means that 
over-admission is possible during the period when not every coordinator has 
sent its full admission state.


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-client.cc
File be/src/scheduling/admission-control-client.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-client.cc@32
PS1, Line 32:     "Re-submit for admission due to a possible admission service 
restart";
Is there a reason it is only a "possible" restart?


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc
File be/src/scheduling/admission-control-service.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@304
PS1, Line 304:     LOG(WARNING) << "Received heartbeat from unrecognized 
coord_id=" << req->coord_id();
Maybe WARNING is too high, we do expect to see this after a restart, maybe just 
make the message say that?


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@332
PS1, Line 332:       LOG(INFO) << "Query id already exists. Can be lingering 
from old coord_id="
Is it possible that the admissionstate we have just received is more up-to-date 
than that in admission_state_map_ ?


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@449
PS1, Line 449:   LOG(INFO) << "Adding to list of known coordinators, coord_id=" 
<< coord_id;
The size of known_coord_ids_ is interesting, maybe add it to the message.


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.h
File be/src/scheduling/remote-admission-control-client.h:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.h@54
PS1, Line 54:   /// TODO: add info on what this does? here or in the class 
comments
Yes it seems like these methods don't have descriptions


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc
File be/src/scheduling/remote-admission-control-client.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc@131
PS1, Line 131:         retry_admission = true;
I think this was already set to true?


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc@158
PS1, Line 158:     //is network error or admisiond did not recognize this coord.
Nit: "admissiond"


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1604
PS1, Line 1604:   def __compare_admission_json(self, json1, json2):
This is a clever way to test this change.


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1606
PS1, Line 1606:     # that are updated immediately and admission changes are 
compared.
"and admission changes" is slightly unclear, maybe "is changed by admissiond"



--
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 1
Gerrit-Owner: Bikramjeet Vig <bikramjeet....@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Comment-Date: Fri, 28 May 2021 19:46:41 +0000
Gerrit-HasComments: Yes

Reply via email to