Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10060 )
Change subject: [Preview]IMPALA-5216: Make admission control queuing async ...................................................................... Patch Set 2: (25 comments) http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/runtime/coordinator.h@329 PS2, Line 329: bool execution_started_ = false; leftover? http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/scheduling/admission-controller.h File be/src/scheduling/admission-controller.h: http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/scheduling/admission-controller.h@45 PS2, Line 45: AdmissionStatus Maybe it should be AdmissionOutcome? This is really the final outcome of admission rather than the intermediate states, right? http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/scheduling/admission-controller.h@199 PS2, Line 199: /// returns an OK status, schedule->is_admitted() is true and admit_status is ADMITTED. Is (admit_status.IsSet() && admit_status.Get() == ADMITTED) equivalent to schedule->is_admitted()? If so we should just remove the redundancy somehow. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/scheduling/admission-controller.h@395 PS2, Line 395: Promise<AdmissionStatus>* admit_status; What owns the memory of admit_status? http://gerrit.cloudera.org:8080/#/c/10060/1/be/src/scheduling/admission-controller.cc File be/src/scheduling/admission-controller.cc: http://gerrit.cloudera.org:8080/#/c/10060/1/be/src/scheduling/admission-controller.cc@518 PS1, Line 518: admit_status->TrySet(AdmissionStatus::REJECTED_OR_TIMED_OUT); > This was my initial intuition too, but like you mentioned, I saw that sched Yeah, I actually think it would be good to remove the is_admitted_ flag out of QuerySchedule to make it more consistent. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/scheduling/admission-controller.cc File be/src/scheduling/admission-controller.cc: http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/scheduling/admission-controller.cc@505 PS2, Line 505: lock_guard<mutex> lock(admission_ctrl_lock_); Do we need to worry about races between admit_status being set to CANCELLED and it being set below? What happens if the TrySet(ADMITTED) call below fails because the query was cancelled? Are the admission results, etc correct? http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/scheduling/admission-controller.cc@576 PS2, Line 576: admit_status->TrySet(AdmissionStatus::REJECTED_OR_TIMED_OUT); It would be nice to log the outcome of admission in all cases. Maybe we should have a class that is a wrapper around the promise and guarantees that the outcome of admission is logged? http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/scheduling/admission-controller.cc@595 PS2, Line 595: if (queue->Remove(&queue_node)) { One line? http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/scheduling/admission-controller.cc@885 PS2, Line 885: // TODO: Maybe dont even check cancelled here, just try admitting it and let the +1 to removing a code path if it isn't totally necessary. As long as we catch it before starting fragments it seems ok to detect the cancellation slightly later. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/scheduling/admission-controller.cc@901 PS2, Line 901: DCHECK We should really log the actual value if the DCHECK fails. I wish we have a convenience macro for comparing two strongly-typed enums. Might be worth adding? http://gerrit.cloudera.org:8080/#/c/10060/1/be/src/scheduling/query-schedule.h File be/src/scheduling/query-schedule.h: http://gerrit.cloudera.org:8080/#/c/10060/1/be/src/scheduling/query-schedule.h@274 PS1, Line 274: > the only reason I kept this was because the coordinator has a const referen Can the coordinator have a reference to Promise<AdmitStatus> then? Having the redundant state and inconsistency between them seems confusing. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.h@279 PS2, Line 279: /// admit or reject or mark timed out) or on cancellation of query (to mark cancelled). Maybe slightly expand on how cancellation works. I think a few important things are: * It can be called from client-RPC-handling threads. * Cancellation is performed by AdmitQuery() or DequeueLoop() asynchronously. * TrySet(CANCELLED) ensures async cancellation of the query is queued or about to be queued or is a no-op if the query has already been admitted. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.h@282 PS2, Line 282: Promise<AdmissionStatus> admit_status_; After reading through the rest of the patch I feel like it might be beneficial to have a lightweight abstraction around admit_status_ to force callers to use it in the right way. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.h@318 PS2, Line 318: Exec Exec() http://gerrit.cloudera.org:8080/#/c/10060/1/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/10060/1/be/src/service/client-request-state.h@316 PS1, Line 316: /// after Exec() has been successfully called on it (that is, if 'coord_exec_started_' > I see. would you suggest I create an AtomicBool class like what we have for I think that would work. It would also be ok if it was a wrapper around std::atomic<bool> that only exposed a subset of methods. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.cc File be/src/service/client-request-state.cc: http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.cc@906 PS2, Line 906: if(uses_admission_control()){ Nit: missing whitespace. Fits on one line? http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.cc@909 PS2, Line 909: coordinator = coord(); Does this line need to be in the critical section? It seems like we don't need the coordinator until l920 anyway so we could initialise the variable down there. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.cc@1121 PS2, Line 1121: nullptr nullptr arg not needed http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.cc@1146 PS2, Line 1146: if (exec_env_->admission_controller() != nullptr) { Let's remove the exec_env_ member and use ExecEnv::GetInstance() while we're here. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.cc@1147 PS2, Line 1147: status It's a bit hard to follow the control flow around 'status'. I wonder if it would be clearer to have separate variables for the outcome of each status-returning operation, since it seems like the result of AdmitQuery() is handled differently from the result of coord->Exec(). http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.cc@1151 PS2, Line 1151: if (int sleep_ms = GetSleepTimeFromDebugOptions( We generally avoid assigning variables inside if statements. This isn't too bad but it seems better just to initialize the variable in a separate statement. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.cc@1159 PS2, Line 1159: // TODO: this is a very small window in practice, not sure if its worth the So it's very unlikely that this would save us work? It looks like it could save us a lot of work in coord->Exec() but if it's very unlikely to hit it then we can just remove it. I think it's more important to keep the code simple and correct than it is to minimise the latency of cancellation. Maybe we can think about this in terms of latency of cancellation? Cancellation generally would be user or admin-initiated and I think having a 1s latency between cancellation being initiated and it taking effect would be acceptable. Lower is of course better. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.cc@1178 PS2, Line 1178: if (int sleep_ms = GetSleepTimeFromDebugOptions( Same comment about if statement. http://gerrit.cloudera.org:8080/#/c/10060/2/be/src/service/client-request-state.cc@1198 PS2, Line 1198: if(cancelled) { whitespace http://gerrit.cloudera.org:8080/#/c/10060/1/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/10060/1/be/src/service/impala-hs2-server.cc@465 PS1, Line 465: // TODO: Does it make sense to change "in_flight_queries" to "initialized_queries". > Agreed. "submitted_queries" sounds perfect. However I will make changes in I'm ok with deferring the work to a later cleanup patch too so long as we document inflight_queries_ with a comment that explains what it means :) -- To view, visit http://gerrit.cloudera.org:8080/10060 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I989cf5b259afb8f5bc5c35590c94961c81ce88bf Gerrit-Change-Number: 10060 Gerrit-PatchSet: 2 Gerrit-Owner: Bikramjeet Vig <bikramjeet....@cloudera.com> Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Thu, 19 Apr 2018 23:09:59 +0000 Gerrit-HasComments: Yes