Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12299 )
Change subject: IMPALA-2990: timeout unresponsive queries in coordinator ...................................................................... Patch Set 2: (5 comments) http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator-backend-state.h File be/src/runtime/coordinator-backend-state.h: http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator-backend-state.h@137 PS2, Line 137: boost::lock_guard<boost::mutex> l(lock_); In theory, this doesn't need any locking as read / write of a 64-bit word should be atomic as long as its' 64-bit aligned. http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator-backend-state.h@153 PS2, Line 153: /// Returns a timestamp for use in tracking arrival of status reports. May help to state explicitly that this uses the monotonic time instead of Unix time. http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator-backend-state.cc File be/src/runtime/coordinator-backend-state.cc: http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator-backend-state.cc@89 PS2, Line 89: last_report_time_ms_ = GenerateReportTimestamp(); Should this clock not be started until we have issued all the initial ExecFInstance() RPC ? I can see there may be cases in which starting a query in a large cluster can take a bit of time so we may be more prone to false positive. http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator.cc@747 PS2, Line 747: for (BackendState* backend_state : backend_states_) { Is there any race with the init code which populates backend_states_ ? Shouldn't the detection be skipped until the backend_states_ be fully populated ? Does this handle the case in which a backend is done so no more new report will be received ? http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/service/impala-server.cc@215 PS2, Line 215: (max_report_lag_s, 30, Shouldn't the lag be dependent on the reporting interval ? Ideally, we should mark the query as hung after x * reporting interval has elapsed without receiving a report. In addition, we need to handle the case of periodic reporting being disabled by the user. In that case, we probably need to disable this detection. -- To view, visit http://gerrit.cloudera.org:8080/12299 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I196c8c6a5633b1960e2c3a3884777be9b3824987 Gerrit-Change-Number: 12299 Gerrit-PatchSet: 2 Gerrit-Owner: Thomas Marshall <thomasmarsh...@cmu.edu> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com> Gerrit-Reviewer: Thomas Marshall <thomasmarsh...@cmu.edu> Gerrit-Comment-Date: Thu, 31 Jan 2019 20:18:42 +0000 Gerrit-HasComments: Yes