Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12299 )

Change subject: IMPALA-2990: timeout unresponsive queries in coordinator
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator-backend-state.h@137
PS2, Line 137: boost::lock_guard<boost::mutex> l(lock_);
In theory, this doesn't need any locking as read / write of a 64-bit word 
should be atomic as long as its' 64-bit aligned.


http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator-backend-state.h@153
PS2, Line 153:   /// Returns a timestamp for use in tracking arrival of status 
reports.
May help to state explicitly that this uses the monotonic time instead of Unix 
time.


http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator-backend-state.cc@89
PS2, Line 89: last_report_time_ms_ = GenerateReportTimestamp();
Should this clock not be started until we have issued all the initial 
ExecFInstance() RPC ? I can see there may be cases in which starting a query in 
a large cluster can take a bit of time so we may be more prone to false 
positive.


http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/runtime/coordinator.cc@747
PS2, Line 747: for (BackendState* backend_state : backend_states_) {
Is there any race with the init code which populates backend_states_ ? 
Shouldn't the detection be skipped until the backend_states_ be fully populated 
?

Does this handle the case in which a backend is done so no more new report will 
be received ?


http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/12299/2/be/src/service/impala-server.cc@215
PS2, Line 215: (max_report_lag_s, 30,
Shouldn't the lag be dependent on the reporting interval ? Ideally, we should 
mark the query as hung after x * reporting interval has elapsed without 
receiving a report.

In addition, we need to handle the case of periodic reporting being disabled by 
the user. In that case, we probably need to disable this detection.



--
To view, visit http://gerrit.cloudera.org:8080/12299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I196c8c6a5633b1960e2c3a3884777be9b3824987
Gerrit-Change-Number: 12299
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Marshall <thomasmarsh...@cmu.edu>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <thomasmarsh...@cmu.edu>
Gerrit-Comment-Date: Thu, 31 Jan 2019 20:18:42 +0000
Gerrit-HasComments: Yes

Reply via email to