Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc@563
PS2, Line 563: FLAGS_status_report_interval_ms * (num_failed_reports_ + 1))) {
> This backoff time seems way larger than the retry interval used when failin
Some more thoughts on this:

The behavior in this patch is that we will back off on the intermediate reports 
with at least 5 seconds (or whatever FLAGS_status_report_interval_ms is set to) 
but then we only spin for about 300ms (or whatever 
FLAGS_report_status_retry_interval_ms is set to) when it fails to be sent. This 
behavior seems rather complicated without much benefit. Having 2 different 
flags controlling the behavior also makes it hard to use and reason about the 
behavior.

An executor is merely using this periodic status report as a proxy to determine 
the liveness of the coordinator.  So, wouldn't the behavior be simpler and more 
consistent if an executor just considers the coordinator as dysfunctional after 
 FLAGS_status_report_max_retries failed reports in a row. In other words, an 
executor provides a total time allowance of  FLAGS_status_report_max_retries * 
FLAGS_report_interval_ms (or potentially longer with exponential backoff) for 
coordinator to respond before giving up.


With the above policy, we don't distinguish between intermediate report and 
final report for the backoff behavior. A clock will be started on the first 
failure and an executor gives up after the time allowance has passed.

As a side note, an executor should ideally be able to rely on statestore to 
provide proper cluster membership but the reality is that it's insufficient for 
determining the health of a node (see IMPALA-7872). So, we fall back to this 
poor-man approach of using periodic status report as a heartbeat to probe the 
coordinator.



--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 2
Gerrit-Owner: Thomas Marshall <thomasmarsh...@cmu.edu>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <thomasmarsh...@cmu.edu>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Dec 2018 20:15:17 +0000
Gerrit-HasComments: Yes

Reply via email to