Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12049 )
Change subject: IMPALA-4555: Make QueryState's status reporting more robust ...................................................................... Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/12049/2/be/src/runtime/query-state.cc@563 PS2, Line 563: FLAGS_status_report_interval_ms * (num_failed_reports_ + 1))) { > This backoff time seems way larger than the retry interval used when failin Some more thoughts on this: The behavior in this patch is that we will back off on the intermediate reports with at least 5 seconds (or whatever FLAGS_status_report_interval_ms is set to) but then we only spin for about 300ms (or whatever FLAGS_report_status_retry_interval_ms is set to) when it fails to be sent. This behavior seems rather complicated without much benefit. Having 2 different flags controlling the behavior also makes it hard to use and reason about the behavior. An executor is merely using this periodic status report as a proxy to determine the liveness of the coordinator. So, wouldn't the behavior be simpler and more consistent if an executor just considers the coordinator as dysfunctional after FLAGS_status_report_max_retries failed reports in a row. In other words, an executor provides a total time allowance of FLAGS_status_report_max_retries * FLAGS_report_interval_ms (or potentially longer with exponential backoff) for coordinator to respond before giving up. With the above policy, we don't distinguish between intermediate report and final report for the backoff behavior. A clock will be started on the first failure and an executor gives up after the time allowance has passed. As a side note, an executor should ideally be able to rely on statestore to provide proper cluster membership but the reality is that it's insufficient for determining the health of a node (see IMPALA-7872). So, we fall back to this poor-man approach of using periodic status report as a heartbeat to probe the coordinator. -- To view, visit http://gerrit.cloudera.org:8080/12049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971 Gerrit-Change-Number: 12049 Gerrit-PatchSet: 2 Gerrit-Owner: Thomas Marshall <thomasmarsh...@cmu.edu> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Thomas Marshall <thomasmarsh...@cmu.edu> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Thu, 20 Dec 2018 20:15:17 +0000 Gerrit-HasComments: Yes