Hello Michael Ho, Thomas Marshall, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12082

to look at the new patch set (#8).

Change subject: IMPALA-7931: fix executor shutdown races
......................................................................

IMPALA-7931: fix executor shutdown races

There were two races:
* queries were cancelled when the statestore detected an impalad as
  failed, even if the query had already finished executing on that
  impalad.
* NUM_FRAGMENTS_IN_FLIGHT was used to detect the backend being
  idle, but it was decremented before the final status report
  was sent.
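
To illustrate the second race, here is a minimal sketch, not code from
this patch. ExecutorSketch and its members are hypothetical names; only
the role of NUM_FRAGMENTS_IN_FLIGHT is taken from the description above.
It assumes the counter drops when a fragment finishes, which happens
before the final status report goes out.

```cpp
#include <cassert>

// Hypothetical stand-in for an executor's bookkeeping; not Impala's classes.
struct ExecutorSketch {
  int fragments_in_flight = 0;     // plays the role of NUM_FRAGMENTS_IN_FLIGHT
  bool final_report_sent = false;  // set once the final status report is sent

  void StartFragment() {
    ++fragments_in_flight;
    final_report_sent = false;
  }

  // The fragment finishes executing. The counter is decremented here,
  // before the final status report has been sent.
  void FinishFragment() { --fragments_in_flight; }

  void SendFinalReport() { final_report_sent = true; }

  // Idle check that looks only at the fragment counter.
  bool LooksIdle() const { return fragments_in_flight == 0; }
};
```

Between FinishFragment() and SendFinalReport(), LooksIdle() returns true
even though a report is still pending, so a shutdown that relies on this
check could proceed too early.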

The fixes are:
* keep track of the backends that triggered the potential cancellation,
  and only proceed with the cancellation if the coordinator has fragments
  still executing on the backend.
* add a new metric that keeps track of the number of executing queries,
  which isn't decremented until the final status report is sent.
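
The two fixes can be sketched roughly as follows. This is an
illustration under stated assumptions, not the implementation in this
patch: CoordinatorSketch, ExecutorMetricsSketch, and every member name
are hypothetical, and backends are identified by plain strings for
simplicity.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical coordinator-side view of one query.
struct CoordinatorSketch {
  // Backends on which this query still has fragments executing.
  std::set<std::string> backends_still_executing;

  // Proceed with cancellation only if at least one of the backends that
  // triggered it is still executing fragments for this query.
  bool ShouldCancel(const std::vector<std::string>& failed_backends) const {
    for (const std::string& backend : failed_backends) {
      if (backends_still_executing.count(backend) > 0) return true;
    }
    return false;
  }
};

// Hypothetical executor-side counter standing in for the new metric.
struct ExecutorMetricsSketch {
  int queries_executing = 0;

  void QueryStarted() { ++queries_executing; }

  // Decrement only after the final status report has been sent, so the
  // backend is not considered idle while a report is still pending.
  void FinalReportSent() { --queries_executing; }

  bool IsIdle() const { return queries_executing == 0; }
};
```

With this shape, a failure report for a backend that has already
finished its part of the query leaves the query alone, and the idle
check stays false until the last report is out.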

Also do some cleanup/improvements in this code:
* use proper error codes for some errors
* add more overloads for Status::Expected()
* add a metric for the total number of queries executed on the
  backend

Testing:
Add a new version of test_shutdown_executor with delays that
trigger both races. This test runs only in exhaustive mode to avoid
adding ~20s to the core build time.

Ran exhaustive tests.

Looped test_restart_services overnight.

Change-Id: I7c1a80304cb6695d228aca8314e2231727ab1998
---
M be/src/common/status.cc
M be/src/common/status.h
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
A be/src/service/cancellation-work.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/generate_error_codes.py
M common/thrift/metrics.json
M tests/custom_cluster/test_restart_services.py
17 files changed, 498 insertions(+), 178 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/12082/8
--
To view, visit http://gerrit.cloudera.org:8080/12082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7c1a80304cb6695d228aca8314e2231727ab1998
Gerrit-Change-Number: 12082
Gerrit-PatchSet: 8
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <thomasmarsh...@cmu.edu>
