Tim Armstrong has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/12082 )
Change subject: IMPALA-7931: fix executor shutdown races
......................................................................

IMPALA-7931: fix executor shutdown races

There were two races:
* queries were terminated because of an impalad being detected as
  failed by the statestore even if the query had finished executing on
  that impalad.
* NUM_FRAGMENTS_IN_FLIGHT was used to detect the backend being idle,
  but it was decremented before the final status report was sent.

The fixes are:
* keep track of the backends that triggered the potential cancellation,
  and only proceed with the cancellation if the coordinator has
  fragments still executing on the backend.
* add a new metric that keeps track of the number of executing queries,
  which isn't decremented until the final status report is sent.

Also do some cleanup/improvements in this code:
* use proper error codes for some errors
* more overloads for Status::Expected()
* also add a metric for the total number of queries executed on the
  backend

Testing:
Add a new version of test_shutdown_executor with delays that trigger
both races. This test only runs in exhaustive to avoid adding ~20s to
core build time.

Ran exhaustive tests.

Looped test_restart_services overnight.
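
For readers skimming the message, the two fixes described above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: CoordinatorSketch, BackendMetricsSketch and every member and function name below are hypothetical stand-ins, and the real logic lives in the files listed in the diffstat.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for a coordinator's per-backend bookkeeping.
// None of these names come from the Impala code base.
class CoordinatorSketch {
 public:
  void AddBackend(const std::string& addr, int num_fragments) {
    fragments_executing_[addr] = num_fragments;
  }

  // Called once a fragment on 'addr' has finished executing.
  void FragmentDone(const std::string& addr) {
    auto it = fragments_executing_.find(addr);
    if (it != fragments_executing_.end() && it->second > 0) --it->second;
  }

  // Fix 1: cancel only if at least one of the backends reported as failed
  // still has fragments executing for this query.
  bool ShouldCancel(const std::unordered_set<std::string>& failed) const {
    for (const auto& entry : fragments_executing_) {
      if (failed.count(entry.first) > 0 && entry.second > 0) return true;
    }
    return false;
  }

 private:
  std::unordered_map<std::string, int> fragments_executing_;
};

// Hypothetical stand-in for the backend-side counters.
struct BackendMetricsSketch {
  int fragments_in_flight = 0;  // drops before the final report is sent
  int queries_executing = 0;    // drops only after the final report is sent

  // Fix 2: idleness is judged on the counter that outlives the final report.
  bool IsIdle() const { return queries_executing == 0; }
};

inline void StartQuery(BackendMetricsSketch& m, int fragments) {
  m.fragments_in_flight += fragments;
  ++m.queries_executing;
}

inline void FinishFragments(BackendMetricsSketch& m, int fragments) {
  m.fragments_in_flight -= fragments;
}

inline void FinalReportSent(BackendMetricsSketch& m) {
  --m.queries_executing;
}
```

In this sketch, a backend that is reported as failed after its fragments completed no longer causes cancellation, and the backend is not considered idle in the window between the last fragment finishing and the final status report going out.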

Change-Id: I7c1a80304cb6695d228aca8314e2231727ab1998
---
M be/src/common/status.cc
M be/src/common/status.h
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
A be/src/service/cancellation-work.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/generate_error_codes.py
M common/thrift/metrics.json
M tests/custom_cluster/test_restart_services.py
17 files changed, 437 insertions(+), 116 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/12082/6
--
To view, visit http://gerrit.cloudera.org:8080/12082
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7c1a80304cb6695d228aca8314e2231727ab1998
Gerrit-Change-Number: 12082
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>