Hello Thomas Tauber-Marshall, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/14824

to look at the new patch set (#23).

Change subject: IMPALA-9199: Add support for single query retries on cluster membership changes
......................................................................

IMPALA-9199: Add support for single query retries on cluster membership changes

Adds the core logic for transparently retrying queries that fail due to
cluster membership changes (IMPALA-9124). Query retries are triggered if
(1) a node has been removed from the cluster membership by a statestore
update (rather than cancelling all queries running on the removed node,
queries are retried), or (2) a query fails and, as a result, blacklists
a node. Either event is considered a cluster membership change because
it affects which nodes a query will be scheduled on. The assumption is
that a retry of the query with the updated cluster membership will
succeed.

A query retry is modelled as a brand new query, with its own query id.
This simplifies the implementation and the resulting runtime profiles
when queries are retried.

Core Features:
* Retries are transparent to the user; no modifications to client
  libraries are necessary to support query retries
* Retried queries skip all fe/ parsing, planning, authorization, etc.
* Retries are configurable ('retry_failed_queries') and are off by
  default

Implementation:
* When a query is retried, the original query is cancelled, the new
  query is created, registered, and started, and then the original
  query is closed
* A new layer of abstraction between the ImpalaServer and
  ClientRequestState has been added; it is called the QueryDriver
* Each ClientRequestState is treated as a single attempt of a query,
  and the QueryDriver owns all ClientRequestStates for a query
* ClientRequestState has a new state object called RetryState; a
  ClientRequestState can be NOT_RETRIED, RETRYING, or RETRIED
* The QueryDriver owns the TExecRequest for the query as well; it is
  re-used for each query retry

Observability:
* Users can tell if a query is retried using runtime profiles and the
  Impala Web UI
* Runtime profiles of queries that fail and then are retried will have:
    * "Retry Status: RETRIED"
    * "Retry Cause: [the error that triggered the retry]"
    * "Retried Query Id: [the query id of the retried query]"
* Runtime profiles of the retried query (i.e. the second attempt of the
  query) will include:
    * "Original Query Id: [the query id of the original query]"
* The Impala Web UI will list all retried queries as being in the
  "RETRIED" state

Testing:
* Added E2E tests in test_query_retries.py; looped tests for a few days
* Added a stress test query_retries_stress_runner.py that runs
  concurrent streams of a TPC workload and randomly kills impalads
* Ran the stress test with various configurations: tpch on parquet,
  tpcds on parquet, tpch 30 GB on parquet (one stream), tpcds 30 GB on
  parquet (one stream), tpch on text, tpcds on text
* Ran exhaustive tests
* Ran exhaustive tests with 'retry_failed_queries' set to true, no
  unexpected failures
* Ran 30 GB TPC-DS workload on a 3 node cluster, randomly restarted
  impalads, and manually verified that queries were retried
* Manually tested that retries work with various clients, specifically
  the impala-shell and Hue
* Ran core tests and query retry stress test against an ASAN build
* Ran concurrent_select.py to stress query cancellation
* Ran be/ tests against a TSAN build, filed IMPALA-9730 as a follow up

Limitations:
* There are several limitations, which are listed in the parent JIRA

Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/CMakeLists.txt
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
A be/src/runtime/query-driver.cc
A be/src/runtime/query-driver.h
M be/src/service/CMakeLists.txt
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/control-service.cc
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-http-handler.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
R be/src/service/query-driver-map.cc
A be/src/service/query-driver-map.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/testutil/impalad-query-executor.cc
M be/src/testutil/impalad-query-executor.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/generate_error_codes.py
M tests/common/impala_cluster.py
M tests/common/impala_service.py
A tests/custom_cluster/test_query_retries.py
A tests/stress/query_retries_stress_runner.py
28 files changed, 2,458 insertions(+), 510 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/24/14824/23
--
To view, visit http://gerrit.cloudera.org:8080/14824
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e4a0e72a9bf8ec10b91639aefd81bef17886ddd
Gerrit-Change-Number: 14824
Gerrit-PatchSet: 23
Gerrit-Owner: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>