Hello Sahil Takiar, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/15154 to look at the new patch set (#5). Change subject: IMPALA-8712: Make ExecQueryFInstances async ...................................................................... IMPALA-8712: Make ExecQueryFInstances async This patch refactors the ExecQueryFInstances rpc to be asychronous. Previously, Impala would issue all the Exec()s, wait for all of them to complete, and then check if any of them resulted in an error. We now stop issuing Exec()s and cancel any that are still in flight as soon as an error occurs. It also performs some cleanup around the thread safety of Coordinator::BackendState, including adding comments and DCHECKS. === Exec RPC Thread Pool === This patch also removes the 'exec_rpc_thread_pool_' from ExecEnv. This thread pool was used to partially simulate async Exec() prior to the switch to KRPC, which provides built-in async rpc capabilities. Removing this thread pool has potential performance implications, as it means that the Exec() parameters are serialized in serialize rather than in parallel (with the level of parallelism determined by the size of the thread pool, which was configurable by an Advanced flag and defaulted to 12). To ensure we don't regress query startup times, I did some performance testing. All tests were done on a 10 node cluster. The baseline used for the tests did not include IMPALA-9181, a perf optimization for query startup done to facilitate this work. I ran TPCH 100 at concurrency levels of 1, 4, and 8 and extracted the query startup times from the profiles. For each concurrency level, the average regression in query startup time was < 2ms. Because query e2e running time was much longer than this, there was no noticable change in total query time. I also ran a 'worst case scenario' with a table with 10,000 pertitions to create a very large Exec() payload to serialize (~1.21MB vs. ~10KB-30KB for TPCH 100). Again, change in query startup time was neglible. ============================ Testing: - Added a e2e test that verifies that a query where an Exec() fails doesn't wait for all Exec()s to complete before cancelling and returning the error to the client. Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32 --- M be/src/common/global-flags.cc M be/src/runtime/coordinator-backend-state.cc M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/runtime/exec-env.cc M be/src/runtime/exec-env.h M be/src/runtime/query-state.cc M be/src/util/counting-barrier.h M tests/custom_cluster/test_rpc_exception.py M tests/failure/test_failpoints.py 11 files changed, 444 insertions(+), 266 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/15154/5 -- To view, visit http://gerrit.cloudera.org:8080/15154 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I33ec96e5885af094c294cd3a76c242995263ba32 Gerrit-Change-Number: 15154 Gerrit-PatchSet: 5 Gerrit-Owner: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Joe McDonnell <joemcdonn...@cloudera.com> Gerrit-Reviewer: Sahil Takiar <stak...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>