Hello Michael Ho, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11339

to look at the new patch set (#3).

Change subject: IMPALA-7464: fix race when ExecFInstance() RPC times out
......................................................................

IMPALA-7464: fix race when ExecFInstance() RPC times out

The "exec resources" reference count on the QueryState expects that it
will transition from 0 -> non-zero -> 0 at most once. The reference
count is taken on the coordinator side (sender of this RPC) and also on
the backend (receiver of this RPC). Usually, the lifetimes of those
references overlap (the coordinator won't give up its reference until
the backend execution has completed or failed), so the assumption is
not violated. However, when the RPC times out, the receiver may run
after the sender has given up its reference (since the sender doesn't
know the receiver is actually still executing).

As it turns out, the coordinator doesn't need to take a reference given
the current code (verified via code inspection), since these resources
are backend-only. So, stop taking the reference on the coordinator
side, and add some DCHECKs to document that (the DCHECKs aren't
particularly good at verifying it, however, since the lifetimes will
generally overlap).

Note that this patch can't be easily backported to older versions
without careful inspection, since older versions of the code may have
relied on the reference count to protect things used by the
coordinator.

Testing:
- New test_rpc_timeout case that reproduced the problem 100% of the time
- exhaustive build

Change-Id: If60d983e0e68b00e6557185db1f86757ab8b3f2d
---
M be/src/benchmarks/process-wide-locks-benchmark.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/test-env.cc
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 76 insertions(+), 50 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/11339/3
--
To view, visit http://gerrit.cloudera.org:8080/11339
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If60d983e0e68b00e6557185db1f86757ab8b3f2d
Gerrit-Change-Number: 11339
Gerrit-PatchSet: 3
Gerrit-Owner: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
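
For anyone reading the commit message without the code at hand, the "0 -> non-zero -> 0 at most once" expectation and the way a timed-out RPC can violate it could be sketched roughly as below. This is a standalone, single-threaded illustration and not Impala code: OneShotRefCount, its methods, and the three scenario functions are hypothetical names made up for this sketch, and the real QueryState logic is more involved.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the "exec resources" reference count.
// This is NOT Impala's QueryState; names and behavior are illustrative only.
class OneShotRefCount {
 public:
  // Takes a reference. Returns false if the count has already dropped back
  // to zero once, i.e. this would be a second 0 -> non-zero transition.
  bool Acquire() {
    std::lock_guard<std::mutex> l(lock_);
    if (released_) return false;
    ++count_;
    return true;
  }

  // Drops a reference. Returns true if this call dropped the last reference,
  // which is the point where the resources would be released.
  bool Release() {
    std::lock_guard<std::mutex> l(lock_);
    if (count_ == 0) return false;  // Unbalanced release; ignored in this sketch.
    if (--count_ > 0) return false;
    released_ = true;
    return true;
  }

 private:
  std::mutex lock_;
  int count_ = 0;
  bool released_ = false;
};

// Usual case: the sender keeps its reference until the receiver is done,
// so the two lifetimes overlap and every Acquire() succeeds.
bool OverlappingLifetimes() {
  OneShotRefCount refs;
  bool ok = refs.Acquire();   // coordinator (sender)
  ok = refs.Acquire() && ok;  // backend (receiver) starts while sender holds a ref
  refs.Release();             // backend finishes
  refs.Release();             // coordinator gives up its reference last
  return ok;
}

// Timeout case: the sender gives up its reference before the receiver runs,
// so the receiver's Acquire() would be a second 0 -> non-zero transition.
bool SenderGivesUpFirst() {
  OneShotRefCount refs;
  bool ok = refs.Acquire();   // coordinator (sender)
  refs.Release();             // RPC timed out: count returns to zero
  ok = refs.Acquire() && ok;  // backend runs late and is rejected
  return ok;
}

// Approach described in the commit message: only the backend takes a reference.
bool BackendOnly() {
  OneShotRefCount refs;
  bool ok = refs.Acquire();   // backend (receiver)
  refs.Release();             // backend finishes
  return ok;
}

// Checks the three outcomes with assert(); call this from main() to run it.
void CheckScenarios() {
  assert(OverlappingLifetimes());
  assert(!SenderGivesUpFirst());
  assert(BackendOnly());
}
```

The sketch replays the interleavings in a fixed order rather than racing real threads, which keeps the outcome deterministic. It only models the counting; it says nothing about which resources the real reference protects.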