Hello Sailesh Mukil, I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/3343

to look at the new patch set (#18).

Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout
......................................................................

IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout for all backend client RPC calls
to avoid query hangs.

Impala doesn't set a socket send/recv timeout for backend clients, so RPC
calls wait forever for data. In extreme cases, such as a bad network or a
kernel panic on the destination host, the sender never gets a response
and the RPC call hangs. A query hang is hard to detect. If the hang
happens in ExecRemoteFragment() or CancelPlanFragments(), the query
cannot be cancelled unless you restart the coordinator.

Added send/recv timeouts to all RPC calls to avoid query hangs. For the
catalog client, keep the default timeout at 0 (no timeout) because
ExecDdl() can take a very long time if the table has many partitions,
mainly waiting for HMS API calls.

Added a new RPC call RetryRpcRecv() to wait longer for the receiver's
response. This is needed by certain RPCs. For example, with
TransmitData() from DataStreamSender, the receiver can hold its response
to apply back pressure.

If an RPC call fails, we don't put the underlying connection back into
the cache but close it. This makes sure the bad state of that connection
won't cause more RPC failures.

Added retry for the CancelPlanFragment RPC. This reduces the chance that
a cancel request gets lost due to an unstable network, but it can make
cancellation take longer and make test_lifecycle.py more flaky: the
metric num-fragments-in-flight might not be 0 yet due to previous tests.
Modified the test to check the metric delta instead of comparing to 0 to
reduce flakiness. However, this might not capture some failures.

Besides the new EE test, I used the following iptables rules to inject
network failures to make sure RPC calls never hang.
1. Block network traffic on a port completely
   iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network
   iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/service/fragment-exec-state.cc
M be/src/service/fragment-mgr.cc
M be/src/service/impala-server.cc
M be/src/statestore/statestore.cc
A be/src/testutil/fault-injection-util.h
M be/src/util/error-util-test.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
M tests/query_test/test_lifecycle.py
M tests/verifiers/metric_verifier.py
18 files changed, 378 insertions(+), 53 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/18
--
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 18
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Alan Choi <a...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com>
Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
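
For readers who want a concrete picture of the three behaviors the commit message describes (a receive timeout, a retried receive, and closing a connection after a failed RPC instead of returning it to the cache), here is a minimal Python sketch. It is not the patch's C++ code. The names recv_with_retry, ClientCache, and do_rpc are hypothetical and chosen only for illustration, and the timeout and attempt counts are arbitrary assumptions.

```python
import socket


def recv_with_retry(sock, nbytes, timeout_s, max_attempts):
    """Receive a reply, retrying when the receive times out (hypothetical helper)."""
    sock.settimeout(timeout_s)  # without a timeout, recv() can block forever
    for attempt in range(1, max_attempts + 1):
        try:
            return sock.recv(nbytes)
        except socket.timeout:
            if attempt == max_attempts:
                raise  # give up; the caller decides what happens to the connection


class ClientCache:
    """Toy connection cache (hypothetical; not Impala's ClientCache)."""

    def __init__(self, connect):
        self._connect = connect  # factory that returns a connected socket
        self._idle = []          # connections available for reuse

    def do_rpc(self, rpc):
        # Reuse an idle connection if there is one, otherwise open a new one.
        conn = self._idle.pop() if self._idle else self._connect()
        try:
            result = rpc(conn)
        except OSError:
            conn.close()  # failed RPC: close rather than cache a possibly bad connection
            raise
        self._idle.append(conn)  # success: return the connection to the cache
        return result
```

A caller would pass a function such as `lambda conn: recv_with_retry(conn, 4096, 1.0, 3)` to `do_rpc`. A peer that never answers then surfaces as a timeout error rather than an indefinite hang, and the failed connection is not reused.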