[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.
Hello Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4080 to look at the new patch set (#4). Change subject: IMPALA-3912: test_random_rpc_timeout is flaky. .. IMPALA-3912: test_random_rpc_timeout is flaky. The datastream sender's default timeout is 2 minutes, which could block some fragments from completing until the timeout and keep the metric "num-fragments-in-flight" from returning to 0 within 60 seconds. Decrease the sender timeout to 30 seconds and add some logging. Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 --- M be/src/service/fragment-mgr.cc M tests/custom_cluster/test_rpc_timeout.py 2 files changed, 9 insertions(+), 4 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/4080/4 -- To view, visit http://gerrit.cloudera.org:8080/4080 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
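The fix above hinges on a metric-polling check in the test: if in-flight fragments can take up to the 2-minute sender timeout to finish, a 60-second wait for the metric will intermittently fail. A minimal sketch of that polling pattern (names here are hypothetical, not Impala's actual test API):

```python
import time

def wait_for_metric_value(get_metric, expected, timeout_s=60, interval_s=1):
    """Poll get_metric() until it returns `expected` or timeout_s elapses.

    Hypothetical helper mirroring the metric check in the test: if fragments
    stay in flight longer than timeout_s (e.g. blocked on a 2-minute
    datastream sender timeout), this raises and the test looks flaky.
    """
    deadline = time.monotonic() + timeout_s
    last = None
    while time.monotonic() < deadline:
        last = get_metric()
        if last == expected:
            return last
        time.sleep(interval_s)
    raise AssertionError(
        "metric did not reach %s within %ss (last value: %s)"
        % (expected, timeout_s, last))
```

Shrinking the sender timeout below the polling deadline (as this patch set does) keeps the slow path inside the window instead of enlarging the window and the test's runtime.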
[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.
Juan Yu has posted comments on this change. Change subject: IMPALA-3912: test_random_rpc_timeout is flaky. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/4080/3//COMMIT_MSG Commit Message: PS3, Line 11: datastream sender timeout is 2 mins > is the problem that some fragment gets blocked waiting for a receiver that Good idea; that also avoids increasing the test execution time. -- To view, visit http://gerrit.cloudera.org:8080/4080 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.
Juan Yu has posted comments on this change. Change subject: IMPALA-3912: test_random_rpc_timeout is flaky. .. Patch Set 3: Code-Review+1 Carry previous +1 -- To view, visit http://gerrit.cloudera.org:8080/4080 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.
Hello Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4080 to look at the new patch set (#3). Change subject: IMPALA-3912: test_random_rpc_timeout is flaky. .. IMPALA-3912: test_random_rpc_timeout is flaky. A query might take more than one minute to clean up, which keeps the metric "num-fragments-in-flight" from returning to 0. For example, the datastream sender timeout is 2 minutes. So increase the metrics check timeout to 3 minutes and add some logging. Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 --- M be/src/service/fragment-mgr.cc M tests/custom_cluster/test_rpc_timeout.py 2 files changed, 5 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/4080/3 -- To view, visit http://gerrit.cloudera.org:8080/4080 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.
Juan Yu has posted comments on this change. Change subject: IMPALA-3912: test_random_rpc_timeout is flaky. .. Patch Set 2: I am not able to reproduce the test failure with this test locally or on a RHEL machine, but I tried with a few other tests: I added the same metrics check to test_cancellation.py and test_failpoints.py, hit the sender timeout, and the metrics check could fail with a 60s timeout. -- To view, visit http://gerrit.cloudera.org:8080/4080 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.
Juan Yu has posted comments on this change. Change subject: IMPALA-3912: test_random_rpc_timeout is flaky. .. Patch Set 2: Code-Review+1 Carry Sailesh's +1 -- To view, visit http://gerrit.cloudera.org:8080/4080 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.
Hello Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4080 to look at the new patch set (#2). Change subject: IMPALA-3912: test_random_rpc_timeout is flaky. .. IMPALA-3912: test_random_rpc_timeout is flaky. Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 --- M be/src/service/fragment-mgr.cc M tests/custom_cluster/test_rpc_timeout.py 2 files changed, 5 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/4080/2 -- To view, visit http://gerrit.cloudera.org:8080/4080 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.
Juan Yu has posted comments on this change. Change subject: IMPALA-3912: test_random_rpc_timeout is flaky. .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/4080/1/be/src/service/fragment-mgr.cc File be/src/service/fragment-mgr.cc: Line 103: VLOG_QUERY << "PlanFragment completed. instance_id=" > nit: I think this fits into a single line. Please ignore if it doesn't Done -- To view, visit http://gerrit.cloudera.org:8080/4080 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.
Juan Yu has posted comments on this change. Change subject: IMPALA-3912: test_random_rpc_timeout is flaky. .. Patch Set 1: Couldn't find why the metric "num-fragments-in-flight" doesn't go back to 0. One suspect is that a query might take more than one minute to clean up; for example, the datastream sender timeout is 2 minutes. So I increased the metrics check timeout and added some logging. -- To view, visit http://gerrit.cloudera.org:8080/4080 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3912: test random rpc timeout is flaky.
Juan Yu has uploaded a new change for review. http://gerrit.cloudera.org:8080/4080 Change subject: IMPALA-3912: test_random_rpc_timeout is flaky. .. IMPALA-3912: test_random_rpc_timeout is flaky. Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 --- M be/src/service/fragment-mgr.cc M tests/custom_cluster/test_rpc_timeout.py 2 files changed, 6 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/4080/1 -- To view, visit http://gerrit.cloudera.org:8080/4080 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I19f8b3fea66c5a0398e3476a46f060be9f951983 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com>
[Impala-ASF-CR] IMPALA-3470: DecompressorTest is flaky.
Juan Yu has posted comments on this change. Change subject: IMPALA-3470: DecompressorTest is flaky. .. Patch Set 3: Code-Review+2 Rebase, Carry Dan's +2 -- To view, visit http://gerrit.cloudera.org:8080/3954 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3470: DecompressorTest is flaky.
Juan Yu has posted comments on this change. Change subject: IMPALA-3470: DecompressorTest is flaky. .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/3954/1/be/src/util/decompress-test.cc File be/src/util/decompress-test.cc: Line 237: int truncated = rand() % 512 + 1; > Can we add an ASSERT_LE(truncated, compressed_len)? Done -- To view, visit http://gerrit.cloudera.org:8080/3954 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-HasComments: Yes
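The discussion above centers on making the truncation in `decompress-test.cc` well-defined: `rand() % 512` could yield 0 (no truncation), and the cut must not exceed the compressed length. A small Python sketch of the intended invariants (names hypothetical; the actual test is C++):

```python
import random

def pick_truncation_point(compressed_len, max_cut=512):
    """Choose how many trailing bytes to drop so a compressed stream is
    genuinely truncated.

    Hypothetical sketch of the fix discussed above: the cut is at least
    1 byte (so the stream really is truncated) and is asserted to be no
    larger than the compressed length, mirroring the suggested
    ASSERT_LE(truncated, compressed_len).
    """
    cut = random.randint(1, max_cut)   # analogous to rand() % 512 + 1
    assert cut <= compressed_len       # there must be something left to drop
    return compressed_len - cut        # length of the truncated stream
```

Without the `+ 1`, a zero-byte cut leaves the stream intact and the "decompress truncated input must fail" expectation intermittently breaks, which is exactly the flakiness being fixed.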
[Impala-ASF-CR] IMPALA-3470: DecompressorTest is flaky.
Juan Yu has uploaded a new patch set (#2). Change subject: IMPALA-3470: DecompressorTest is flaky. .. IMPALA-3470: DecompressorTest is flaky. Make sure the random number is greater than 0 and the stream is indeed truncated. Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a --- M be/src/util/decompress-test.cc 1 file changed, 2 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/3954/2 -- To view, visit http://gerrit.cloudera.org:8080/3954 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
[Impala-ASF-CR] IMPALA-3470: DecompressorTest is flaky.
Juan Yu has uploaded a new change for review. http://gerrit.cloudera.org:8080/3954 Change subject: IMPALA-3470: DecompressorTest is flaky. .. IMPALA-3470: DecompressorTest is flaky. Make sure the random number is greater than 0 and the stream is indeed truncated. Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a --- M be/src/util/decompress-test.cc 1 file changed, 1 insertion(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/54/3954/1 -- To view, visit http://gerrit.cloudera.org:8080/3954 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I7ebaa403abf45e31f38d6cf4e557d6274d877a8a Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Juan Yu <j...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 21: (15 comments) http://gerrit.cloudera.org:8080/#/c/3343/21/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: PS21, Line 227: is > delete "is" Done Line 304: TNetworkAddress address_; > can't we get this from client_->address()? "client_" is not an instance of ThriftClientImpl http://gerrit.cloudera.org:8080/#/c/3343/21/be/src/runtime/exec-env.cc File be/src/runtime/exec-env.cc: PS21, Line 134: 30 > Is there a short comment you could write to justify how this was chosen (5 Done PS21, Line 134: The time after " : "which a backend client send/recv RPC call will timeout. > The send/recv connection timeout in milliseconds for a backend client RPC. This is the underlying TSocket send/recv call timeout, not the connection timeout. PS21, Line 138: > same Done PS21, Line 157: 0 > why is this 0? (wait_ms) This is for retrying opening a connection; each retry usually takes several seconds, so waiting even longer won't help much. PS21, Line 162: 100 > how was this chosen? I'll set this to 0. Line 223: "", !FLAGS_ssl_client_ca_certificate.empty())), > not your change, but it's really unfortunate we duplicate this code. let's I'll add a TODO here. http://gerrit.cloudera.org:8080/#/c/3343/21/be/src/testutil/fault-injection-util.h File be/src/testutil/fault-injection-util.h: PS21, Line 36: RPC_RANDOM > comment that this must be last Done PS21, Line 39: call > delete Done PS21, Line 40: timeout > this is the recv connection timeout, correct? if so, how about saying "recv Done PS21, Line 41: RpcCallType my_type, int32_t rpc_type, int32_t delay_ms > document these. Done PS21, Line 44: rpc_type == RPC_NULL > what is specifying RPC_NULL used for? Just for easy testing: you can enable or disable the fault injection by changing the value, with no need to add/remove the startup flag.
In the future, we could change this value dynamically to do more testing. Line 50: FLAGS_fault_injection_rpc_type, FLAGS_fault_injection_rpc_delay_ms) > why pass these as arguments rather than just having InjectRpcDelay() read t Similar reason as above: we could test with dynamic values without needing to restart the cluster. http://gerrit.cloudera.org:8080/#/c/3343/21/tests/custom_cluster/test_rpc_timeout.py File tests/custom_cluster/test_rpc_timeout.py: Line 119: self.execute_query_verify_metrics(self.TEST_QUERY, 10) > how long do all these tests take to execute? let's run them only in exhaus About 5 minutes. OK, I'll change them to execute only in exhaustive mode. -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 21 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
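The fault-injection flags discussed above (`fault_injection_rpc_type`, `fault_injection_rpc_delay_ms`, and the `RPC_NULL`/`RPC_RANDOM` sentinels) can be sketched roughly as follows. This is a hypothetical Python rendering of the behavior described in the review, not Impala's C++ `InjectRpcDelay()`; the enum values and selection logic are assumptions:

```python
import random
import time

# Hypothetical RPC type constants; RPC_RANDOM must stay last because it
# doubles as the bound when picking a random RPC type to delay.
RPC_NULL = 0               # injection disabled
RPC_EXECPLANFRAGMENT = 1
RPC_CANCELPLANFRAGMENT = 2
RPC_RANDOM = 3             # delay a randomly chosen RPC type

def maybe_inject_rpc_delay(my_type, fault_type, delay_ms):
    """Delay the current RPC handler if fault injection selects it.

    fault_type == RPC_NULL disables injection without removing the flag;
    RPC_RANDOM delays a randomly chosen RPC type; any other value delays
    only the matching RPC type, so the caller's send/recv timeout fires.
    Returns True if a delay was injected.
    """
    if fault_type == RPC_NULL or delay_ms <= 0:
        return False
    if fault_type == RPC_RANDOM:
        if random.randint(1, RPC_RANDOM - 1) != my_type:
            return False
    elif fault_type != my_type:
        return False
    time.sleep(delay_ms / 1000.0)
    return True
```

Passing the flag values as arguments (rather than reading the gflags inside the helper) is what allows tests to supply dynamic values without restarting the cluster, as noted in the reply above.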
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 20: (3 comments) http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: PS20, Line 224: indicate > Right, but in the case of success, it's not really a "retry". That's just s Done Line 265: } catch (apache::thrift::TException& e) { > I'm fine with leaving this alone for now but it seems a little weird. Let me add a Todo here. http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: PS20, Line 231: timeout_ms > It'd be preferred to remove the timeout_ms param and this extra logic until Done -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 20 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 20: (1 comment) http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/service/fragment-mgr.cc File be/src/service/fragment-mgr.cc: Line 38: FAULT_INJECTION_RPC_DELAY(RPC_EXECPLANFRAGMENT); > is there a reason we decided to put these here rather than in impala-intern No, I forgot we have a proxy for those RPC calls. You are right, it's better to put them there. -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 20 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 20: (23 comments) http://gerrit.cloudera.org:8080/#/c/3343/20//COMMIT_MSG Commit Message: Line 12: Impala doesn't set socket send/recv timeout for backend client. > Prior to this change, ... Done PS20, Line 24: new RPC call RetryRpcRecv() > This makes it sound like RetryRpcRecv() is a new RPC. I haven't looked at Done PS20, Line 28: call > delete (C of RPC stands for call) Done http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/common/global-flags.cc File be/src/common/global-flags.cc: PS20, Line 102: call > delete Done PS20, Line 102: which trigger caller hits RPC timeout. > to trigger an RPC timeout on the caller side. Done http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.cc File be/src/runtime/client-cache.cc: Line 162: client_map_.erase(client); > is it legal to use the 'client' iterator even though the lock was dropped i http://kera.name/articles/2011/06/iterator-invalidation-rules/ For insert, iterators are not affected. For erase, only iterators to the erased elements are affected. The client is not shared, so there shouldn't be more than one thread trying to destroy the same client, right? Line 163: *client_key = NULL; > why doesn't this method need to remove from the per_host_caches_ as well? T Connections cannot be shared. When a connection is in use, it is taken out of per_host_caches, so we just don't put it back. The header comment means client_map_. I'll clarify that. http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: PS20, Line 96: it > what is "it"? the client? or the connection? does this NULL out *client_ke "it" means the client.
DestroyClient NULLs out *client_key like ReleaseClient(). PS20, Line 96: cache > is this "the per-host cache" that is referenced everywhere else, or somethi Done PS20, Line 224: indicate > indicates if returned status is okay, means both RPC are completed. Whether retry is safe really depends on the data being sent here. If this is TransmitData, retry is not safe, but if this is ReportStatusCb, retry is fine. Only the caller knows that info. Line 225: /// the whole RPC call. > explain this more. is it because if the send RPC call failed, then the han That's right. Line 260: if (!status.ok()) { > so this is the only path where we have *retry_is_safe == true and an error Done Line 265: } catch (apache::thrift::TException& e) { > why does this path not need the timeout check also? I think the original logic is to give it a second chance, but if it fails again, just fail the whole RPC. Line 271: client_is_unrecoverable_ = false; > why is this logic needed now, whereas it wasn't needed before? or is this a Yes, it's another bug fix; I need to fix it in this patch, otherwise the broken connection will cause lots of problems. IMPALA-2864 PS20, Line 301: the > whether the see my previous patch and Henry's comment. "last_rpc_succeeded_" matches the code logic better; "client_is_unrecoverable_" describes the state more clearly. Not sure which one is better. http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: Line 115: ImpalaBackendClientCache* client_cache_; > now that we store the RuntimeState, let's get rid of this and just get the Done PS20, Line 153: wait > retry waiting for the response? Done PS20, Line 153: call > delete "call", and would be nice to write RPC Done PS20, Line 231: timeout_ms > where is this called with a non-zero timeout_ms? For now, we'd like to wait indefinitely to avoid a logic change.
But in the future, even after replacing the RPC implementation, it's better to have a reasonable timeout for all RPCs. Line 239: timeout_ms)); > this will throw away the original status message. Maybe use AddDetail() in Done http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/service/fragment-exec-state.cc File be/src/service/fragment-exec-state.cc: Line 125: if (!retry_is_safe) break; > so i guess this was a bug in the old code? Yes. But for the ReportExecStatus RPC, a resend shouldn't cause any issue; the handler just processes the same status multiple times. http://gerrit.cloudera.org:8080/#/c/3343/20/be/src/statestore/statestore.cc File be/src/statestore/statestore.cc: PS20, Line 658: Rewrite > Add details to Done PS20, Line 669: Rewrite > same Done -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 20 Gerrit-Project: Impa
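The review above describes a recv-side retry: the receiver may legitimately hold its response to apply back pressure (e.g. TransmitData), so a short socket recv timeout alone must not fail the RPC. A minimal sketch of that idea, with hypothetical names standing in for the C++ RetryRpcRecv():

```python
import time

def retry_rpc_recv(recv_once, timeout_s, interval_s=1.0):
    """Keep retrying the recv side of an RPC until a larger deadline.

    recv_once() is a hypothetical callable returning (True, payload) on
    success or (False, None) on a socket recv timeout. A single short
    socket timeout does not fail the RPC; only missing the overall
    deadline does, which lets the receiver hold the response to apply
    back pressure without killing the call.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        ok, result = recv_once()
        if ok:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "RPC recv did not complete within %ss" % timeout_s)
        time.sleep(interval_s)
```

Note that this retries only the recv: resending the request is a separate decision, since (as discussed above) resending TransmitData is not safe while resending ReportStatusCb is, and only the caller knows which case applies.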
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Hello Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3343 to look at the new patch set (#19). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hangs. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In the extreme case of a bad network, or a destination host hitting a kernel panic, the sender will not get a response and the RPC call will hang. A query hang is hard to detect: if the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs. For the catalog client, keep the default timeout at 0 (no timeout) because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Added a new call RetryRpcRecv() to wait longer for the receiver's response. This is needed by certain RPCs: for example, with TransmitData() by DataStreamSender, the receiver could hold the response to apply back pressure. If an RPC call fails, we don't put the underlying connection back into the cache but close it, to make sure the bad state of this connection won't cause more RPC failures. Added a retry for the CancelPlanFragment RPC. This reduces the chance that a cancel request gets lost due to an unstable network, but it can make cancellation take longer and makes test_lifecycle.py more flaky. The metric num-fragments-in-flight might not be 0 yet due to previous tests, so the test was modified to check the metric delta instead of comparing to 0. This reduces flakiness but might not capture some failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1.
Block network traffic on a port completely iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2. Randomly drop 5% of TCP packets to slow down the network iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/common/global-flags.cc M be/src/rpc/thrift-util.cc M be/src/rpc/thrift-util.h M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/service/fragment-exec-state.cc M be/src/service/fragment-mgr.cc M be/src/service/impala-server.cc M be/src/statestore/statestore.cc A be/src/testutil/fault-injection-util.h M be/src/util/error-util-test.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py M tests/query_test/test_lifecycle.py M tests/verifiers/metric_verifier.py 18 files changed, 378 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/19 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 19 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
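The delta-based metric check described in the commit message above can be sketched as follows (helper name hypothetical; the real tests use Impala's MetricVerifier): instead of requiring num-fragments-in-flight to be exactly 0, compare each impalad's final value against the value recorded before the test ran, since earlier tests may have left the absolute value nonzero.

```python
def fragments_in_flight_delta_ok(before, after):
    """Return True if no host has more in-flight fragments after the test
    than it had before it started.

    `before` and `after` map host -> metric value; checking the delta
    instead of an absolute 0 tolerates leftovers from previous tests,
    at the cost of possibly missing some failures.
    """
    return all(after.get(host, 0) <= count for host, count in before.items())
```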
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Hello Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3343 to look at the new patch set (#18). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hangs. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In the extreme case of a bad network, or a destination host hitting a kernel panic, the sender will not get a response and the RPC call will hang. A query hang is hard to detect: if the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs. For the catalog client, keep the default timeout at 0 (no timeout) because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Added a new call RetryRpcRecv() to wait longer for the receiver's response. This is needed by certain RPCs: for example, with TransmitData() by DataStreamSender, the receiver could hold the response to apply back pressure. If an RPC call fails, we don't put the underlying connection back into the cache but close it, to make sure the bad state of this connection won't cause more RPC failures. Added a retry for the CancelPlanFragment RPC. This reduces the chance that a cancel request gets lost due to an unstable network, but it can make cancellation take longer and makes test_lifecycle.py more flaky. The metric num-fragments-in-flight might not be 0 yet due to previous tests, so the test was modified to check the metric delta instead of comparing to 0. This reduces flakiness but might not capture some failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1.
Block network traffic on a port completely iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2. Randomly drop 5% of TCP packets to slow down the network iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/common/global-flags.cc M be/src/rpc/thrift-util.cc M be/src/rpc/thrift-util.h M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/service/fragment-exec-state.cc M be/src/service/fragment-mgr.cc M be/src/service/impala-server.cc M be/src/statestore/statestore.cc A be/src/testutil/fault-injection-util.h M be/src/util/error-util-test.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py M tests/query_test/test_lifecycle.py M tests/verifiers/metric_verifier.py 18 files changed, 378 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/18 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 18 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 17: (17 comments) http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: PS17, Line 203: if (client_ != NULL) { : if (!has_rpc_error_) { : client_cache_->ReleaseClient(_); : } else { : client_cache_->DestroyClient(_); : } : } > is this path covered by your tests - i.e. the connection will be reopened i Yes, test_random_rpc_timeout covers this. It runs multiple queries and each will hit the RPC timeout, so the connection will be destroyed; subsequent queries will create new connections. Line 301: /// Indicate the rpc call sent by this connection succeeds or not. If the rpc call fails > last rpc call? Done PS17, Line 304: has_rpc_error_ > I would call this something more descriptive, like "cnxn_not_recoverable_". How about just "last_rpc_succeed_", so it matches the logic with less confusion? http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: PS17, Line 233: (timeout_ms == 0) ? 0 : > can remove this, and just check timeout_ms in the while condition. Done PS17, Line 236: deadline > timeout_ms Done http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/service/fragment-exec-state.cc File be/src/service/fragment-exec-state.cc: PS17, Line 116: can_retry > call this "retry_is_safe" ? Done http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/service/fragment-mgr.cc File be/src/service/fragment-mgr.cc: PS17, Line 37: #ifndef NDEBUG : DECLARE_int32(fault_injection_rpc_delay_ms); : DECLARE_int32(fault_injection_rpc_type); : #define FAULT_INJECTION_RPC_DALAY(type) InjectRpcDelay(type, \ : FLAGS_fault_injection_rpc_type, FLAGS_fault_injection_rpc_delay_ms) : #else : #define FAULT_INJECTION_RPC_DALAY(type) : #endif > would it be better to put this in fault-injection-util.h? Figured out the compilation issue.
Yes, better to put this in fault-injection-util.h and avoid duplicate code. http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/service/impala-server.cc File be/src/service/impala-server.cc: PS17, Line 186: DALAY > DELAY Done http://gerrit.cloudera.org:8080/#/c/3343/17/be/src/testutil/fault-injection-util.h File be/src/testutil/fault-injection-util.h: PS17, Line 15: _ > nit: remove trailing _ Done http://gerrit.cloudera.org:8080/#/c/3343/17/tests/custom_cluster/test_rpc_timeout.py File tests/custom_cluster/test_rpc_timeout.py: PS17, Line 34: i > nit: use _ to show that we don't care about the variable Done PS17, Line 61: test_runtime_filter_rpc > won't pytest try to run this as a test, since it's called test_*? It does. I'll change the name. PS17, Line 72: 3000 > is this definitely large enough to time out? I don't see if you verify that Yes, since backend_client_rpc_timeout_ms is set to 1000. I verified in the log that each RPC call did hit the timeout. However, hitting the timeout doesn't necessarily lead to complete query failure. It triggers fragment cancellation, and Cancel() just sets the runtime state's cancel flag; it doesn't force query execution to stop. By the time an exec node checks the cancel flag, execution could already be complete and the query finishes successfully. Also, some RPC failures don't affect query execution, like the runtime filter one. The only one that can guarantee query failure is ExecPlanFragment(), so I'll check for query failure there. PS17, Line 72: - > just want to check this works as intended with a missing '-' Yes, both "-" and "--" work, but I'll make them consistent. PS17, Line 73: rpc_timeout1 > nit: can you call these "test_execplanfragment_timeout" etc?
Done http://gerrit.cloudera.org:8080/#/c/3343/17/tests/query_test/test_lifecycle.py File tests/query_test/test_lifecycle.py: PS17, Line 36: \ > not needed Done PS17, Line 52: \ > remove Done http://gerrit.cloudera.org:8080/#/c/3343/17/tests/verifiers/metric_verifier.py File tests/verifiers/metric_verifier.py: PS17, Line 68: wait_for_metric_reset > Is this class really necessary, or could we use MetricVerifier and always p We usually need to check metrics for all impalads. I'll modify MetricVerifier to keep the initial values. -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 17 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Ch
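The release-vs-destroy decision quoted from client-cache.h above (return a client to the cache only when its last RPC succeeded; otherwise close it so a bad connection can't cause further RPC failures) can be sketched roughly as follows. This is a hedged Python illustration, not Impala's actual C++ API; `ClientCache`, `finish_rpc`, and the method names here are stand-ins for the real code.

```python
class ClientCache:
    """Toy connection cache: reuse healthy connections, drop broken ones."""

    def __init__(self):
        self._pool = []

    def get_client(self):
        # Reuse a cached connection if one exists, else "open" a new one.
        return self._pool.pop() if self._pool else object()

    def release_client(self, client):
        # The last RPC succeeded: the connection is safe to reuse.
        self._pool.append(client)

    def destroy_client(self, client):
        # The last RPC failed: the connection may be in a bad state
        # (e.g. unread bytes left on the socket after a timeout), so
        # drop it instead of returning it to the pool.
        pass


def finish_rpc(cache, client, last_rpc_succeeded):
    # Mirrors the reviewed pattern: release on success, destroy on error.
    if last_rpc_succeeded:
        cache.release_client(client)
    else:
        cache.destroy_client(client)
```

The point the test comment above confirms is the recovery path: after a destroyed connection, the next `get_client()` simply opens a fresh one.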
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Hello Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3343 to look at the new patch set (#17). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hang issues. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In extreme cases, such as a bad network or a kernel panic on the destination host, the sender will not get a response and the RPC call will hang. A query hang is hard to detect. If the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs. For the catalog client, keep the default timeout at 0 (no timeout) because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Added a new RPC call RetryRpcRecv() to wait longer for the receiver's response. This is needed by certain RPCs. For example, with TransmitData() by DataStreamSender, the receiver could hold the response to add back pressure. If an RPC call fails, we don't put the underlying connection back into the cache but close it. This makes sure the bad state of this connection won't cause more RPC failures. Added a retry for the CancelPlanFragment RPC. This reduces the chance that a cancel request gets lost due to an unstable network, but it can cause cancellation to take longer and makes test_lifecycle.py more flaky. The metric num-fragments-in-flight might not be back to 0 yet due to previous tests. Modified the test to check the metric delta instead of comparing to 0 to reduce flakiness. However, this might not capture some failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1.
Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2. Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/common/global-flags.cc M be/src/rpc/thrift-util.cc M be/src/rpc/thrift-util.h M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/service/fragment-exec-state.cc M be/src/service/fragment-mgr.cc M be/src/service/impala-server.cc M be/src/statestore/statestore.cc A be/src/testutil/fault-injection-util.h M be/src/util/error-util-test.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py M tests/query_test/test_lifecycle.py M tests/verifiers/metric_verifier.py 18 files changed, 392 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/17 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 17 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 16: (6 comments) http://gerrit.cloudera.org:8080/#/c/3343/16/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: Line 226: /// Returns RPC_RECV_TIMEOUT if a timeout occurred, RPC_CLIENT_CONNECT_FAILURE if the > "while waiting for a response" Done Line 254: << e.what() << ", type=" << typeid(e).name(); > nit: align << with above Done PS16, Line 274: or certain RPCs, server could take long time to process it and intentionally block : /// the RPC response to add back pressure. In those cases, client needs to retry recv : /// RPC call to wait longer. > suggest: "In certain cases, the server may take longer to provide an RPC re Done PS16, Line 299: /// Indicate the rpc call sent by this connection succeeds or not. If the rpc call fails : /// for any reason, the connection could be left in a bad state and cannot be reused any : /// more. : bool has_rpc_error_; > Why not proactively destroy the connection when this condition is hit? As discussed offline, I don't want to destroy the connection right away because of the retry logic in DoRpc. http://gerrit.cloudera.org:8080/#/c/3343/16/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: PS16, Line 228: DoRpcTimedWait > Call this something less generic? Even DoTransmitDataRpc() would work. Done PS16, Line 232: (timeout_ms == 0) ? 0 : > if you just MonotonicMillis() + timeout_ms, and check for MonotonicMillis() A timeout of 0 means wait infinitely. I added a comment for that.
-- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 16 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Hello Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3343 to look at the new patch set (#16). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hang issues. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In extreme cases, such as a bad network or a kernel panic on the destination host, the sender will not get a response and the RPC call will hang. A query hang is hard to detect. If the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs. For the catalog client, keep the default timeout at 0 (no timeout) because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Added a new RPC call RetryRpcRecv() to wait longer for the receiver's response. This is needed by certain RPCs. For example, with TransmitData() by DataStreamSender, the receiver could hold the response to add back pressure. If an RPC call fails, we don't put the underlying connection back into the cache but close it. This makes sure the bad state of this connection won't cause more RPC failures. Added a retry for the CancelPlanFragment RPC. This reduces the chance that a cancel request gets lost due to an unstable network, but it can cause cancellation to take longer and makes test_lifecycle.py more flaky. The metric num-fragments-in-flight might not be back to 0 yet due to previous tests. Modified the test to check the metric delta instead of comparing to 0 to reduce flakiness. However, this might not capture some failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1.
Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2. Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/rpc/thrift-util.cc M be/src/rpc/thrift-util.h M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M be/src/util/error-util-test.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py M tests/query_test/test_lifecycle.py M tests/verifiers/metric_verifier.py 14 files changed, 237 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/16 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 16 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 15: (11 comments) http://gerrit.cloudera.org:8080/#/c/3343/15/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: Line 97: void DestroyClient(ClientKey* client_key); > needs a comment Done PS15, Line 197: hasRpcError_ > C++-style naming Done PS15, Line 239: IsTimeoutTException > should now be called IsRecvTimeoutTException() Done PS15, Line 240: RPC_TIMEOUT > should be RPC_RECV_TIMEOUT Done PS15, Line 246: if (strstr(e.what(),"unknown result") != NULL) > this seems very brittle, and at least should be wrapped in an IsXXXExceptio Done PS15, Line 268: release > back pressure is added by a blocking RPC, not released. Done PS15, Line 273: timeout > what's the unit? Done PS15, Line 272: Status DoRpcTimedWait(const F& f, const Request& request, Response* response, : const G& wait_func, uint64_t timeout, RuntimeState* state, bool* can_retry = NULL) > This seems like it breaks abstractions: not all RPCs happen in the context I agree with you. I did it this way in patch #6, but had to duplicate the wait-for-response loop for several RPC calls, so I changed it later. PS15, Line 282: bool no_timeout = timeout == 0; > this can be simplified by having a deadline variable: Done Line 290: if (!IsTimeoutTException(e)) { > Just curious, which function in recv_TransmitData() throws timeout exceptio The exception is thrown by Thrift; see TSocket.cpp read(). recv_TransmitData() will call the Thrift API to read data from the server side. PS15, Line 309: bool hasRpcError_; > the role of this variable is not clear. Please add a comment.
Done -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 15 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
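The "deadline variable" simplification the reviewer suggests for DoRpcTimedWait (with the convention that a timeout of 0 means wait indefinitely) could look roughly like the sketch below. This is an illustrative Python version, not Impala's C++; `retry_recv`, `monotonic_ms`, and the callback shape are made-up stand-ins for the real recv-retry loop.

```python
import time


def monotonic_ms():
    # Monotonic clock in milliseconds, analogous to MonotonicMillis().
    return int(time.monotonic() * 1000)


def retry_recv(recv_fn, timeout_ms, retry_interval_ms=5):
    """Retry a recv that keeps timing out until an overall deadline passes.

    recv_fn returns (ok, result); timeout_ms == 0 means wait indefinitely,
    matching the convention discussed in the review.
    """
    deadline = monotonic_ms() + timeout_ms
    while True:
        ok, result = recv_fn()
        if ok:
            return result
        # Give up only once a finite deadline has expired.
        if timeout_ms != 0 and monotonic_ms() >= deadline:
            raise TimeoutError("RPC recv timed out")
        time.sleep(retry_interval_ms / 1000.0)
```

A single `deadline` computed up front replaces the separate `no_timeout` flag: the `timeout_ms != 0` check in the loop condition is all that's needed to express "wait forever".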
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 15: (1 comment) http://gerrit.cloudera.org:8080/#/c/3343/15/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: PS15, Line 1490: // Try to send the RPC 3 times before failing. > Why try 3 times? Have you seen in your testing where there's failure on the This is to increase the chance that the cancel request reaches the remote nodes, to avoid orphan fragments. If the network is not stable, we could get a "send expire" error on the coordinator-to-remote-node connection, but the report-status callback might keep working, so the remote nodes aren't aware there is a connection issue with the coordinator. Though DoRpc() will always retry once, in a connection storm it might not be able to create a new connection on the first retry. If you are unlucky, you could get a closed connection from the cache (this could happen if CreateClient() in ClientCacheHelper::ReopenClient() fails for a previous RPC call); then the cancel request might not get a chance to be sent out. -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 15 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
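The "try the cancel RPC 3 times before failing" behaviour explained above amounts to a simple bounded retry. A hedged Python sketch, with `send_cancel` and `cancel_with_retry` as illustrative names rather than the coordinator's real functions:

```python
def cancel_with_retry(send_cancel, max_attempts=3):
    """Attempt the CancelPlanFragment RPC up to max_attempts times.

    Retrying raises the chance the cancel request reaches the remote node
    over an unstable network (avoiding orphan fragments), at the cost of
    cancellation taking longer when the network really is down.
    """
    last_err = None
    for _ in range(max_attempts):
        try:
            return send_cancel()
        except ConnectionError as err:
            # e.g. a stale/closed connection handed out by the cache, or a
            # failed reopen during a connection storm.
            last_err = err
    raise last_err
```

The trade-off named in the commit message falls out directly: the worst case multiplies the per-attempt timeout by `max_attempts`, which is why the change makes cancellation slower under failure.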
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 15: Exhaustive tests and the stress test are green. http://sandbox.jenkins.cloudera.com/view/Impala/view/Private-Utility/job/impala-private-build-and-test/3511/ http://sandbox.jenkins.cloudera.com/view/Impala/view/Stress/job/Impala-Stress-Test-Physical/615/ -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 15 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Hello Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3343 to look at the new patch set (#15). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hang issues. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In extreme cases, such as a bad network or a kernel panic on the destination host, the sender will not get a response and the RPC call will hang. A query hang is hard to detect. If the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs. For the catalog client, keep the default timeout at 0 (no timeout) because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Added a new RPC call DoRpcTimedWait() to wait longer for the receiver's response. This is needed by certain RPCs. For example, with TransmitData() by DataStreamSender, the receiver could hold the response due to back pressure. If an RPC call fails, we don't put the underlying connection back into the cache but close it. This makes sure the bad state of this connection won't cause more RPC failures. Added a retry for the CancelPlanFragment RPC. This reduces the chance that a cancel request gets lost due to an unstable network, but it can cause cancellation to take longer and makes test_lifecycle.py more flaky. The metric num-fragments-in-flight might not be back to 0 yet due to previous tests. Modified the test to check the metric delta instead of comparing to 0 to reduce flakiness. However, this might not capture some failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1.
Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2. Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py M tests/query_test/test_lifecycle.py M tests/verifiers/metric_verifier.py 11 files changed, 205 insertions(+), 30 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/15 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 15 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 14: Code-Review+1 Carry Sailesh's +1 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 14 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Hello Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3343 to look at the new patch set (#14). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hang issues. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In extreme cases, such as a bad network or a kernel panic on the destination host, the sender will not get a response and the RPC call will hang. A query hang is hard to detect. If the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs. Also fixed a bug where the reporting thread does not quit even after the query is cancelled. For the catalog client, keep the default timeout at 0 (no timeout) because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Added a new RPC call DoRpcTimedWait() to wait longer for the receiver's response. This is needed by certain RPCs. For example, with TransmitData() by DataStreamSender, the receiver could hold the response due to back pressure. If an RPC call fails, we don't put the underlying connection back into the cache but close it. This makes sure the bad state of this connection won't cause more RPC failures. Added a retry for the CancelPlanFragment RPC. This reduces the chance that a cancel request gets lost due to an unstable network, but it can cause cancellation to take longer and makes test_lifecycle.py more flaky. The metric num-fragments-in-flight might not be back to 0 yet due to previous tests. Modified the test to check the metric delta instead of comparing to 0 to reduce flakiness. However, this might not capture some failures.
Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2. Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py M tests/query_test/test_lifecycle.py M tests/verifiers/metric_verifier.py 12 files changed, 222 insertions(+), 32 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/14 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 14 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 13: (1 comment) I added a section to the commit message. Is it enough to explain the test change? "Added a retry for the CancelPlanFragment RPC. This reduces the chance that a cancel request gets lost due to an unstable network, but it can cause cancellation to take longer and makes test_lifecycle.py more flaky. The metric num-fragments-in-flight might not be back to 0 yet due to previous tests. Modified the test to check the metric delta instead of comparing to 0 to reduce flakiness. However, this might not capture some failures." http://gerrit.cloudera.org:8080/#/c/3343/13/be/src/runtime/plan-fragment-executor.cc File be/src/runtime/plan-fragment-executor.cc: PS13, Line 560: (!done_ && (closed_ || is_cancelled_)) > It would be useful to add a comment here saying that done_ should not be tr Will do. -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 13 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
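The test change described here, checking the metric's delta from a recorded baseline instead of asserting the absolute value is 0, can be sketched as below. This is an illustrative Python sketch; `MetricDeltaVerifier` and `get_metric` are hypothetical names, not the actual MetricVerifier API in tests/verifiers/metric_verifier.py.

```python
class MetricDeltaVerifier:
    """Record a metric's starting value and verify its change later.

    Comparing against a recorded baseline, instead of asserting the
    absolute value is 0, tolerates leftover in-flight fragments from
    earlier tests, which is what made the original check flaky.
    """

    def __init__(self, get_metric):
        self._get_metric = get_metric
        self._initial = get_metric()  # baseline captured before the test

    def delta(self):
        return self._get_metric() - self._initial

    def verify_delta(self, expected=0):
        actual = self.delta()
        assert actual == expected, (
            "metric delta %d != expected %d" % (actual, expected))
```

As the commit message concedes, the delta check is weaker than an absolute check: if a previous test leaked a fragment that later finishes during this test, the delta can mask a real failure.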
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new patch set (#13). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hang issues. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In extreme cases, such as a bad network or a kernel panic on the destination host, the sender will not get a response and the RPC call will hang. A query hang is hard to detect. If the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs. Also fixed a bug where the reporting thread does not quit even after the query is cancelled. For the catalog client, keep the default timeout at 0 (no timeout) because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Added a new RPC call DoRpcTimedWait() to wait longer for the receiver's response. This is needed by certain RPCs. For example, with TransmitData() by DataStreamSender, the receiver could hold the response due to back pressure. If an RPC call fails, we don't put the underlying connection back into the cache but close it. This makes sure the bad state of this connection won't cause more RPC failures. Added a retry for the CancelPlanFragment RPC. This reduces the chance that a cancel request gets lost due to an unstable network, but it can cause cancellation to take longer and makes test_lifecycle.py more flaky. The metric num-fragments-in-flight might not be back to 0 yet due to previous tests. Modified the test to check the metric delta instead of comparing to 0 to reduce flakiness. However, this might not capture some failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1.
Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2. Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py M tests/query_test/test_lifecycle.py M tests/verifiers/metric_verifier.py 12 files changed, 219 insertions(+), 32 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/13 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 13 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 12: (3 comments) http://gerrit.cloudera.org:8080/#/c/3343/12/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: PS12, Line 1494: NULL > Can't we pass runtime_state() here? We can, but I don't think we need to check it. Even if the query is cancelled, we still want to send the cancel request to the remote node. And this one has a very short timeout; the RPC won't hang for a long time like the one in TransmitData(). http://gerrit.cloudera.org:8080/#/c/3343/12/be/src/service/fragment-exec-state.cc File be/src/service/fragment-exec-state.cc: PS12, Line 120: NULL > Same here. Can't we pass executor_.runtime_state() here? See L#93; the runtime state may not have been set. And the same as the CancelRemoteFragment request, this one only has a short timeout, so it won't hang for long. PS12, Line 122: Status(res.status) > Will this be valid when '!can_retry' is true? Good catch, I'll fix it. -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 12 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new patch set (#12). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hangs. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In the extreme case of a bad network, or a destination host hitting a kernel panic, the sender will not get a response and the RPC call will hang. A query hang is hard to detect; if the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed a bug where the reporting thread does not quit even after the query is cancelled. For the catalog client, keep the default timeout of 0 (no timeout) because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Added a new RPC call DoRpcTimedWait() to wait longer for the receiver's response. This is needed by certain RPCs; for example, for TransmitData() by DataStreamSender, the receiver could hold the response due to back pressure. If an RPC call fails, we don't put the underlying connection back in the cache but close it. This is to make sure bad state of this connection won't cause more RPC failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2.
Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py M tests/query_test/test_lifecycle.py M tests/verifiers/metric_verifier.py 12 files changed, 219 insertions(+), 33 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/12 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 12 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
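For illustration, the DoRpcTimedWait() pattern described in the commit message above can be sketched roughly as follows. This is a hedged simplification: RpcStatus and the std::function parameters are hypothetical stand-ins, not the actual ImpalaBackendConnection API.

```cpp
#include <functional>

// Hypothetical status values standing in for Impala's TErrorCode/Status types.
enum class RpcStatus { OK, RPC_TIMEOUT, FAILED };

// Sketch of the DoRpcTimedWait() idea: issue the RPC under the normal (short)
// socket timeout; if only the recv side timed out, keep waiting on the
// response for a much longer bound instead of failing, because the receiver
// may be legitimately holding the response due to back pressure.
RpcStatus DoRpcTimedWait(const std::function<RpcStatus()>& do_rpc,
                         const std::function<RpcStatus()>& wait_resp) {
  RpcStatus status = do_rpc();
  if (status == RpcStatus::RPC_TIMEOUT) {
    // The send already succeeded; only the response is late, so wait longer.
    status = wait_resp();
  }
  return status;
}
```

Any other failure (or success) is returned immediately; only the recv-side timeout triggers the longer wait.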
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 11: (2 comments) http://gerrit.cloudera.org:8080/#/c/3343/11/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: PS11, Line 272: bool* can_retry = NULL > Would it be easy to just pass a 'int num_retry' here, so that the retry loo I am afraid this function would do too much that way. I prefer to leave the retry to the caller, so the caller can decide to use the same connection or a new connection, or to add some delay between retries. PS11, Line 281: //status = WaitRpcResp(wait_func, response, state, timeout); > Is this logic complete? Or will you be posting another patch? Sorry, I forgot to remove the old code. And I do need to post a new patch; I just found that RuntimeState might not always be available. -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 11 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new patch set (#11). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hangs. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In the extreme case of a bad network, or a destination host hitting a kernel panic, the sender will not get a response and the RPC call will hang. A query hang is hard to detect; if the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed a bug where the reporting thread does not quit even after the query is cancelled. For the catalog client, keep the default timeout of 0 (no timeout) because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Added a new RPC call WaitRpcResp() to wait longer for the receiver's response. This is needed by certain RPCs; for example, for TransmitData() by DataStreamSender, the receiver could hold the response due to back pressure. If an RPC call fails, we don't put the underlying connection back in the cache but close it. This is to make sure bad state of this connection won't cause more RPC failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2.
Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py M tests/query_test/test_lifecycle.py M tests/verifiers/metric_verifier.py 12 files changed, 219 insertions(+), 33 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/11 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 11 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 10: (4 comments) http://gerrit.cloudera.org:8080/#/c/3343/10/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: PS10, Line 1495: rpc_status = backend_client.WaitRpcResp( : ::recv_CancelPlanFragment, , runtime_state(), 100); > By this stage, the send() part of the RPC has already succeeded right? So, Yes, you're right. I like the idea of adding a retry flag. http://gerrit.cloudera.org:8080/#/c/3343/10/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: PS10, Line 230: Status DataStreamSender::Channel::DoRpcTimedWait(ImpalaBackendConnection* client, : const TTransmitDataParams& params, TTransmitDataResult* res, : uint64_t timeout) { : Status status = : client->DoRpc(::TransmitData, params, res); : if (status.code() == TErrorCode::RPC_TIMEOUT) { : status = client->WaitRpcResp(::recv_TransmitData, res, : runtime_state_, ONE_HOUR_IN_MS); : } : return status; : } > This would be more cleaner if done as a function in client-cache rather tha Done http://gerrit.cloudera.org:8080/#/c/3343/10/be/src/runtime/plan-fragment-executor.cc File be/src/runtime/plan-fragment-executor.cc: PS10, Line 560: if (IsCompleted()) { > What if 'done_' is true here? When done_ is set, PlanFragmentExecutor::FragmentComplete() is always called, the status is reported back to the coordinator, and the report thread is stopped. Checking done_ here could accidentally stop the report thread too soon and cause the fragment's completed status not to be sent to the coordinator. http://gerrit.cloudera.org:8080/#/c/3343/10/be/src/runtime/plan-fragment-executor.h File be/src/runtime/plan-fragment-executor.h: PS10, Line 146: IsCompleted() > This technically should not be IsCompleted() because we check if (!done_). Since it's only used once, I'll just check the vars directly instead of using a function. 
-- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new patch set (#10). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hangs. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In the extreme case of a bad network, or a destination host hitting a kernel panic, the sender will not get a response and the RPC call will hang. A query hang is hard to detect; if the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed a bug where the reporting thread does not quit even after the query is cancelled. For the catalog client, keep the default timeout of 0 (no timeout) because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Added a new RPC call WaitRpcResp() to wait longer for the receiver's response. This is needed by certain RPCs; for example, for TransmitData() by DataStreamSender, the receiver could hold the response due to back pressure. If an RPC call fails, we don't put the underlying connection back in the cache but close it. This is to make sure bad state of this connection won't cause more RPC failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2.
Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/runtime/plan-fragment-executor.h M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py 11 files changed, 206 insertions(+), 25 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/10 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 9: (7 comments) http://gerrit.cloudera.org:8080/#/c/3343/9//COMMIT_MSG Commit Message: PS9, Line 9: configable > configurable Done PS9, Line 21: keep default timeout to 0 > maybe mention in brackets that this means the timeouts are disabled. Done Line 24: > Also mention the WaitRpcResp() Done PS9, Line 25: we don't put it back to cache > "we don't put the underlying connection back in the cache" Done http://gerrit.cloudera.org:8080/#/c/3343/9/be/src/runtime/client-cache.cc File be/src/runtime/client-cache.cc: Line 147: shared_ptr client_impl; > DCHECK(*client_key != NULL) Done http://gerrit.cloudera.org:8080/#/c/3343/9/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: PS9, Line 306: Status status = client.DoRpc(::TransmitData, params, ); : if (status.code() == TErrorCode::RPC_TIMEOUT) { : status = client.WaitRpcResp(::recv_TransmitData, , : runtime_state_, ONE_HOUR_IN_MS); : } : rpc_status_ = status; > This code seems to be reused in many places. Maybe you can create a wrapper Done http://gerrit.cloudera.org:8080/#/c/3343/9/be/src/runtime/plan-fragment-executor.cc File be/src/runtime/plan-fragment-executor.cc: PS9, Line 456: Note > "Note:" Done -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 9 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/3343/9/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: PS9, Line 216: ONE_HOUR_IN_MS > How was one hour chosen? I'm worried this might cause lot of queries to han Yes, that could happen, but we don't want a very short timeout here. An upstream operator could sometimes take a very long time, and I don't know how long it could be. This is like a last means of detecting the peer node's failure. In most cases, the destination node's failure will be detected by the statestore heartbeat and the query will be cancelled before hitting this timeout. The RPC_TIMEOUT error is checked by IsTimeoutTException(), only if the error contains "EAGAIN (timed out)"; it happens only on TSocket::read(). -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 9 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new patch set (#9). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hangs. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In the extreme case of a bad network, or a destination host hitting a kernel panic, the sender will not get a response and the RPC call will hang. A query hang is hard to detect; if the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed a bug where the reporting thread does not quit even after the query is cancelled. For the catalog client, keep the default timeout of 0 because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. If an RPC call fails, we don't put it back in the cache but close it. This is to make sure bad state in this connection won't cause more RPC failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2.
Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/runtime/plan-fragment-executor.h M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py 11 files changed, 200 insertions(+), 24 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/9 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 9 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new patch set (#8). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hangs. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In the extreme case of a bad network, or a destination host hitting a kernel panic, the sender will not get a response and the RPC call will hang. A query hang is hard to detect; if the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed a bug where the reporting thread does not quit even after the query is cancelled. For the catalog client, keep the default timeout of 0 because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. If an RPC call fails, we don't put it back in the cache but close it. This is to make sure bad state in this connection won't cause more RPC failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2.
Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-mgr.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/runtime/plan-fragment-executor.h M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py 12 files changed, 204 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/8 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 8 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new patch set (#7). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hangs. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In the extreme case of a bad network, or a destination host hitting a kernel panic, the sender will not get a response and the RPC call will hang. A query hang is hard to detect; if the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed a bug where the reporting thread does not quit even after the query is cancelled. For the catalog client, keep the default timeout of 0 because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. If an RPC call fails, we don't put it back in the cache but close it. This is to make sure bad state in this connection won't cause more RPC failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2.
Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-mgr.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/runtime/plan-fragment-executor.h M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py 12 files changed, 205 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/7 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 7 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new patch set (#6). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hangs. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In the extreme case of a bad network, or a destination host hitting a kernel panic, the sender will not get a response and the RPC call will hang. A query hang is hard to detect; if the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed a bug where the reporting thread does not quit even after the query is cancelled. For the catalog client, keep the default timeout of 0 because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. If an RPC call fails, we don't put it back in the cache but close it. This is to make sure bad state in this connection won't cause more RPC failures. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2.
Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-mgr.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/runtime/plan-fragment-executor.h M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py 12 files changed, 203 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/6 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 6 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 5: (4 comments) http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/runtime/backend-client.h File be/src/runtime/backend-client.h: PS5, Line 48: // Consider back pressure, we should wait for receiver side as long as possible : // but not forever. Also we need to make sure Cancel() can stop the waiting. > I preferred the old way where you could set per-RPC timeouts. You can combi I have the same concern as Sailesh; there might be a performance impact if we set the timeout per RPC. Also, maintaining a different timeout for each type of RPC could be hard. I agree that moving the retry to the client is better. http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: PS5, Line 240: Reopen() This could cause lots of issues. Without checking the exception details, we don't know if the exception came from socket send() or recv(); blindly retrying in either case could send the request multiple times. For example, ExecutePlanFragment() called twice for the same fragment, the same row batch sent twice, etc. http://gerrit.cloudera.org:8080/#/c/3343/5/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: PS5, Line 1490: // Try to send the RPC 3 times before failing. : for (int i = 0; i < 3; ++i) { > This always sends the RPC three times if it succeeds every time. You're right, I should check if the status is OK. PS5, Line 1494: if (rpc_status.code() == TErrorCode::RPC_TIMEOUT) break; > If you abort after a timeout error, what's the rationale for sending the RP IsTimeoutTException() checks for a specific timeout that can only happen at socket recv(), which means send() has already completed. The remote node got the cancel request, so there is no need to retry. 
-- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
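The cancel-RPC retry loop discussed in this patch set can be sketched as below. This is a hypothetical simplification (RpcStatus and SendWithRetry are illustrative names, not the real client-cache API) combining both review points: stop retrying as soon as the send succeeds, and also stop on RPC_TIMEOUT, since that error can only come from recv(), meaning the request was already delivered.

```cpp
#include <functional>

// Hypothetical status values; the real code uses Impala's Status/TErrorCode.
enum class RpcStatus { OK, RPC_TIMEOUT, FAILED };

// Retry a failed send a bounded number of times. Stop on success so we never
// resend a request that went through, and stop on RPC_TIMEOUT because that
// error only occurs on recv(), i.e. the remote node already got the request
// and resending would deliver it twice.
RpcStatus SendWithRetry(const std::function<RpcStatus()>& rpc,
                        int max_attempts = 3) {
  RpcStatus status = RpcStatus::FAILED;
  for (int i = 0; i < max_attempts; ++i) {
    status = rpc();
    if (status == RpcStatus::OK) break;           // success: don't resend
    if (status == RpcStatus::RPC_TIMEOUT) break;  // send already delivered
    // Otherwise (send-side failure): the caller would reopen the connection
    // and try again; omitted in this sketch.
  }
  return status;
}
```

This avoids the bug noted above where the loop sent the RPC three times even after it succeeded.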
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new patch set (#5). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout This patch adds a configurable timeout for all backend client RPC calls to avoid query hangs. Impala doesn't set a socket send/recv timeout for the backend client, so RPC calls will wait forever for data. In the extreme case of a bad network, or a destination host hitting a kernel panic, the sender will not get a response and the RPC call will hang. A query hang is hard to detect; if the hang happens at ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added a send/recv timeout to all RPC calls to avoid query hangs, and fixed a bug where the reporting thread does not quit even after the query is cancelled. For the catalog client, keep the default timeout of 0 because ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Besides the new EE test, I used the following iptables rules to inject network failures and make sure RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2.
Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/backend-client.h M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/data-stream-sender.cc M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/runtime/plan-fragment-executor.h M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py 11 files changed, 164 insertions(+), 23 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/5 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 4: (7 comments) http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/client-cache.cc File be/src/runtime/client-cache.cc: Line 125: void ClientCacheHelper::ReleaseClient(ClientKey* client_key) { > I think it would be better to reset the timeouts to the default values when Good idea. I am still running tests to see if we need different timeouts for different RPC calls. http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: PS4, Line 101: SetRpcTimeout > Would this be clearer as SetClientRpcTimeout() here and below? On first loo agreed PS4, Line 239: SetRpcTimeout(send_timeout, recv_timeout); > I'm a little worried about calling this every time we call a DoRpc() since You're right, setting the timeout per client makes sense. http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/runtime/plan-fragment-executor.cc File be/src/runtime/plan-fragment-executor.cc: PS4, Line 556: is_cancelled_ > Can this also be IsCompleted()? Is it a possibility that there is a sequenc StopReportThread is called if the fragment is completed, or a cancel happened during GetNext(). If an internal error happens at another time, it usually triggers internal cancellation and Cancel() will be called. But I agree, IsCompleted() seems a safer check here. Line 559: UpdateStatus(Status(TErrorCode::CANCELLED, "Fragment already cancelled.")); > Should this be an error? Or is it better to just return since the user will I don't think this should be treated as an error. The only issue is that the report thread keeps running and sending status to the coordinator, which is confusing. I'll just call StopReportThread().
http://gerrit.cloudera.org:8080/#/c/3343/4/be/src/service/fragment-exec-state.cc File be/src/service/fragment-exec-state.cc: PS4, Line 128: cause problem > If we know exactly what sort of problems it could cause, it would be worth UpdateFragmentExecStatus() will decrease the num_remaining_fragment_instances_ counter; calling it multiple times for a completed fragment will trigger a DCHECK. I'll add that to the comments. http://gerrit.cloudera.org:8080/#/c/3343/4/tests/custom_cluster/test_rpc_timeout.py File tests/custom_cluster/test_rpc_timeout.py: PS4, Line 39: assert "RPC" in str(ex) > 'ex' should contain "RPC timed out" right? This would pass for "RPC Error" We used to capture only one specific timeout case, and try to reopen the client for all other errors. I am not sure that's the correct way, so I modified the exception handling. But now I see more RPC error cases than before. Maybe I should roll back the exception-handling change. -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 4 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
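The reporting-thread bug discussed above (a thread that keeps sending status after cancellation) has a standard shape. A hedged sketch with hypothetical names, not Impala's actual code, showing a periodic report loop that exits promptly once the fragment completes or is cancelled:

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

class StatusReporter {
 public:
  // Runs in the report thread; sends one report per interval until stopped.
  void Run(std::chrono::milliseconds interval) {
    std::unique_lock<std::mutex> l(mu_);
    while (!done_) {
      ++reports_sent_;  // stand-in for sending status to the coordinator
      // Wakes up early when Stop() is called, so shutdown is prompt.
      cv_.wait_for(l, interval, [this] { return done_; });
    }
  }

  // Called on completion or cancellation (the StopReportThread() analogue).
  void Stop() {
    {
      std::lock_guard<std::mutex> l(mu_);
      done_ = true;
    }
    cv_.notify_all();
  }

  int reports_sent() {
    std::lock_guard<std::mutex> l(mu_);
    return reports_sent_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool done_ = false;
  int reports_sent_ = 0;
};
```

The key point matching the review thread: the loop's exit condition must cover both cancellation and normal completion, otherwise the thread outlives the query and confuses the coordinator.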
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has posted comments on this change. Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/3343/1/be/src/runtime/exec-env.cc File be/src/runtime/exec-env.cc: PS1, Line 140: DEFINE_int32(catalog_client_send_timeout_ms, 0, > will use a single one and will test if this affect large catalog, or a tabl We do need a significantly larger timeout for the catalog connection. When executing a DDL query, if the table has many partitions, the HMS API call could take a very long time, e.g. 10K partitions could take 100s+; for 100K, it could take an hour. -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-HasComments: Yes
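The per-client-type timeout policy discussed above can be summarized in a few lines. A hedged sketch with hypothetical names (not Impala's actual flags or code): backend clients get a finite timeout to avoid query hangs, while the catalog client keeps 0, i.e. no socket timeout, since DDL against a heavily partitioned table can legitimately block in HMS calls for minutes to hours.

```cpp
#include <cassert>

enum class ClientType { kBackend, kCatalog };

// Returns the send/recv timeout to apply to a client of the given type.
int RpcTimeoutMs(ClientType type, int backend_timeout_ms) {
  switch (type) {
    case ClientType::kBackend:
      return backend_timeout_ms;  // finite, so hung RPCs eventually fail
    case ClientType::kCatalog:
      return 0;  // 0 = wait indefinitely; ExecDdl() can run for a very long time
  }
  return 0;
}
```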
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new patch set (#4). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout Impala doesn't set socket send/recv timeouts. RPC calls will wait forever for data. In extreme cases, such as a network failure or a kernel panic on the destination host, the sender will never get a response and the RPC call will hang. A hung query is hard to detect; if the hang happens in ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added send/recv timeouts to all RPC calls to avoid query hangs, and fixed a bug where the reporting thread does not quit even after the query is cancelled. For the catalog client, keep the default timeout of 0; ExecDdl() could take a very long time if a table has many partitions, mainly waiting for the HMS API call. Besides the new EE test, I used the following iptables rules to inject network failures and verify that RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2.
Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/runtime/plan-fragment-executor.h M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py 9 files changed, 147 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/4 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 4 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new patch set (#3). Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout Impala doesn't set socket send/recv timeouts. RPC calls will wait forever for data. In extreme cases, such as a network failure or a kernel panic on the destination host, the sender will never get a response and the RPC call will hang. A hung query is hard to detect; if the hang happens in ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added send/recv timeouts to all RPC calls to avoid query hangs, and fixed a bug where the reporting thread does not quit even after the query is cancelled. Besides the new EE test, I used the following iptables rules to inject network failures and verify that RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2. Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/runtime/plan-fragment-executor.h M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py A tests/custom_cluster/test_rpc_timeout.py 9 files changed, 144 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/3 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 3 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Henry Robinson
<he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3575: Add retry to backend connection request and rpc timeout
Juan Yu has uploaded a new change for review. http://gerrit.cloudera.org:8080/3343 Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout .. IMPALA-3575: Add retry to backend connection request and rpc timeout Impala doesn't set socket send/recv timeouts. RPC calls will wait forever for data. In extreme cases, such as a network failure or a kernel panic on the destination host, the sender will never get a response and the RPC call will hang. A hung query is hard to detect; if the hang happens in ExecRemoteFragment() or CancelPlanFragments(), the query cannot be cancelled unless you restart the coordinator. Added send/recv timeouts to all RPC calls to avoid query hangs, and fixed a bug where the reporting thread does not quit even after the query is cancelled. Besides the new EE test, I used the following iptables rules to inject network failures and verify that RPC calls never hang. 1. Block network traffic on a port completely: iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP 2. Randomly drop 5% of TCP packets to slow down the network: iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 --- M be/src/runtime/client-cache.cc M be/src/runtime/client-cache.h M be/src/runtime/exec-env.cc M be/src/runtime/plan-fragment-executor.cc M be/src/runtime/plan-fragment-executor.h M be/src/service/fragment-exec-state.cc M be/src/statestore/statestore.cc M common/thrift/generate_error_codes.py 8 files changed, 99 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/3343/2 -- To view, visit http://gerrit.cloudera.org:8080/3343 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964 Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Alan Choi <a...@cloudera.com> Gerrit-Reviewer: Henry Robinson
<he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
Juan Yu has submitted this change and it was merged. Change subject: IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. .. IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. On EC2 machines, the timer error could be a little larger; increase MAX_TIMER_ERROR_NS to 15ms to avoid flaky test failures. Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f Reviewed-on: http://gerrit.cloudera.org:8080/2838 Reviewed-by: Juan Yu <j...@cloudera.com> Tested-by: Internal Jenkins (cherry picked from commit 7797033f8c627d5df6ba477b78dec5347b7e6caa) Reviewed-on: http://gerrit.cloudera.org:8080/2898 Tested-by: Juan Yu <j...@cloudera.com> --- M be/src/util/runtime-profile-test.cc 1 file changed, 10 insertions(+), 10 deletions(-) Approvals: Juan Yu: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/2898 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-2.5.0_5.7.x Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
Juan Yu has posted comments on this change. Change subject: IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. .. Patch Set 1: Code-Review+2 Verified+1 Clean cherry-pick, verified by http://sandbox.jenkins.cloudera.com/view/Impala/view/Private-Utility/job/impala-private-build-and-test/3002/ -- To view, visit http://gerrit.cloudera.org:8080/2898 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-2.5.0_5.7.x Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-HasComments: No
[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has submitted this change and it was merged. Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. IMPALA-2076: Correct execution time tracking for DataStreamSender. DataStreamSender uses multiple channels (each channel uses one thread) to send data. We want to know the active wall-clock network time used by DataStreamSender no matter how many threads are used, so we only count concurrent sending time and row batch serialization time. We don't want to count recv_TransmitData() time because it is mainly waiting for the receiver to deserialize and process data, not using much network, plus those parts are already tracked by the Exchange node. Added a ConcurrentStopWatch class to get the concurrent transmit time. If a thread is already running, a following thread won't reset the stopwatch, and the stopwatch is stopped only after all threads finish their work. Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Reviewed-on: http://gerrit.cloudera.org:8080/2578 Reviewed-by: Juan Yu <j...@cloudera.com> Reviewed-by: Dan Hecht <dhe...@cloudera.com> Tested-by: Internal Jenkins (cherry picked from commit 18434bae383acb38c4a24e4a8d7484f6a173f354) Reviewed-on: http://gerrit.cloudera.org:8080/2897 Tested-by: Juan Yu <j...@cloudera.com> --- A be/src/runtime/backend-client.h M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/runtime/data-stream-sender.cc M be/src/runtime/data-stream-sender.h M be/src/runtime/data-stream-test.cc M be/src/runtime/exec-env.cc M be/src/runtime/exec-env.h M be/src/runtime/runtime-filter.cc M be/src/runtime/runtime-state.h M be/src/service/fragment-exec-state.cc M be/src/service/fragment-exec-state.h M be/src/util/runtime-profile-test.cc M be/src/util/runtime-profile.cc M be/src/util/runtime-profile.h M be/src/util/stopwatch.h 17 files changed, 371 insertions(+), 44 deletions(-) Approvals: Juan Yu: Looks good to me, approved; Verified -- To view, visit
http://gerrit.cloudera.org:8080/2897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-2.5.0_5.7.x Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Juan Yu <j...@cloudera.com>
[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Hello Internal Jenkins, Dan Hecht, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/2897 to review the following change. Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. IMPALA-2076: Correct execution time tracking for DataStreamSender. DataStreamSender uses multiple channels (each channel uses one thread) to send data. We want to know the active wall-clock network time used by DataStreamSender no matter how many threads are used, so we only count concurrent sending time and row batch serialization time. We don't want to count recv_TransmitData() time because it is mainly waiting for the receiver to deserialize and process data, not using much network, plus those parts are already tracked by the Exchange node. Added a ConcurrentStopWatch class to get the concurrent transmit time. If a thread is already running, a following thread won't reset the stopwatch, and the stopwatch is stopped only after all threads finish their work.
Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Reviewed-on: http://gerrit.cloudera.org:8080/2578 Reviewed-by: Juan Yu <j...@cloudera.com> Reviewed-by: Dan Hecht <dhe...@cloudera.com> Tested-by: Internal Jenkins (cherry picked from commit 18434bae383acb38c4a24e4a8d7484f6a173f354) --- A be/src/runtime/backend-client.h M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/runtime/data-stream-sender.cc M be/src/runtime/data-stream-sender.h M be/src/runtime/data-stream-test.cc M be/src/runtime/exec-env.cc M be/src/runtime/exec-env.h M be/src/runtime/runtime-filter.cc M be/src/runtime/runtime-state.h M be/src/service/fragment-exec-state.cc M be/src/service/fragment-exec-state.h M be/src/util/runtime-profile-test.cc M be/src/util/runtime-profile.cc M be/src/util/runtime-profile.h M be/src/util/stopwatch.h 17 files changed, 371 insertions(+), 44 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/97/2897/1 -- To view, visit http://gerrit.cloudera.org:8080/2897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-2.5.0_5.7.x Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Internal Jenkins
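The ConcurrentStopWatch behavior described in the commit message (first thread starts the watch, last thread stops it, overlapping threads counted once) can be sketched as follows. This is a hedged illustration of the idea, not Impala's actual implementation in be/src/util/stopwatch.h:

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

class ConcurrentStopWatch {
 public:
  void Start() {
    std::lock_guard<std::mutex> l(mu_);
    // Only the first thread to start begins timing; later overlapping
    // threads find the watch already running and don't reset it.
    if (busy_threads_++ == 0) start_ = std::chrono::steady_clock::now();
  }

  void Stop() {
    std::lock_guard<std::mutex> l(mu_);
    // The watch stops only when the last busy thread finishes, so the
    // overlapping interval is accumulated exactly once.
    if (--busy_threads_ == 0) {
      total_ += std::chrono::steady_clock::now() - start_;
    }
  }

  std::chrono::nanoseconds TotalRunningTime() {
    std::lock_guard<std::mutex> l(mu_);
    return total_;
  }

 private:
  std::mutex mu_;
  int busy_threads_ = 0;
  std::chrono::steady_clock::time_point start_;
  std::chrono::nanoseconds total_{0};
};
```

This gives the "active wall-clock network time" the commit message describes: N channels sending concurrently for one second contribute one second, not N.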
[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. Patch Set 1: Code-Review+2 Verified+1 Clean cherry-pick, verified by http://sandbox.jenkins.cloudera.com/view/Impala/view/Private-Utility/job/impala-private-build-and-test/3002/ -- To view, visit http://gerrit.cloudera.org:8080/2897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-2.5.0_5.7.x Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-HasComments: No
[Impala-CR](cdh5-2.5.0_5.7.x) IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
Hello Internal Jenkins, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/2898 to review the following change. Change subject: IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. .. IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. On EC2 machines, the timer error could be a little larger; increase MAX_TIMER_ERROR_NS to 15ms to avoid flaky test failures. Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f Reviewed-on: http://gerrit.cloudera.org:8080/2838 Reviewed-by: Juan Yu <j...@cloudera.com> Tested-by: Internal Jenkins (cherry picked from commit 7797033f8c627d5df6ba477b78dec5347b7e6caa) --- M be/src/util/runtime-profile-test.cc 1 file changed, 10 insertions(+), 10 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/98/2898/1 -- To view, visit http://gerrit.cloudera.org:8080/2898 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-2.5.0_5.7.x Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Internal Jenkins
Re: Impala query memory limit exceeded
Besides adding the join condition, could you also check the scratch directory permissions? The memory consumption is much lower than the mem limit when it throws OOM; I suspect the spill fails. On Apr 27, 2016 10:05 PM, "Pradeep Nayak" wrote: > I will see if it works with specifying the join condition. > > On Wed, Apr 27, 2016 at 2:28 PM Tim Armstrong > wrote: > >> You need to specify a join condition. I.e. table_a join table_b on >> table_a.key = table_2.key >> >> table_a join table_b computes the cross-product of the two tables. >> >> You can confirm by running explain: >> >> explain select count(*) from tab3_test_new join tab4_test_new >> >> >> It will show a plan like this with a "CROSS JOIN" in it. >> >> [tarmstrong-box.ca.cloudera.com:21000] > explain select * from >> functional.alltypes join functional.alltypes t2; >> Query: explain select * from functional.alltypes join functional.alltypes >> t2 >> +---+ >> | Explain String| >> +---+ >> | Estimated Per-Host Requirements: Memory=320.68MB VCores=2 | >> | | >> | 04:EXCHANGE [UNPARTITIONED] | >> | | | >> | 02:NESTED LOOP JOIN [CROSS JOIN, BROADCAST] | >> | | | >> | |--03:EXCHANGE [BROADCAST]| >> | | | | >> | | 01:SCAN HDFS [functional.alltypes t2] | >> | | partitions=24/24 files=24 size=478.45KB | >> | | | >> | 00:SCAN HDFS [functional.alltypes]| >> |partitions=24/24 files=24 size=478.45KB| >> +---+ >> >> >> On Wed, Apr 27, 2016 at 2:21 PM, Pradeep Nayak >> wrote: >> >>> 40 CPU and 156GB RAM per nodemanager. >>> We are trying to do some validation on Impala. We are using a 5 node >>> hadoop >>> cluster m4.10x large machines on the aws. These were setup with cloudera >>> manager. Each node: 40 cpus' and 156GB ram. >>> >>> We are trying to do a count(*) on two tables with 14 million rows each. >>> When do a count(*) individually on each of these nodes, then it works >>> fine. >>> However we try to do a join on these two tables and do a count(*) we hit >>> the problem below saying memory limit exceeded.
>>> >>> Any inputs here are appreciated ? Should we need a bigger cluster for >>> this >>> ? How do we decide the size of the cluster. >>> >>> >>> >>> >>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new; >>> >>> Query: select count(*) from tab3_test_new >>> >>> +--+ >>> >>> | count(*) | >>> >>> +--+ >>> >>> | 14216336 | >>> >>> +--+ >>> >>> Fetched 1 row(s) in 8.29s >>> >>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab4_test_new; >>> >>> Query: select count(*) from tab4_test_new >>> >>> +--+ >>> >>> | count(*) | >>> >>> +--+ >>> >>> | 14987634 | >>> >>> +--+ >>> >>> Fetched 1 row(s) in 11.46s >>> >>> [ip-172-30-1-57.ec2.internal:21000] > >>> >>> ip-172-30-1-57.ec2.internal:21000] > set mem_limit=64g; >>> >>> MEM_LIMIT set to 64g >>> >>> [ip-172-30-1-57.ec2.internal:21000] > select count(*) from tab3_test_new >>> join tab4_test_new; >>> >>> Query: select count(*) from tab3_test_new join tab4_test_new >>> >>> WARNINGS: >>> >>> Memory Limit Exceeded >>> >>> Query(b44dde4886ccafd2:cfb6b9ffcdc4e3a6) Limit: Limit=64.00 GB >>> Consumption=80.77 MB >>> >>> Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a7: Consumption=12.00 KB >>> >>> AGGREGATION_NODE (id=6): Consumption=4.00 KB >>> >>> EXCHANGE_NODE (id=5): Consumption=0 >>> >>> DataStreamRecvr: Consumption=0 >>> >>> Block Manager: Limit=156.00 MB Consumption=0 >>> >>> Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a8: Consumption=64.62 MB >>> >>> AGGREGATION_NODE (id=3): Consumption=4.00 KB >>> >>> NESTED_LOOP_JOIN_NODE (id=2): Consumption=64.06 MB >>> >>> HDFS_SCAN_NODE (id=0): Consumption=0 >>> >>> EXCHANGE_NODE (id=4): Consumption=0 >>> >>> DataStreamRecvr: Consumption=544.00 KB >>> >>> DataStreamSender: Consumption=16.00 KB >>> >>> Fragment b44dde4886ccafd2:cfb6b9ffcdc4e3a9: Consumption=16.14 MB >>> >>> HDFS_SCAN_NODE (id=1): Consumption=16.01 MB >>> >>> DataStreamSender: Consumption=128.00 KB >>> >>> WARNING: The following tables are missing relevant table and/or column >>> statistics. 
>>> >>> test_db.tab3_test_new,test_db.tab4_test_new >>> >>> >>> >>> [ip-172-30-1-57.ec2.internal:21000] > >>> >> >> -- >> You received this message because you are subscribed to the Google
[Impala-CR](cdh5-trunk) IMPALA-2660: Respect auth_to_local configs from hdfs configs
Juan Yu has posted comments on this change. Change subject: IMPALA-2660: Respect auth_to_local configs from hdfs configs .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/2800/3/fe/src/main/java/com/cloudera/impala/authorization/AuthorizationChecker.java File fe/src/main/java/com/cloudera/impala/authorization/AuthorizationChecker.java: Line 201: } > I don't think so, we are re-throwing the wrapped exception inside catch(), You're right, at this point, there must be an exception. -- To view, visit http://gerrit.cloudera.org:8080/2800 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I76485b83c14ba26f6fce66e5f83e8014667829e0 Gerrit-PatchSet: 3 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
Juan Yu has posted comments on this change. Change subject: IMPALA-3396: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. .. Patch Set 4: Code-Review+2 Added the JIRA number to the commit message; carrying Tim's +2. -- To view, visit http://gerrit.cloudera.org:8080/2838 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f Gerrit-PatchSet: 4 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-2660: Respect auth_to_local configs from hdfs configs
Juan Yu has posted comments on this change. Change subject: IMPALA-2660: Respect auth_to_local configs from hdfs configs .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/2800/3/fe/src/main/java/com/cloudera/impala/authorization/AuthorizationChecker.java File fe/src/main/java/com/cloudera/impala/authorization/AuthorizationChecker.java: Line 201: } don't we need to return false at the end? -- To view, visit http://gerrit.cloudera.org:8080/2800 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I76485b83c14ba26f6fce66e5f83e8014667829e0 Gerrit-PatchSet: 3 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
Juan Yu has uploaded a new patch set (#3). Change subject: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. .. Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. On EC2 machines, the timer error could be a little larger; increase MAX_TIMER_ERROR_NS to 15ms to avoid flaky test failures. Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f --- M be/src/util/runtime-profile-test.cc 1 file changed, 10 insertions(+), 10 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/2838/3 -- To view, visit http://gerrit.cloudera.org:8080/2838 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f Gerrit-PatchSet: 3 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
[Impala-CR](cdh5-trunk) Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
Juan Yu has posted comments on this change. Change subject: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/2838/2/be/src/util/runtime-profile-test.cc File be/src/util/runtime-profile-test.cc: Line 614: if (!(stopwatch_value >= expected_value - TimerCounterTest::MAX_TIMER_ERROR_NS && > If you use EXPECT_LE and EXPECT_GE to check the conditions it will log the How could I forget this. Thanks, Tim. -- To view, visit http://gerrit.cloudera.org:8080/2838 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
Juan Yu has posted comments on this change. Change subject: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/2838/2/be/src/util/runtime-profile-test.cc File be/src/util/runtime-profile-test.cc: Line 614: if (!(stopwatch_value >= expected_value - TimerCounterTest::MAX_TIMER_ERROR_NS && Log the timer error if it is larger than TimerCounterTest::MAX_TIMER_ERROR_NS, so that if the test fails again we can check whether it is within the valid range. -- To view, visit http://gerrit.cloudera.org:8080/2838 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-HasComments: Yes
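The bounds check being discussed can be isolated as follows. A hedged sketch with a hypothetical helper, not the actual test code: verify a measured timer value is within MAX_TIMER_ERROR_NS of the expected value, and log the observed error when it is out of range so a future failure shows how far off the timer was.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

constexpr int64_t MAX_TIMER_ERROR_NS = 15LL * 1000 * 1000;  // 15 ms

bool WithinTimerError(int64_t actual_ns, int64_t expected_ns) {
  int64_t error_ns = actual_ns - expected_ns;
  if (error_ns < -MAX_TIMER_ERROR_NS || error_ns > MAX_TIMER_ERROR_NS) {
    // Logging the magnitude makes a future flaky failure diagnosable.
    std::fprintf(stderr, "timer error %lld ns exceeds limit %lld ns\n",
                 static_cast<long long>(error_ns),
                 static_cast<long long>(MAX_TIMER_ERROR_NS));
    return false;
  }
  return true;
}
```

As Tim's comment notes, gtest's EXPECT_GE/EXPECT_LE achieve the same diagnostic effect more directly, since those macros print both operands automatically on failure.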
[Impala-CR](cdh5-trunk) Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
Juan Yu has uploaded a new patch set (#2). Change subject: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. .. Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. On EC2 machines, the timer error could be a little larger; increase MAX_TIMER_ERROR_NS to 15ms to avoid flaky test failures. Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f --- M be/src/util/runtime-profile-test.cc 1 file changed, 12 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/2838/2 -- To view, visit http://gerrit.cloudera.org:8080/2838 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com>
[Impala-CR](cdh5-trunk) Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure.
Juan Yu has uploaded a new change for review. http://gerrit.cloudera.org:8080/2838 Change subject: Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. .. Fix ConcurrentTimerCounter unit test "TimerCounterTest" failure. On EC2 machines, the timer error could be a little larger; increase MAX_TIMER_ERROR_NS to 15ms to avoid flaky test failures. Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f --- M be/src/util/runtime-profile-test.cc 1 file changed, 10 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/2838/1 -- To view, visit http://gerrit.cloudera.org:8080/2838 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I807a774da27ff6215f790ec44c9706fa6f95639f Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3382: Long warnings cause impala-shell to stall
Juan Yu has posted comments on this change. Change subject: IMPALA-3382: Long warnings cause impala-shell to stall .. Patch Set 1: (4 comments) http://gerrit.cloudera.org:8080/#/c/2821/1//COMMIT_MSG Commit Message: Line 7: stall stale? Could you add some explanation about why the long warnings cause the shell to stale? http://gerrit.cloudera.org:8080/#/c/2821/1/shell/impala_client.py File shell/impala_client.py: Line 459: 1 use a constant for this Line 460: return "WARNINGS: %s \n(truncated, please use --display_full_warning to display more)" % log[0:1] long line http://gerrit.cloudera.org:8080/#/c/2821/1/shell/option_parser.py File shell/option_parser.py: Line 111: store_true We want the option to default to false, right? -- To view, visit http://gerrit.cloudera.org:8080/2821 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5d4ea7fc3e82f11151c170a9835ca9fcbbb3a94a Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Marcell Szabo <sz...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. Patch Set 8: Code-Review+1 Carry previous +1 -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 8 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Hello Henry Robinson, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2578 to look at the new patch set (#8). Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. IMPALA-2076: Correct execution time tracking for DataStreamSender. DataStreamSender uses multiple channels (each channel uses one thread) to send data. We want to know the active wall-clock network time used by DataStreamSender no matter how many threads are used, so we count only concurrent sending time and row batch serialization time. We don't want to count recv_TransmitData() time because it mainly waits for the receiver to deserialize and process data, uses little network, and those parts are already tracked by the Exchange node. Added a ConcurrentStopWatch class to get concurrent transmit time. If a thread is already running, subsequent threads won't reset the stop watch, and the stop watch is stopped only after all threads finish their work. Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 --- A be/src/runtime/backend-client.h M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/runtime/data-stream-sender.cc M be/src/runtime/data-stream-sender.h M be/src/runtime/data-stream-test.cc M be/src/runtime/exec-env.cc M be/src/runtime/exec-env.h M be/src/runtime/runtime-filter.cc M be/src/runtime/runtime-state.h M be/src/service/fragment-exec-state.cc M be/src/service/fragment-exec-state.h M be/src/util/runtime-profile-test.cc M be/src/util/runtime-profile.cc M be/src/util/runtime-profile.h M be/src/util/stopwatch.h 17 files changed, 371 insertions(+), 44 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/8 -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 8 Gerrit-Project: 
Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. Patch Set 6: (2 comments) http://gerrit.cloudera.org:8080/#/c/2578/6//COMMIT_MSG Commit Message: Line 10: time > network time Done http://gerrit.cloudera.org:8080/#/c/2578/6/be/src/runtime/backend-client.h File be/src/runtime/backend-client.h: Line 46: recv_TransmitData > The commit message wasn't clear to me, this is much clearer, thanks. How ab Done -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 6 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/2578/6/be/src/runtime/backend-client.h File be/src/runtime/backend-client.h: Line 40: if (transmit_csw_ != NULL) { > Oh yeah, I forgot we overload TransmitData with the eos. Okay, then you ca Yeah, we should add a separate one for eos; then TransmitData() will be just for network data sending, and we should always track the time. -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 6 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/2578/6/be/src/runtime/backend-client.h File be/src/runtime/backend-client.h: Line 40: if (transmit_csw_ != NULL) { > Do you mean the SendCurrentBatch() there? FYI, the one in the unit test is also for sending the eos flag, and it's for the receiver test, not related to DataStreamSender at all. Those ImpalaBackendClient thrift send/recv functions could be used for different purposes. I think we should let the caller decide whether they want to track the time. -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 6 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Hello Henry Robinson, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2578 to look at the new patch set (#7). Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. IMPALA-2076: Correct execution time tracking for DataStreamSender. DataStreamSender uses multiple channels (each channel uses one thread) to send data. We want to know the active wall-clock time used by DataStreamSender no matter how many threads are used so we only count concurrent sending time and row batch serialization time. We don't want to count recv_TransmitData() time because it is mainly to wait for receiver to deserialize data and process data, not using much network, plus those part are already tracked by Exchange node. Added a ConcurrentStopWatch class to get concurrent transmit time. If a thread is already running, the following thread won't reset the stop watch. And the stop watch is stopped only after all threads finish their work. Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 --- A be/src/runtime/backend-client.h M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/runtime/data-stream-sender.cc M be/src/runtime/data-stream-sender.h M be/src/runtime/data-stream-test.cc M be/src/runtime/exec-env.cc M be/src/runtime/exec-env.h M be/src/runtime/runtime-filter.cc M be/src/runtime/runtime-state.h M be/src/service/fragment-exec-state.cc M be/src/service/fragment-exec-state.h M be/src/util/runtime-profile-test.cc M be/src/util/runtime-profile.cc M be/src/util/runtime-profile.h M be/src/util/stopwatch.h 17 files changed, 370 insertions(+), 44 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/7 -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 7 Gerrit-Project: Impala 
Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. Patch Set 6: (2 comments) http://gerrit.cloudera.org:8080/#/c/2578/6/be/src/runtime/backend-client.h File be/src/runtime/backend-client.h: Line 40: if (transmit_csw_ != NULL) { > Are there any cases where we want to do TransmitData() but not set the csw? A couple of cases: 1. Status DataStreamSender::Channel::FlushAndSendEos(RuntimeState* state) — this is not sending any data, just the eos flag. 2. The unit test; I could add a counter for the data stream test. Line 46: recv_TransmitData > why don't we care about recv time? I explained this earlier; it's also mentioned in the commit message. We only want to count the send_TransmitData() timing because that's the main job of DataStreamSender and that's where the network traffic is spent. recv_TransmitData() mainly waits for the receiver to deserialize and process data, uses little network, and those parts are already tracked by the Exchange node. -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 6 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has uploaded a new patch set (#6). Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. IMPALA-2076: Correct execution time tracking for DataStreamSender. DataStreamSender uses multiple channels (each channel uses one thread) to send data. We want to know the active wall-clock time used by DataStreamSender no matter how many threads are used so ignore receiver waiting time and only count concurrent sending time and batch serialization time. Added a ConcurrentStopWatch class to get concurrent transmit time. If a thread is already running, the following thread won't reset the stop watch. And the stop watch is stopped only after all threads finish their work. Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 --- A be/src/runtime/backend-client.h M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/runtime/data-stream-sender.cc M be/src/runtime/data-stream-sender.h M be/src/runtime/data-stream-test.cc M be/src/runtime/exec-env.cc M be/src/runtime/exec-env.h M be/src/runtime/runtime-filter.cc M be/src/runtime/runtime-state.h M be/src/service/fragment-exec-state.cc M be/src/service/fragment-exec-state.h M be/src/util/runtime-profile-test.cc M be/src/util/runtime-profile.cc M be/src/util/runtime-profile.h M be/src/util/stopwatch.h 17 files changed, 370 insertions(+), 44 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/6 -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 6 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. Patch Set 5: (14 comments) http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/runtime/backend-client.h File be/src/runtime/backend-client.h: Line 66: /// Common cache / connection types > remove this comment (there's only one cache / connection type here etc.) Done http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/runtime/data-stream-test.cc File be/src/runtime/data-stream-test.cc: Line 630: client.DoRpc(::TransmitData, params, ); > nit: one line? Done http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/runtime/exec-env.cc File be/src/runtime/exec-env.cc: Line 200: new ImpalaBackendClientCache( > one line? Done http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/service/fragment-exec-state.cc File be/src/service/fragment-exec-state.cc: Line 118: coord.DoRpc(::ReportExecStatus, params, ); > nit: one line? Done http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/util/runtime-profile-test.cc File be/src/util/runtime-profile-test.cc: Line 541: : timercounter_(TUnit::TIME_NS) {} > nit: one line? Done Line 548: NULL > just for simplicity, why not assign this in the constructor? Then line 582 Sorry, I don't get it. Do you mean workers_.push_back(DummyWorker(new thread(...))); How do I pass DummyWorker to TimerCounterTest::Run()? Line 605: 2000 > why did this change so much? By comparing with the wall clock, the diff should be under 1ms. Done Line 605: MAX_TIMER_ERROR > add units to the variable name (ns?) Done Line 622: TimerCounterTest& timer > this should be a pointer if it's going to have non-const methods called on Done http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/util/runtime-profile.h File be/src/util/runtime-profile.h: Line 382: : Counter(unit) {} > nit: one line? 
Done Line 392: uint64_t LapTime() { return csw_.LapTime(); } > needs a comment Done http://gerrit.cloudera.org:8080/#/c/2578/5/be/src/util/stopwatch.h File be/src/util/stopwatch.h: Line 182: /// Utility class to measure multiple threads concurrent running time. > I think it's important to say *wall* time here, otherwise users might expec Done Line 207: and reset lap_time_ : /// back to 0. > can remove this bit of the comment: it's an implementation detail (the beha Done Line 228: /// Lock with busy_threads_. > Nit: put the lock above the thing it protects (easiest thing is to swap msw Done -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has uploaded a new patch set (#5). Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. IMPALA-2076: Correct execution time tracking for DataStreamSender. DataStreamSender uses multiple channels (each channel uses one thread) to send data. We want to know the active wall-clock time used by DataStreamSender no matter how many threads are used so ignore receiver waiting time and only count concurrent sending time and batch serialization time. Added a ConcurrentStopWatch class to get concurrent transmit time. If a thread is already running, the following thread won't reset the stop watch. And the stop watch is stopped only after all threads finish their work. Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 --- A be/src/runtime/backend-client.h M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/runtime/data-stream-sender.cc M be/src/runtime/data-stream-sender.h M be/src/runtime/data-stream-test.cc M be/src/runtime/exec-env.cc M be/src/runtime/exec-env.h M be/src/runtime/runtime-filter.cc M be/src/runtime/runtime-state.h M be/src/service/fragment-exec-state.cc M be/src/service/fragment-exec-state.h M be/src/util/runtime-profile-test.cc M be/src/util/runtime-profile.cc M be/src/util/runtime-profile.h M be/src/util/stopwatch.h 17 files changed, 376 insertions(+), 39 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/5 -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. Patch Set 4: (11 comments) http://gerrit.cloudera.org:8080/#/c/2578/4/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: Line 209: COUNTER_ADD(parent_->profile_->total_time_counter(), > I'm not totally sure if this does the right thing - presumably you want to I don't want to add the wall-clock time for all sender threads together. The DataStreamSender total_time = total serialize time + thrift_transmit_timer_. I want to add the delta here to get total_time updated frequently; the lap time is what I want. I'll clean up the LapTime() implementation like you suggested. http://gerrit.cloudera.org:8080/#/c/2578/4/be/src/util/runtime-profile-test.cc File be/src/util/runtime-profile-test.cc: Line 550: delete thread_handle; > how does this work if thread_handle is NULL (see line 584)? Done Line 557: while(!worker->done) { > space before ( Done Line 577: void StopWorkers(int index) { > needs a comment. maybe call 'index' something like 'exclude_idx' and defaul Done Line 580: index >= 0 > how would index be < 0? I'll add comments. StopWorkers() can stop a specific thread, or all threads; -1 means all threads. Line 589: static const int MAX_NUM_THREADS = 5; > why have this, rather than just initialize workers_[] in StartWorkers? Removed. Line 608: void ValidateLapTime(TimerCounterTest& timer, int64_t expected_value) { > missing const? This one cannot be const; LapTime() will update a ConcurrentStopWatch member variable. Line 638: int running_time = 0; > rather than tracking running_time per sleep-period, it might be better to u Thanks for pointing this out. Done. 
Line 659: tester.StartWorkers(2, 0); > suggest adding a sleep before this to show that the time between working pe Done http://gerrit.cloudera.org:8080/#/c/2578/4/be/src/util/runtime-profile.h File be/src/util/runtime-profile.h: Line 379: If a thread is already running, the following thread won't reset the stop watch. > this is confusing - what's the following thread? Done http://gerrit.cloudera.org:8080/#/c/2578/4/be/src/util/stopwatch.h File be/src/util/stopwatch.h: Line 208: /// Returns running time since last time LapTime() is called and reset lap_time_ : /// back to 0. > this doesn't seem to match the implementation. If we call LapTime() twice i Sorry for the confusion. I do need the lap time, for callers who want delta updates of the concurrent running time. -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 4 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
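The LapTime() semantics Juan describes above (a delta of running time since the previous call, so the profile's total time counter can be incremented frequently rather than set once at the end) can be sketched as follows. The class and method names here are illustrative assumptions, not the Impala stopwatch code.

```cpp
#include <cstdint>

// Tracks the delta between successive LapTime() calls. The caller feeds in
// the stopwatch's cumulative running time; the return value is how much
// running time was added since the previous call, suitable for
// COUNTER_ADD-style incremental updates of a profile counter.
class LapTracker {
 public:
  int64_t LapTime(int64_t total_running_time_ns) {
    int64_t delta = total_running_time_ns - last_total_ns_;
    last_total_ns_ = total_running_time_ns;
    return delta;
  }

 private:
  int64_t last_total_ns_ = 0;
};
```

Calling LapTime() twice with no new running time in between returns 0 the second time, which is the behavior the reviewer was asking about.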
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has uploaded a new patch set (#4). Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. IMPALA-2076: Correct execution time tracking for DataStreamSender. DataStreamSender uses multiple channels (each channel uses one thread) to send data. We want to know the active wall-clock time used by DataStreamSender no matter how many threads are used so ignore receiver waiting time and only count concurrent sending time and batch serialization time. Added a ConcurrentStopWatch class to get concurrent transmit time. If a thread is already running, the following thread won't reset the stop watch. And the stop watch is stopped only after all threads finish their work. Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 --- A be/src/runtime/backend-client.h M be/src/runtime/client-cache.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/runtime/data-stream-sender.cc M be/src/runtime/data-stream-sender.h M be/src/runtime/data-stream-test.cc M be/src/runtime/exec-env.cc M be/src/runtime/exec-env.h M be/src/runtime/runtime-filter.cc M be/src/runtime/runtime-state.h M be/src/service/fragment-exec-state.cc M be/src/service/fragment-exec-state.h M be/src/util/runtime-profile-test.cc M be/src/util/runtime-profile.cc M be/src/util/runtime-profile.h M be/src/util/stopwatch.h 17 files changed, 359 insertions(+), 39 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/4 -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 4 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time tracking for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time tracking for DataStreamSender. .. Patch Set 3: (14 comments) http://gerrit.cloudera.org:8080/#/c/2578/3/be/src/runtime/backend-client.h File be/src/runtime/backend-client.h: Line 15: #include "runtime/client-cache.h" > Unused? Defined ImpalaBackendClient in this header and keep this. Line 22: /// Backend Client that can take a RuntimeProfile::ConcurrentTimerCounter to track : /// transmit data time for multiple threads. If more than one threads send data : /// at the same time, the transmit time is count only once. > Let's make this a bit more clear, e.g.: Done Line 27: ImpalaBackendClient(boost::shared_ptr< ::apache::thrift::protocol::TProtocol> prot) > is this space needed? Thanks for pointing it out. yes, it's needed. without space, compiler will complain. In file included from /home/cdh52/Impala/be/src/runtime/exec-env.h:26: /home/cdh52/Impala/be/src/runtime/backend-client.h:30:40: error: found '<::' after a template name which forms the digraph '<:' (aka '[') and a ':', did you mean '< ::'? ImpalaBackendClient(boost::shared_ptr<::apache::thrift::protocol::TProtocol> prot) ^~~ < :: Line 30: ImpalaBackendClient(boost::shared_ptr< ::apache::thrift::protocol::TProtocol> iprot, > blank line before this one Done Line 45: Caller who wants to track data transmit time need to set its own counter. > Maybe: "Callers of TransmitData() should provide their own counter to measu Done Line 51: /// ImpalaBackendClient is shared by multiple queries. It's caller's responsibility > It's the caller's... Done Line 52: /// to reset the counter after data transmition. 
> transmission Done http://gerrit.cloudera.org:8080/#/c/2578/3/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: Line 350: class ImpalaBackendClient; : typedef ClientCache ImpalaBackendClientCache; : typedef ClientConnection ImpalaBackendConnection; > Let's move these to the header that defines ImpalaBackendClient (and then y Done http://gerrit.cloudera.org:8080/#/c/2578/3/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: Line 383: ThriftTransmitTime > We use (*) to indicate that a timer is the sum of several counter's wall ti Done http://gerrit.cloudera.org:8080/#/c/2578/3/be/src/util/stopwatch.h File be/src/util/stopwatch.h: Line 187: last_running_time_(0), : busy_threads_(0) { : } > fit this more onto one line Done Line 190: ~ConcurrentStopWatch() { : msw_.Stop(); : } > I don't think you need this, because ConcurrentStopWatch isn't supposed to Done Line 214: LastRunningTime > I think we would usually call this the LapTime() (like on a regular stopwat Done Line 226: /// This is for caller who just want the most recent running time.. > Can you define what this is more clearly? Done Line 233: boost::mutex thread_counter_lock_; > I think this is a case where a SpinLock would make some sense, since the cr Done -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 3 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
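The `< ::` spacing issue discussed above is a C++ lexing quirk: `<:` is an alternative token (digraph) for `[`, so under maximal-munch lexing a pre-C++11 compiler tokenizes `shared_ptr<::apache::...>` as `shared_ptr[` followed by garbage, which is exactly the clang error Juan quotes. C++11 added a carve-out (a `<::` not followed by `:` or `>` now lexes as `<` then `::`), but the explicit space remains the portable fix. A minimal sketch, using std::shared_ptr and a stand-in namespace instead of the boost/thrift types:

```cpp
#include <memory>

namespace apache_like { struct TProtocol {}; }  // stand-in namespace for illustration

// The space after '<' keeps "<" and "::" as separate tokens, avoiding the
// "<:" digraph (an alternative token for '[') on older compilers.
std::shared_ptr< ::apache_like::TProtocol> MakeProtocol() {
  return std::make_shared< ::apache_like::TProtocol>();
}
```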
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Juan Yu has posted comments on this change. Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. Patch Set 15: Code-Review+2 Carry Dan's +2 -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 15 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com> Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Juan Yu has posted comments on this change. Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. Patch Set 14: (1 comment) http://gerrit.cloudera.org:8080/#/c/2543/14/be/src/util/decompress-test.cc File be/src/util/decompress-test.cc: Line 306: 8mb > STREAM_OUT_BUF_SIZE Done -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 14 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Juan Yu has posted comments on this change. Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. Patch Set 14: rebase, carry Dan's +2 -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 14 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com> Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Hello Skye Wanderman-Milne, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2543 to look at the new patch set (#15). Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. IMPALA-3038: Add multistream gzip/bzip2 test coverage Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 --- M be/src/testutil/gtest-util.h M be/src/util/codec.h M be/src/util/decompress-test.cc M be/src/util/decompress.cc 4 files changed, 123 insertions(+), 17 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/15 -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 15 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Juan Yu has posted comments on this change. Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. Patch Set 14: Code-Review+1 Update comments and trivial change. Carry forward +1 -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 14 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com> Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Hello Skye Wanderman-Milne, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2543 to look at the new patch set (#14). Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. IMPALA-3038: Add multistream gzip/bzip2 test coverage Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 --- M be/src/testutil/gtest-util.h M be/src/util/codec.h M be/src/util/decompress-test.cc M be/src/util/decompress.cc 4 files changed, 122 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/14 -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 14 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Juan Yu has posted comments on this change. Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. Patch Set 13: (8 comments) http://gerrit.cloudera.org:8080/#/c/2543/13/be/src/util/decompress-test.cc File be/src/util/decompress-test.cc: Line 187: Input is NULL or input_len is 0 > the code checks compressed_length, not 'input' nor 'input_len'. should the Done Line 232: . > let's explicitly say: "We should get expected_stream_end == false but with Done Line 252: 8MB > which constant does this correspond to? can we make the other constants der This is STREAM_OUT_BUF_SIZE defined in decompress.cc. I can move it to codec.h to expose it. Line 258: rand > can the seed be specified at the command line (to reproduce failures)? See L366; the seed is printed so we can reproduce failures. Line 273: some space > is the "some space" why we do the - RAW_INPUT_SIZE? Or is the choice of RA Yes, it's why we do - RAW_INPUT_SIZE. Done Line 291: 16 > summarize how this is chosen. Done Line 292: 32 > and this Done Line 294: 2 * 26 > and this Done -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 13 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Hello Skye Wanderman-Milne, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2543 to look at the new patch set (#13). Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. IMPALA-3038: Add multistream gzip/bzip2 test coverage Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 --- M be/src/testutil/gtest-util.h M be/src/util/decompress-test.cc M be/src/util/decompress.cc 3 files changed, 109 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/13 -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 13 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Juan Yu has posted comments on this change. Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. Patch Set 13: Code-Review+1 Carry Skye's +1 -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 13 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com> Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time for DataStreamSender. .. Patch Set 2: @Henry, Thanks for the review and suggestion. I'll think about adding a new counter type for the acquire/release semantics from UpdateSenderStats() and try to get rid of the new DoRpc. We only want to count the send_TransmitData() timing because that's the main job of DataStreamSender and where the network traffic is spent. recv_TransmitData() mainly waits for the receiver to deserialize and process data, which doesn't use much network, and that part is already tracked by the Exchange node. I'll update the commit message with more details. -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: No
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Hello Skye Wanderman-Milne, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2543 to look at the new patch set (#12). Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. IMPALA-3038: Add multistream gzip/bzip2 test coverage Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 --- M be/src/testutil/gtest-util.h M be/src/util/decompress-test.cc M be/src/util/decompress.cc 3 files changed, 109 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/12 -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 12 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Juan Yu has posted comments on this change. Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/2543/10/be/src/util/decompress.cc File be/src/util/decompress.cc: Line 95: *stream_end = input_length == 0; > This is for the special case that caller pass in a NULL input, only used in On second thought, since this is an output parameter, shouldn't the decompressor always set it based on all the info it has (in case the caller forgets to initialize it)? Maybe set *stream_end = false; -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Hello Skye Wanderman-Milne, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/2543 to look at the new patch set (#11). Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. IMPALA-3038: Add multistream gzip/bzip2 test coverage Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 --- M be/src/testutil/gtest-util.h M be/src/util/decompress-test.cc M be/src/util/decompress.cc 3 files changed, 109 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/11 -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 11 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Juan Yu has posted comments on this change. Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/2543/10/be/src/util/decompress.cc File be/src/util/decompress.cc: Line 95: *stream_end = input_length == 0; > Are you sure this is always correct thing to do given the way HdfsTextScann This is for the special case where the caller passes in a NULL input, which is only used in a test case. I don't think the scanner will send NULL input to the decompressor. Maybe I should tweak the test case to not always check stream_end. -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time for DataStreamSender. .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/2578/2/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: Line 249: /// Similar RPC call wrapper as DoRpc() but separate request and response so we could : /// profile thrift transmit time properly. > This is just for TransmitData(). I also used tcpdump to monitor network traffic; the same number of packets is sent w/o the change. -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) IMPALA-3038: Add multistream gzip/bzip2 test coverage
Juan Yu has uploaded a new patch set (#10). Change subject: IMPALA-3038: Add multistream gzip/bzip2 test coverage .. IMPALA-3038: Add multistream gzip/bzip2 test coverage Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 --- M be/src/testutil/gtest-util.h M be/src/util/decompress-test.cc M be/src/util/decompress.cc 3 files changed, 107 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/43/2543/10 -- To view, visit http://gerrit.cloudera.org:8080/2543 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9b0e1971145dd457e71fc9c00ce7c06fff8dea88 Gerrit-PatchSet: 10 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Skye Wanderman-Milne <s...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time for DataStreamSender.
Juan Yu has posted comments on this change. Change subject: IMPALA-2076: Correct execution time for DataStreamSender. .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/2578/2/be/src/runtime/client-cache.h File be/src/runtime/client-cache.h: Line 249: /// Similar RPC call wrapper as DoRpc() but separate request and response so we could : /// profile thrift transmit time properly. > Are we okay with the increase in network traffic once we split this up? This is just for TransmitData(). It is currently two calls, so I don't think it increases network traffic:

void ImpalaInternalServiceClient::TransmitData(TTransmitDataResult& _return, const TTransmitDataParams& params) {
  send_TransmitData(params);
  recv_TransmitData(_return);
}

The recv_TransmitData() call is used to apply back pressure: if the upstream node cannot consume data as fast as needed, it slows down the sender to avoid data congestion. -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com> Gerrit-HasComments: Yes
[Impala-CR](cdh5-trunk) PREVIEW IMPALA-2076: Part2 - Show DataStreamSender in ExecSummary under EXCHANGE node.
Juan Yu has abandoned this change. Change subject: PREVIEW IMPALA-2076: Part2 - Show DataStreamSender in ExecSummary under EXCHANGE node. .. Abandoned will take a different approach to add non-plan node profile to ExecSummary -- To view, visit http://gerrit.cloudera.org:8080/2579 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: abandon Gerrit-Change-Id: Ia53206fdd02bd72294f86ab9dd0b71b58303d33e Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com>
[Impala-CR](cdh5-trunk) IMPALA-2076: Correct execution time for DataStreamSender.
Juan Yu has uploaded a new patch set (#2). Change subject: IMPALA-2076: Correct execution time for DataStreamSender. .. IMPALA-2076: Correct execution time for DataStreamSender. Ignore receiver waiting time and only count thrift sending time and batch serialization time for DataStreamSender. Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 --- M be/src/runtime/client-cache.h M be/src/runtime/data-stream-sender.cc M be/src/runtime/data-stream-sender.h 3 files changed, 86 insertions(+), 9 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/78/2578/2 -- To view, visit http://gerrit.cloudera.org:8080/2578 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9c530b2056d852c8bcac6263e9e6b1a6bede1047 Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Juan Yu <j...@cloudera.com>
[Impala-CR](cdh5-2.2.0_5.4.x) CDH-38160: Prevent destructing uninitialized avro schema pointers
Juan Yu has posted comments on this change. Change subject: CDH-38160: Prevent destructing uninitialized avro schema pointers .. Patch Set 5: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/2582 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I02ae0ba586887e4bb713b993edd208f9421dd1a8 Gerrit-PatchSet: 5 Gerrit-Project: Impala Gerrit-Branch: cdh5-2.2.0_5.4.x Gerrit-Owner: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Huaisi Xu <h...@cloudera.com> Gerrit-Reviewer: Juan Yu <j...@cloudera.com> Gerrit-HasComments: No