[kudu-CR] [tests] de-flaking catalog manager tsk-itest
Alexey Serbin has abandoned this change. ( http://gerrit.cloudera.org:8080/8017 )

Change subject: [tests] de-flaking catalog_manager_tsk-itest

Abandoned

Since the fix for KUDU-2149 has been committed, this test became stable even with its original settings.

--
To view, visit http://gerrit.cloudera.org:8080/8017
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-Change-Number: 8017
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin
Gerrit-Reviewer: Adar Dembo
Gerrit-Reviewer: Alexey Serbin
Gerrit-Reviewer: Andrew Wong
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon
[kudu-CR] [tests] de-flaking catalog manager tsk-itest
Alexey Serbin has posted comments on this change.

Patch Set 1:

> What's the verdict on this? It seems like the test is no longer as
> flaky as it was last week. Did we fix something?

I was planning to take a closer look at why this test started being flaky after the mentioned changelist 21b0f3d5, but I haven't done that yet. As for the test appearing less flaky: nothing has been fixed in that regard yet. I think the observed 'more stable behavior' was due to lower load while the test was running (or the test might have been run on more powerful machines recently).
[kudu-CR] [tests] de-flaking catalog manager tsk-itest
Todd Lipcon has posted comments on this change.

Patch Set 1:

What's the verdict on this? It seems like the test is no longer as flaky as it was last week. Did we fix something?
[kudu-CR] [tests] de-flaking catalog manager tsk-itest
Adar Dembo has posted comments on this change.

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8017/1//COMMIT_MSG
Commit Message:

> OK, here the result running without and with 21b0f3d5 changelist:

I can buy that the new approach to failure detection could require rejiggering of test parameters in order to find the new boundaries. However, if the logs/timings show that election convergence is net _less efficient_ than it was before that change, then it'd be better to treat that as a bug and figure out how to fix that than it'd be to rejigger the boundary conditions in this one test.

The main question is whether election convergence is "worse" or just "different". If the latter, then I agree with you that we should just tweak the timings in this test. But if the former, then we should address that directly.
[kudu-CR] [tests] de-flaking catalog manager tsk-itest
Alexey Serbin has posted comments on this change.

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8017/1//COMMIT_MSG
Commit Message:

> Did you try looping this test with the recent failure-detection change (htt

OK, here are the results of running without and with the 21b0f3d5 changelist:

Without the changelist (HEAD is at c8e04077), --stress-cpu-threads=16, 0/1024 failed:
  http://dist-test.cloudera.org//job?job_id=aserbin.1505159852.20744

With the changelist (HEAD is at 21b0f3d5), --stress-cpu-threads=16, at least 7/1024 failed:
  http://dist-test.cloudera.org//job?job_id=aserbin.1505160964.1750

Could you clarify what you want to address in this regard? As I understand it, the test was built to induce many re-elections among masters, and the parameters were set so that the process converged more or less within the specified timeout intervals. With the new way of sending heartbeats and doing master failure detection, it seems the masters sometimes were not able to handle Raft HBs as fast as they used to. But it's all about 'boundary' conditions, as I understand it.
[kudu-CR] [tests] de-flaking catalog manager tsk-itest
Andrew Wong has posted comments on this change.

Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8017/1/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS1, Line 84: // Add master-only flags.
Someone newly reading through this test might not understand why all these flags are necessary without reading the commit msg. Could you comment with a high-level statement explaining what the desired behavior of the master is?

PS1, Line 98: // Add tserver-only flags.
Same here.
[kudu-CR] [tests] de-flaking catalog manager tsk-itest
Alexey Serbin has posted comments on this change.

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8017/1/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

Line 64: hb_interval_ms_(128),
> In this test we want to induce many elections among masters, so that electi

And the more re-elections we have among masters, the better. That's why the Raft HB interval is set to just tens or hundreds of milliseconds.

Another approach might be to set --catalog_manager_inject_latency_prior_tsk_write_ms=1 and use the default Raft HB interval of 1 second, but that would require a longer test runtime to get the same number of master re-elections during the test.
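The runtime trade-off mentioned above can be made concrete with a rough estimate. The sketch below assumes, as a simplification not stated in the thread, that at most one re-election can be induced per Raft heartbeat interval. `MinRuntimeMs` is a hypothetical helper, and a target of 100 re-elections is an arbitrary number chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Rough lower bound on test runtime, assuming (as a simplification) that at
// most one master re-election can be induced per Raft heartbeat interval.
std::int64_t MinRuntimeMs(std::int64_t target_elections,
                          std::int64_t hb_interval_ms) {
  assert(target_elections >= 0 && hb_interval_ms > 0);
  return target_elections * hb_interval_ms;
}
```

Under this assumption, 100 re-elections need at least 12.8 seconds with the 128 ms interval from Line 64, versus at least 100 seconds with the default 1 second interval.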
[kudu-CR] [tests] de-flaking catalog manager tsk-itest
Alexey Serbin has posted comments on this change.

Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8017/1//COMMIT_MSG
Commit Message:

> Did you try looping this test with the recent failure-detection change (htt

I didn't try, but I know the test was not that flaky prior to that patch. I can double-check and report on that. Actually, I suspect there were 2 changelists that made this test flakier: this one and another one committed between September 2 and 4. I can dig in to find the exact ones.

http://gerrit.cloudera.org:8080/#/c/8017/1/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

Line 64: hb_interval_ms_(128),
> I get the feeling that, although these values may now be carefully tuned so

In this test we want to induce many elections among masters, so that elections happen while a leader tries to write some data into the system catalog table (in particular, a new token signing key). Adding the --catalog_manager_inject_latency_prior_tsk_write_ms=1000 flag and making the Raft HB interval shorter than that 1000 ms (along with disabling pre-elections) gives us the desired behavior.
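The flag combination described above can be sketched as a small helper that assembles the master flags. This is a minimal illustration, not the code under review: `BuildMasterFlags` is a hypothetical helper, `--catalog_manager_inject_latency_prior_tsk_write_ms` is the flag named in the comment, and the spellings `--raft_heartbeat_interval_ms` and `--raft_enable_pre_election` are assumptions that should be checked against the Kudu source.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Latency injected before a leader master writes a new token signing key
// (TSK) into the system catalog table; the value comes from the comment above.
constexpr int kTskWriteLatencyMs = 1000;

// Hypothetical helper: builds master flags so that several Raft heartbeat
// periods elapse while a TSK write is pending.
std::vector<std::string> BuildMasterFlags(int hb_interval_ms) {
  // The heartbeat interval must be shorter than the injected latency,
  // otherwise an election is unlikely to overlap with the TSK write.
  assert(hb_interval_ms > 0 && hb_interval_ms < kTskWriteLatencyMs);
  return {
    "--catalog_manager_inject_latency_prior_tsk_write_ms=" +
        std::to_string(kTskWriteLatencyMs),
    // Assumed flag spellings; verify against the Kudu source.
    "--raft_heartbeat_interval_ms=" + std::to_string(hb_interval_ms),
    "--raft_enable_pre_election=false",
  };
}
```

Calling `BuildMasterFlags(128)` would mirror the `hb_interval_ms_(128)` value at Line 64. Keeping the reason for each flag next to the flag itself is one way to address the request for a high-level explanation in the test.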
[kudu-CR] [tests] de-flaking catalog manager tsk-itest
Alexey Serbin has uploaded a new change for review.

http://gerrit.cloudera.org:8080/8017

[tests] de-flaking catalog_manager_tsk-itest

After recent updates, catalog_manager_tsk-itest became unstable. One of the failure scenarios is the test tablet server failing to register with the cluster within the specified timeout (30 seconds). It seems some test machines are too slow to accommodate the test scenario with a 16 ms Raft heartbeat interval.

When the test runs with too short a Raft heartbeat interval, the following scenario occurs on slow or very busy machines:
  * An election happens among masters (e.g., term 1) and a leader master is elected.
  * Shortly after that, the followers stop receiving some Raft heartbeats from the leader within the specified timeout interval.
  * The followers start a new election, but their vote requests to each other time out as well.
  * The leader fails to get responses from the followers for its UpdateConsensus RPC requests.
  * The tablet server fails to register with the cluster.

Sometimes the scenario above is compounded by incoming Raft requests being dropped due to backpressure on the Raft RPC service queue in masters.

The following changes were made to address the flakiness due to the described scenarios:
  * increasing the Raft heartbeat interval
  * increasing the max length of the Raft RPC service queue
  * increasing the back-off interval after leader election failures

After making the changes above, the test became more stable.
Not a single failure was spotted in multiple 1K runs via dist-test with --stress_cpu_threads=16:

  ASAN: http://dist-test.cloudera.org/job?job_id=aserbin.1504914035.18113
  DEBUG: http://dist-test.cloudera.org/job?job_id=aserbin.1504911962.26895
  RELEASE: http://dist-test.cloudera.org/job?job_id=aserbin.1504913524.8185
  TSAN: http://dist-test.cloudera.org/job?job_id=aserbin.1504903775.17126

This is a follow-up for faa0b14effb6e15f9989d686e5a1f8e1040a1dd6.

Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
---
M src/kudu/integration-tests/catalog_manager_tsk-itest.cc
1 file changed, 5 insertions(+), 3 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/17/8017/1
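The sensitivity to the heartbeat interval described in the commit message can be illustrated with a deliberately simplified timing model. This is an assumption for illustration only and not Kudu's actual failure detector: a follower is taken to start an election once the gap since the last leader heartbeat exceeds the heartbeat interval multiplied by an allowed number of missed periods. `FollowerStartsElection` and its parameters are hypothetical names.

```cpp
#include <cassert>

// Simplified model (an assumption, not Kudu's failure detector): a follower
// starts an election if the time since the last leader heartbeat exceeds
// hb_interval_ms * max_missed_periods.
bool FollowerStartsElection(int hb_interval_ms,
                            double max_missed_periods,
                            int gap_since_last_heartbeat_ms) {
  assert(hb_interval_ms > 0 && max_missed_periods > 0);
  const double timeout_ms = hb_interval_ms * max_missed_periods;
  return gap_since_last_heartbeat_ms > timeout_ms;
}
```

In this model, a hypothetical 50 ms stall on a busy machine with one allowed missed period triggers an election at a 16 ms interval (50 > 16) but not at a 128 ms interval (50 < 128), which is consistent with a longer heartbeat interval making the test less sensitive to slow machines.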