[kudu-CR] [tests] fixed flake in consensus peer health status
Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10237 )

Change subject: [tests] fixed flake in consensus_peer_health_status

[tests] fixed flake in consensus_peer_health_status

Fixed flake in the TestPeerHealthStatusTransitions scenario of the
ConsensusPeerHealthStatusITest test.

Prior to the fix, the flakiness happened when the target tablet server
was shut down during an ongoing tablet copy, where the tablet copy was
initiated by the AddServer() call while preparing the mini-cluster for
the peer health sequence of
"HEALTHY -> UNKNOWN -> FAILED -> FAILED_UNRECOVERABLE". In that
situation, the source tablet server had the corresponding WAL segments
anchored, so they could not be GCed. As a result, the tablet replica
would not get the FAILED_UNRECOVERABLE health status within 30 seconds,
because --tablet_copy_idle_timeout_sec is set to 600 seconds by default.

Test results using dist-test, before and after the fix (ASAN):
  before: http://dist-test.cloudera.org/job?job_id=aserbin.1525111507.120645
  after:  http://dist-test.cloudera.org/job?job_id=aserbin.1525099526.63998

Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
Reviewed-on: http://gerrit.cloudera.org:8080/10237
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy
---
M src/kudu/integration-tests/consensus_peer_health_status-itest.cc
1 file changed, 6 insertions(+), 2 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Mike Percy: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/10237
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
Gerrit-Change-Number: 10237
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin
Gerrit-Reviewer: Alexey Serbin
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy
Gerrit-Reviewer: Todd Lipcon
[kudu-CR] [tests] fixed flake in consensus peer health status
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/10237 )

Change subject: [tests] fixed flake in consensus_peer_health_status

Patch Set 2: Code-Review+2

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
Gerrit-Change-Number: 10237
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin
Gerrit-Reviewer: Alexey Serbin
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy
Gerrit-Reviewer: Todd Lipcon
Gerrit-Comment-Date: Mon, 30 Apr 2018 22:50:17 +
Gerrit-HasComments: No
[kudu-CR] [tests] fixed flake in consensus peer health status
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10237 )

Change subject: [tests] fixed flake in consensus_peer_health_status

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10237/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10237/1//COMMIT_MSG@11
PS1, Line 11: happened when the target tablet server was shutdown during an on-going
            : tablet copy. In that situation, the source tablet server had
            : c
curious why you took this approach instead of just waiting for the tablet copy to finish? The tablet copy here is caused by the prior part of the test, right?

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
Gerrit-Change-Number: 10237
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon
Gerrit-Comment-Date: Mon, 30 Apr 2018 16:41:02 +
Gerrit-HasComments: Yes
[kudu-CR] [tests] fixed flake in consensus peer health status
Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/10237 )

Change subject: [tests] fixed flake in consensus_peer_health_status

[tests] fixed flake in consensus_peer_health_status

Fixed flake in the TestPeerHealthStatusTransitions scenario of the
ConsensusPeerHealthStatusITest test.

Prior to the fix, the flakiness happened when the target tablet server
was shut down during an ongoing tablet copy. In that situation, the
source tablet server had the corresponding WAL segments anchored, so
they could not be GCed. As a result, the tablet replica would not get
the FAILED_UNRECOVERABLE health status assigned.

Test results using dist-test, before and after the fix (ASAN):
  before: http://dist-test.cloudera.org/job?job_id=aserbin.1525098763.54471
  after:  http://dist-test.cloudera.org/job?job_id=aserbin.1525099526.63998

Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
---
M src/kudu/integration-tests/consensus_peer_health_status-itest.cc
1 file changed, 1 insertion(+), 0 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/37/10237/1

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
Gerrit-Change-Number: 10237
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin