[kudu-CR] [tests] fixed flake in consensus peer health status

2018-04-30 Thread Alexey Serbin (Code Review)
Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10237 )

Change subject: [tests] fixed flake in consensus_peer_health_status
..

[tests] fixed flake in consensus_peer_health_status

Fixed a flake in the TestPeerHealthStatusTransitions scenario of the
ConsensusPeerHealthStatusITest test.  Prior to the fix, the flakiness
happened when the target tablet server was shut down during an ongoing
tablet copy, where the tablet copy was initiated by an AddServer() call
while preparing the mini-cluster for the peer health sequence of
"HEALTHY -> UNKNOWN -> FAILED -> FAILED_UNRECOVERABLE".
In that situation, the source tablet server had the corresponding WAL
segments anchored, so they could not be GCed.  As a result, the tablet
replica would not get the FAILED_UNRECOVERABLE health status within
30 seconds, because --tablet_copy_idle_timeout_sec is set to 600 seconds
by default.
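
The timing argument above can be illustrated with a toy C++ model. This is
not Kudu code. It assumes, as the description implies, that the WAL anchor
held for an interrupted tablet copy is released only once the copy session
idles out, and that the replica can be reported as FAILED_UNRECOVERABLE only
after that. The function name is hypothetical, and so is the 5-second value
in the usage comment; 600 and 30 come from the description.

```cpp
// Toy model of the flake, not Kudu code. All names here are hypothetical.

// Returns true if a replica whose WAL segments are anchored by an
// interrupted tablet copy could be marked FAILED_UNRECOVERABLE before the
// test stops waiting.
//
// Assumption: the anchor is released only when the idle copy session
// expires, and the WAL segments can be GCed only after that.
bool ReachesUnrecoverableInTime(int copy_idle_timeout_sec, int test_wait_sec) {
  return copy_idle_timeout_sec < test_wait_sec;
}

// Usage:
//   ReachesUnrecoverableInTime(600, 30);  // default timeout: false
//   ReachesUnrecoverableInTime(5, 30);    // hypothetical shorter timeout: true
```

Under this model, the test can pass only if the copy session is gone (or
expires) before the 30-second wait runs out.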

Test results using dist-test, before and after the fix (ASAN),
before:
  http://dist-test.cloudera.org/job?job_id=aserbin.1525111507.120645

after:
  http://dist-test.cloudera.org/job?job_id=aserbin.1525099526.63998

Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
Reviewed-on: http://gerrit.cloudera.org:8080/10237
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy 
---
M src/kudu/integration-tests/consensus_peer_health_status-itest.cc
1 file changed, 6 insertions(+), 2 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Mike Percy: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/10237
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
Gerrit-Change-Number: 10237
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] [tests] fixed flake in consensus peer health status

2018-04-30 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/10237 )

Change subject: [tests] fixed flake in consensus_peer_health_status
..


Patch Set 2: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
Gerrit-Change-Number: 10237
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Mon, 30 Apr 2018 22:50:17 +
Gerrit-HasComments: No


[kudu-CR] [tests] fixed flake in consensus peer health status

2018-04-30 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10237 )

Change subject: [tests] fixed flake in consensus_peer_health_status
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10237/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10237/1//COMMIT_MSG@11
PS1, Line 11: happened when the target tablet server was shutdown during an on-going
: tablet copy.  In that situation, the source tablet server had
: c
Curious why you took this approach instead of just waiting for the tablet copy
to finish? The tablet copy here is caused by the prior part of the test, right?
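
A minimal sketch of the alternative raised in this comment, assuming the test
could poll some "tablet copy finished" condition before shutting the tablet
server down. WaitUntil and its parameters are hypothetical and are not Kudu
test utilities; the predicate stands in for whatever check the test would
actually use.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of Kudu: polls `done` until it returns true
// or `timeout` elapses. Returns whether the condition was met in time.
bool WaitUntil(const std::function<bool()>& done,
               std::chrono::milliseconds timeout,
               std::chrono::milliseconds poll_interval =
                   std::chrono::milliseconds(10)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!done()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // Gave up: the condition never became true.
    }
    std::this_thread::sleep_for(poll_interval);
  }
  return true;
}
```

With such a helper, the test would shut the target tablet server down only
after the wait returns true, so no copy session would be left holding WAL
anchors on the source.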




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
Gerrit-Change-Number: 10237
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Mon, 30 Apr 2018 16:41:02 +
Gerrit-HasComments: Yes


[kudu-CR] [tests] fixed flake in consensus peer health status

2018-04-30 Thread Alexey Serbin (Code Review)
Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/10237 )


Change subject: [tests] fixed flake in consensus_peer_health_status
..

[tests] fixed flake in consensus_peer_health_status

Fixed a flake in the TestPeerHealthStatusTransitions scenario of the
ConsensusPeerHealthStatusITest test.  Prior to the fix, the flakiness
happened when the target tablet server was shut down during an ongoing
tablet copy.  In that situation, the source tablet server had the
corresponding WAL segments anchored, so they could not be GCed.
As a result, the tablet replica would not get the FAILED_UNRECOVERABLE
health status assigned.

Test results using dist-test, before and after the fix (ASAN),
before:
  http://dist-test.cloudera.org/job?job_id=aserbin.1525098763.54471

after:
  http://dist-test.cloudera.org/job?job_id=aserbin.1525099526.63998

Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
---
M src/kudu/integration-tests/consensus_peer_health_status-itest.cc
1 file changed, 1 insertion(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/37/10237/1

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I8eb640604e98361029aa3342ffa3050e922b6629
Gerrit-Change-Number: 10237
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin