[kudu-CR] [tests] de-flaking catalog manager tsk-itest

2017-09-29 Thread Alexey Serbin (Code Review)
Alexey Serbin has abandoned this change. ( http://gerrit.cloudera.org:8080/8017 
)

Change subject: [tests] de-flaking catalog_manager_tsk-itest
..


Abandoned

Since the fix for KUDU-2149 has been committed, this test has become stable 
even with its original settings.
--
To view, visit http://gerrit.cloudera.org:8080/8017
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-Change-Number: 8017
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] [tests] de-flaking catalog manager tsk-itest

2017-09-14 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
..


Patch Set 1:

> What's the verdict on this? It seems like the test is no longer as
 > flaky as it was last week. Did we fix something?

I was planning to take a closer look at why this test became flaky after the 
mentioned changelist 21b0f3d5, but I haven't done that yet.

As for the test appearing less flaky: nothing has been fixed in that regard 
yet. I think the observed 'more stable behavior' was due to lighter load while 
the test was running (or the test may have been run on more powerful machines 
recently).



[kudu-CR] [tests] de-flaking catalog manager tsk-itest

2017-09-14 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
..


Patch Set 1:

What's the verdict on this? It seems like the test is no longer as flaky as it 
was last week. Did we fix something?



[kudu-CR] [tests] de-flaking catalog manager tsk-itest

2017-09-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8017/1//COMMIT_MSG
Commit Message:

> OK, here are the results of running without and with changelist 21b0f3d5:
I can buy that the new approach to failure detection could require rejiggering 
of test parameters in order to find the new boundaries.

However, if the logs/timings show that election convergence is net _less 
efficient_ than it was before that change, then it'd be better to treat that as 
a bug and figure out how to fix that than it'd be to rejigger the boundary 
conditions in this one test.

The main question is whether election convergence is "worse" or just 
"different". If the latter, then I agree with you that we should just tweak the 
timings in this test. But if the former, then we should address that directly.




[kudu-CR] [tests] de-flaking catalog manager tsk-itest

2017-09-11 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8017/1//COMMIT_MSG
Commit Message:

> Did you try looping this test with the recent failure-detection change (htt
OK, here are the results of running without and with changelist 21b0f3d5:

Without the changelist (HEAD is at c8e04077), --stress-cpu-threads=16, 0/1024 
failed:
  http://dist-test.cloudera.org//job?job_id=aserbin.1505159852.20744

With the changelist (HEAD is at 21b0f3d5), --stress-cpu-threads=16, at least 
7/1024 failed:
  http://dist-test.cloudera.org//job?job_id=aserbin.1505160964.1750
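
As a rough sanity check on whether 0/1024 versus 7/1024 could be noise, the 
sketch below computes two numbers. It assumes the runs are independent and 
that each run fails with the same fixed probability, which may not hold on 
shared test machines. The function names are hypothetical and are not part of 
Kudu or dist-test.

```cpp
#include <cassert>
#include <cmath>

// Probability of seeing zero failures in `runs` independent runs when each
// run fails with probability `p` (assumed constant across runs).
double ProbAllPass(double p, int runs) {
  assert(p >= 0.0 && p <= 1.0 && runs >= 0);
  return std::pow(1.0 - p, runs);
}

// One-sided upper bound on the per-run failure probability after `runs`
// runs with zero failures, at the given confidence level. Solves
// (1 - p)^runs = 1 - confidence for p.
double UpperBoundAfterCleanRuns(int runs, double confidence) {
  assert(runs > 0 && confidence > 0.0 && confidence < 1.0);
  return 1.0 - std::pow(1.0 - confidence, 1.0 / runs);
}
```

Under those assumptions, ProbAllPass(7.0 / 1024, 1024) is below 0.001, so a 
clean 1024-run loop would be very unlikely if the true failure rate were 
7/1024. UpperBoundAfterCleanRuns(1024, 0.95) is roughly 0.0029, that is, 
about 3 failures per 1024 runs, which is below the observed 7/1024.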


Could you clarify what you want to address in this regard?

As I understand it, the test was built to induce many re-elections among 
masters, and the parameters were set so that the process converged more or 
less within the specified timeout intervals.  With the new way of sending 
heartbeats and doing master failure detection, it seems the masters sometimes 
cannot handle Raft HBs as quickly as they used to.  But it's all about 
'boundary' conditions, as I understand it.




[kudu-CR] [tests] de-flaking catalog manager tsk-itest

2017-09-08 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8017/1/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS1, Line 84: // Add master-only flags.
Someone newly reading through this test might not understand why all these 
flags are necessary without reading the commit msg. Could you comment with a 
high-level statement explaining what the desired behavior of the master is?


PS1, Line 98: // Add tserver-only flags.
Same here.




[kudu-CR] [tests] de-flaking catalog manager tsk-itest

2017-09-08 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8017/1/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

Line 64: hb_interval_ms_(128),
> In this test we want to induce many elections among masters, so that electi
And the more re-elections we have among masters, the better.  That's why the 
Raft HB interval is set to just tens or hundreds of milliseconds.

Another approach might be setting 
--catalog_manager_inject_latency_prior_tsk_write_ms=1 and using the default 
Raft HB interval of 1 second, but that would require a longer test runtime to 
get the same number of master re-elections during the test.




[kudu-CR] [tests] de-flaking catalog manager tsk-itest

2017-09-08 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8017/1//COMMIT_MSG
Commit Message:

> Did you try looping this test with the recent failure-detection change (htt
I didn't try, but I know it was not that flaky prior to that patch.

I can double-check and report on that.

Actually, I suspect there were 2 changelists that made this test flakier: this 
one and another one committed between September 2 and 4.  I can dig in to find 
the exact ones.


http://gerrit.cloudera.org:8080/#/c/8017/1/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

Line 64: hb_interval_ms_(128),
> I get the feeling that, although these values may now be carefully tuned so
In this test we want to induce many elections among masters, so that elections 
happen while a leader tries to write some data into the system catalog table 
(particularly, a new token signing key).  Adding that 
--catalog_manager_inject_latency_prior_tsk_write_ms=1000 flag and making the 
Raft HB interval less than that 1000ms interval (along with disabling 
pre-elections) gives us the desired behavior.
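
A minimal sketch of how such a master flag set could be assembled is shown 
below. MasterFlags is a hypothetical helper, not code from the test. Only 
--catalog_manager_inject_latency_prior_tsk_write_ms appears in this thread; 
the other two flag names are assumptions and should be checked against the 
Kudu source before use.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: builds master flags so that the Raft heartbeat
// interval is shorter than the latency injected before the TSK write,
// making it likely that elections happen while the write is pending.
std::vector<std::string> MasterFlags(int raft_hb_interval_ms,
                                     int tsk_write_latency_ms) {
  // The condition described above: HB interval < injected latency.
  assert(raft_hb_interval_ms > 0 &&
         raft_hb_interval_ms < tsk_write_latency_ms);
  return {
      // Flag named in the review thread.
      "--catalog_manager_inject_latency_prior_tsk_write_ms=" +
          std::to_string(tsk_write_latency_ms),
      // Assumed flag name for the Raft heartbeat interval.
      "--raft_heartbeat_interval_ms=" + std::to_string(raft_hb_interval_ms),
      // Assumed flag name for disabling pre-elections.
      "--raft_enable_pre_election=false",
  };
}
```

For example, MasterFlags(128, 1000) pairs the 128 ms interval from line 64 
with the 1000 ms injected latency mentioned above.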




[kudu-CR] [tests] de-flaking catalog manager tsk-itest

2017-09-08 Thread Alexey Serbin (Code Review)
Alexey Serbin has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/8017

Change subject: [tests] de-flaking catalog_manager_tsk-itest
..

[tests] de-flaking catalog_manager_tsk-itest

After recent updates the catalog_manager_tsk-itest became unstable.
One of the failure scenarios is when the test tablet server cannot
register in the cluster within the specified timeout (30 seconds).

It seems some test machines are too slow to accommodate the test
scenario with a 16ms Raft heartbeat interval.  When running the test
with too short a Raft heartbeat interval, the following scenario occurs
on slow or very busy machines:

* An election happens among masters (e.g., term 1) and a leader master
  is elected.

* Shortly after that, the followers stop receiving some Raft heartbeats
  from the leader within the specified timeout interval.

* The followers start a new election, but experience timeouts for vote
  requests among them as well.

* The leader fails to get responses from the followers for its
  UpdateConsensus RPC requests.

* The tablet server fails to register with the cluster.

Sometimes the scenario above is compounded by dropped incoming
Raft requests due to the backpressure on the Raft RPC service queue
in masters.

The following changes were made to address the flakiness due
to the described scenarios:
  * increasing the Raft heartbeat interval
  * increasing max length of the Raft RPC service queue
  * increasing the back-off interval after leader election failures

After making the changes above, the test became more stable.  Not
a single failure was spotted in multiple 1K runs when running via dist-test
with --stress_cpu_threads=16:

ASAN:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504914035.18113

DEBUG:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504911962.26895

RELEASE:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504913524.8185

TSAN:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504903775.17126

This is a follow-up for faa0b14effb6e15f9989d686e5a1f8e1040a1dd6.

Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
---
M src/kudu/integration-tests/catalog_manager_tsk-itest.cc
1 file changed, 5 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/17/8017/1