Alexey Serbin has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/8017

Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................

[tests] de-flaking catalog_manager_tsk-itest

After recent updates, the catalog_manager_tsk-itest became unstable.
One of the failure scenarios is when the test tablet server cannot
register with the cluster within the specified timeout (30 seconds).

It seems some test machines are too slow to accommodate the test
scenario with a 16ms Raft heartbeat interval.  When running the test
with too short a Raft heartbeat interval, the following scenario occurs
on slow or very busy machines:

* An election happens among masters (e.g., term 1) and a leader master
  is elected.

* Shortly after that, the followers stop receiving some Raft heartbeats
  from the leader within the specified timeout interval.

* The followers start a new election, but experience timeouts for vote
  requests among them as well.

* The leader fails to get responses from the followers for its
  UpdateConsensus RPC requests.

* The tablet server fails to register with the cluster.

Sometimes the scenario above is compounded by dropped incoming
Raft requests due to backpressure on the Raft RPC service queue
in masters.
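
To illustrate why a 16ms heartbeat interval is fragile on a busy
machine, here is a simplified sketch of the timing arithmetic.  It
assumes a follower declares the leader failed after a fixed number of
missed heartbeat periods; the multiplier of 3 used in the note below
and both helper names are assumptions made for illustration, not taken
from this patch, and the model ignores randomized election back-off.

```cpp
#include <cassert>

// Hypothetical helper: how long a follower waits without hearing from
// the leader before it considers the leader failed.  Assumes the window
// is simply the heartbeat interval times a missed-periods multiplier.
double FailureDetectionWindowMs(int heartbeat_interval_ms,
                                double max_missed_periods) {
  assert(heartbeat_interval_ms > 0 && max_missed_periods > 0);
  return heartbeat_interval_ms * max_missed_periods;
}

// Hypothetical helper: true if a stall of 'stall_ms' (for example, a
// scheduling delay on an overloaded machine) outlasts the detection
// window, in which case the follower would start a new election.
bool StallTriggersElection(int heartbeat_interval_ms,
                           double max_missed_periods,
                           int stall_ms) {
  return stall_ms > FailureDetectionWindowMs(heartbeat_interval_ms,
                                             max_missed_periods);
}
```

Under these assumptions, a 16ms interval with a multiplier of 3 gives
a 48ms window, so a 100ms stall is enough to trigger an election,
whereas the same stall would be tolerated with a much longer interval.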

The following changes were made to address the flakiness due
to the described scenarios:
  * increasing the Raft heartbeat interval
  * increasing max length of the Raft RPC service queue
  * increasing the back-off interval after leader election failures
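
The three adjustments listed above could be expressed as extra master
flags, roughly as in the sketch below.  The helper name is
hypothetical, the flag names are given as best recalled from the Kudu
code base and should be verified against the source, and the values
passed in by the caller are illustrative only; they are not the values
used in this patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: builds the extra master flags that correspond to
// the three tuning knobs.  Flag names are assumptions to be checked
// against the Kudu source; the caller supplies the actual values.
std::vector<std::string> BuildMasterFlags(int heartbeat_interval_ms,
                                          int rpc_queue_length,
                                          int backoff_max_delta_ms) {
  assert(heartbeat_interval_ms > 0);
  assert(rpc_queue_length > 0);
  assert(backoff_max_delta_ms > 0);
  return {
    // Longer interval between Raft heartbeats.
    "--raft_heartbeat_interval_ms=" +
        std::to_string(heartbeat_interval_ms),
    // Longer RPC service queue, so fewer requests are rejected.
    "--rpc_service_queue_length=" +
        std::to_string(rpc_queue_length),
    // Longer maximum back-off after a failed leader election.
    "--leader_failure_exp_backoff_max_delta_ms=" +
        std::to_string(backoff_max_delta_ms),
  };
}
```

A test could then append the returned strings to its list of extra
master flags before starting the mini-cluster.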

After making the changes above, the test became more stable.  Not
a single failure was spotted in multiple 1K runs executed via dist-test
with --stress_cpu_threads=16:

ASAN:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504914035.18113

DEBUG:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504911962.26895

RELEASE:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504913524.8185

TSAN:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504903775.17126

This is a follow-up for faa0b14effb6e15f9989d686e5a1f8e1040a1dd6.

Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
---
M src/kudu/integration-tests/catalog_manager_tsk-itest.cc
1 file changed, 5 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/17/8017/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8017
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>