Alexey Serbin has uploaded a new change for review. http://gerrit.cloudera.org:8080/8017
Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................

[tests] de-flaking catalog_manager_tsk-itest

After recent updates, catalog_manager_tsk-itest became unstable. One
of the failure scenarios is when the test tablet server cannot
register with the cluster within the specified timeout (30 seconds).
It seems some test machines are too slow to accommodate the test
scenario with a 16ms Raft heartbeat interval.

When running the test with too short a Raft heartbeat interval, the
following scenario occurs on slow or very busy machines:

  * An election happens among the masters (e.g., term 1) and a leader
    master is elected.

  * Shortly after that, the followers stop receiving some Raft
    heartbeats from the leader within the specified timeout interval.

  * The followers start a new election, but experience timeouts for
    vote requests among them as well.

  * The leader fails to get responses from the followers for its
    UpdateConsensus RPC requests.

  * The tablet server fails to register with the cluster.

Sometimes the scenario above is compounded by incoming Raft requests
being dropped due to backpressure on the Raft RPC service queue in the
masters.

The following changes were made to address the flakiness caused by the
described scenarios:

  * increasing the Raft heartbeat interval

  * increasing the max length of the Raft RPC service queue

  * increasing the back-off interval after leader election failures

After making the changes above, the test became more stable. Not a
single failure was observed in multiple 1K runs via dist-test with
--stress_cpu_threads=16:

  ASAN:    http://dist-test.cloudera.org/job?job_id=aserbin.1504914035.18113
  DEBUG:   http://dist-test.cloudera.org/job?job_id=aserbin.1504911962.26895
  RELEASE: http://dist-test.cloudera.org/job?job_id=aserbin.1504913524.8185
  TSAN:    http://dist-test.cloudera.org/job?job_id=aserbin.1504903775.17126

This is a follow-up for faa0b14effb6e15f9989d686e5a1f8e1040a1dd6.
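
For illustration only, the three adjustments described above could be
passed to the masters as extra command-line flags in the test setup.
The sketch below is not the code from this patch: the struct, the
helper name, and the numeric values are hypothetical, and the flag
names are written as I understand them to exist in Kudu, so they
should be checked against the source tree before use.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical container for the three knobs named in the commit message.
struct MasterRaftTuning {
  int heartbeat_interval_ms;    // Raft heartbeat interval
  int rpc_queue_length;         // max length of the RPC service queue
  int election_backoff_max_ms;  // cap on back-off after failed elections
};

// Renders the knobs as gflags-style arguments.
// Assumption: the flag names below match the ones Kudu masters accept.
std::vector<std::string> BuildExtraMasterFlags(const MasterRaftTuning& t) {
  // Non-positive values would make no sense for any of these settings.
  assert(t.heartbeat_interval_ms > 0);
  assert(t.rpc_queue_length > 0);
  assert(t.election_backoff_max_ms > 0);
  return {
    "--raft_heartbeat_interval_ms=" +
        std::to_string(t.heartbeat_interval_ms),
    "--rpc_service_queue_length=" +
        std::to_string(t.rpc_queue_length),
    "--leader_failure_exp_backoff_max_delta_ms=" +
        std::to_string(t.election_backoff_max_ms),
  };
}
```

In an external mini-cluster test, a list like this would typically be
assigned to the options' extra master flags before the cluster is
started. Keeping the values in one struct makes it easy to see which
settings were relaxed for slow machines.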

Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
---
M src/kudu/integration-tests/catalog_manager_tsk-itest.cc
1 file changed, 5 insertions(+), 3 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/17/8017/1

--
To view, visit http://gerrit.cloudera.org:8080/8017
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>