Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7887

to look at the new patch set (#2).

Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader
......................................................................

WIP [raft_consensus-itest] fix flake in TestSlowLeader

Under rare conditions, a reader thread of the test workload in the
RaftConsensusITest.TestSlowLeader test might read data from a lagging
follower replica while the writer threads have just switched to a newly
elected leader replica. When that happened, the test failed with a
stack trace like the following:

  F0829 01:47:51.052803 1605 test_workload.cc:230] Check failed: \
      row_count >= expected_row_count (219550 vs. 249850)
  *** Check failure stack trace: ***
      @ 0x95e83d google::LogMessage::Fail() \
          at thirdparty/src/glog-0.3.5/src/logging.cc:1488
      @ 0x9606fd google::LogMessage::SendToLog() \
          at thirdparty/src/glog-0.3.5/src/logging.cc:1442
      @ 0x95e379 google::LogMessage::Flush() \
          at thirdparty/src/glog-0.3.5/src/logging.cc:1312
      @ 0x96119f google::LogMessageFatal::~LogMessageFatal() \
          at thirdparty/src/glog-0.3.5/src/logging.cc:2024
      @ 0x955809 kudu::TestWorkload::ReadThread() \
          at tr1/shared_ptr.h:340
      @ 0x7fbdbc108a40 (unknown) at ??:0
      @ 0x7fbdbd550184 start_thread at ??:0
      @ 0x7fbdbbb7637d clone at ??:0
      @ (nil) (unknown)

This patch addresses the issue. It works around the read-your-writes
(RYW) consistency issue, which shows up here because the leader leases
mechanism is not yet implemented.

To test the modifications, I ran 2K iterations of the test several
times and none of them failed (RELEASE build):
  http://dist-test.cloudera.org//job?job_id=aserbin.1504051374.17423

To run the test without the workaround, I applied the patch below and
got 1 failure out of 2K runs:
  http://dist-test.cloudera.org//job?job_id=aserbin.1504053764.1127

WIP because I'm not sure whether we want this workaround now or should
rather wait for leader leases to be implemented.

------------------------------------------------------------------------
--- a/src/kudu/integration-tests/raft_consensus-itest.cc
+++ b/src/kudu/integration-tests/raft_consensus-itest.cc
@@ -2571,7 +2571,7 @@ TEST_F(RaftConsensusITest, TestSlowLeader) {
   if (!AllowSlowTests()) return;
 
   static const int kHbIntervalMs = 32;
-  static const int kMaxMissedHbPeriods = 3;
+  static const int kMaxMissedHbPeriods = 1;
   const vector<string> tserver_flags = {
     Substitute("--raft_heartbeat_interval_ms=$0", kHbIntervalMs),
     Substitute("--leader_failure_max_missed_heartbeat_periods=$0",
@@ -2586,9 +2586,9 @@ TEST_F(RaftConsensusITest, TestSlowLeader) {
   TestWorkload workload(cluster_.get());
   workload.set_table_name(kTableId);
   workload.set_num_read_threads(2);
-  workload.set_read_retry_enabled(true);
-  workload.set_read_retry_delay(
-      MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods));
+  //workload.set_read_retry_enabled(true);
+  //workload.set_read_retry_delay(
+  //    MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods));
   workload.Setup();
   workload.Start();
   SleepFor(MonoDelta::FromSeconds(60));
------------------------------------------------------------------------

Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
---
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
3 files changed, 43 insertions(+), 12 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/7887/2
--
To view, visit http://gerrit.cloudera.org:8080/7887
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
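
Illustration of the read-retry idea from the commit message above. This
is only a rough sketch, not the code in test_workload.cc: the helper
name ReadRowCountWithRetry, its parameters, and the attempt limit are
all hypothetical. It assumes the workaround amounts to re-reading after
a delay when a read returns fewer rows than expected, so that a lagging
follower gets a chance to catch up before the check is treated as a
failure.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical helper (not part of Kudu's TestWorkload API). Calls
// `read_row_count` until it returns at least `expected_row_count` or until
// `max_attempts` reads have been made. Returns true if some read met the
// expectation; the most recent count is stored in `*last_row_count`.
bool ReadRowCountWithRetry(const std::function<int64_t()>& read_row_count,
                           int64_t expected_row_count,
                           std::chrono::milliseconds retry_delay,
                           int max_attempts,
                           int64_t* last_row_count) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    *last_row_count = read_row_count();
    if (*last_row_count >= expected_row_count) {
      return true;
    }
    // The read may have been served by a lagging follower. Wait before the
    // next attempt, but do not sleep after the final one.
    if (attempt + 1 < max_attempts) {
      std::this_thread::sleep_for(retry_delay);
    }
  }
  return false;
}
```

In the test, the retry delay is set to kHbIntervalMs * kMaxMissedHbPeriods
(32 ms * 3 = 96 ms), that is, the time after which a follower considers
the leader failed.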