[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/7887 ) Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/7887/2/src/kudu/integration-tests/test_workload.cc File src/kudu/integration-tests/test_workload.cc: http://gerrit.cloudera.org:8080/#/c/7887/2/src/kudu/integration-tests/test_workload.cc@220 PS2, Line 220: while (true) { instead what if we just used READ_AT_SNAPSHOT scan mode and propagated timestamps back from the writer threads? that way we'd actually get coverage of our snapshot scan code paths. -- To view, visit http://gerrit.cloudera.org:8080/7887 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca Gerrit-Change-Number: 7887 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 06 Oct 2017 01:29:24 + Gerrit-HasComments: Yes
[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7887 ) Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader .. Patch Set 2: I think this change makes sense but I am wondering what the purpose of the read thread is in the test, anyway. The original test doesn't do a good job of justifying its existence or implementation in my opinion. I am honestly surprised it is not more flaky than reported here. -- To view, visit http://gerrit.cloudera.org:8080/7887 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca Gerrit-Change-Number: 7887 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Comment-Date: Mon, 02 Oct 2017 21:51:57 + Gerrit-HasComments: No
[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader
Alexey Serbin has posted comments on this change. Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader .. Patch Set 2: > can you explain a little better the exact scenario that happens? > could you cause it deterministically or semi-determinically? could > you write a disabled test for it? Thank you for looking at the patch. So far I have no better explanation than in the changelist description. Right now the failure happens semi-deterministically if running this test (1 out of 2K fails). Yes, I think it would be possible to create a targeted test for that -- while doing so, I could produce a better explanation for the failure path scenario. -- To view, visit http://gerrit.cloudera.org:8080/7887 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader
David Ribeiro Alves has posted comments on this change. Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader .. Patch Set 2: can you explain a little better the exact scenario that happens? could you cause it deterministically or semi-determinically? could you write a disabled test for it? -- To view, visit http://gerrit.cloudera.org:8080/7887 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader
Alexey Serbin has posted comments on this change. Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/7887/1/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: Line 2591: MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods)); > warning: either cast from 'int' to 'int64_t' (aka 'long') is ineffective, o Done http://gerrit.cloudera.org:8080/#/c/7887/1/src/kudu/integration-tests/test_workload.h File src/kudu/integration-tests/test_workload.h: Line 121: read_retry_delay_ = std::move(delay); > warning: std::move of the variable 'delay' of the trivially-copyable type ' Done -- To view, visit http://gerrit.cloudera.org:8080/7887 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7887 to look at the new patch set (#2). Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader .. WIP [raft_consensus-itest] fix flake in TestSlowLeader Under rare conditions, a reader thread of the test workload of the RaftConsensusITest.TestSlowLeader test might read data from a lagging follower replica, while the writer threads just switched to a newly elected leader replica. When that happened, the test failed with a stack trace like the following: F0829 01:47:51.052803 1605 test_workload.cc:230] Check failed: \ row_count >= expected_row_count (219550 vs. 249850) *** Check failure stack trace: *** @ 0x95e83d google::LogMessage::Fail() \ at thirdparty/src/glog-0.3.5/src/logging.cc:1488 @ 0x9606fd google::LogMessage::SendToLog() \ at thirdparty/src/glog-0.3.5/src/logging.cc:1442 @ 0x95e379 google::LogMessage::Flush() \ at thirdparty/src/glog-0.3.5/src/logging.cc:1312 @ 0x96119f google::LogMessageFatal::~LogMessageFatal() \ at thirdparty/src/glog-0.3.5/src/logging.cc:2024 @ 0x955809 kudu::TestWorkload::ReadThread() \ at tr1/shared_ptr.h:340 @ 0x7fbdbc108a40 (unknown) at ??:0 @ 0x7fbdbd550184 start_thread at ??:0 @ 0x7fbdbbb7637d clone at ??:0 @ (nil) (unknown) This patch addresses the issue. Basically, it works around the RYW consistency issue which is seen here because of the absense of the leader leases mechanism. To test the modifications, I run the test 2K times multiple times, none of those failed (RELEASE build): http://dist-test.cloudera.org//job?job_id=aserbin.1504051374.17423 To run the test without the work-around, I applied the patch below and get 1 out of 2K failed: http://dist-test.cloudera.org//job?job_id=aserbin.1504053764.1127 WIP: because I'm not sure whether we want this workaround now or we would better wait for the leader leases to be implemented. --- a/src/kudu/integration-tests/raft_consensus-itest.cc +++ b/src/kudu/integration-tests/raft_consensus-itest.cc @@ -2571,7 +2571,7 @@ TEST_F(RaftConsensusITest, TestSlowLeader) { if (!AllowSlowTests()) return; static const int kHbIntervalMs = 32; - static const int kMaxMissedHbPeriods = 3; + static const int kMaxMissedHbPeriods = 1; const vector tserver_flags = { Substitute("--raft_heartbeat_interval_ms=$0", kHbIntervalMs), Substitute("--leader_failure_max_missed_heartbeat_periods=$0", @@ -2586,9 +2586,9 @@ TEST_F(RaftConsensusITest, TestSlowLeader) { TestWorkload workload(cluster_.get()); workload.set_table_name(kTableId); workload.set_num_read_threads(2); - workload.set_read_retry_enabled(true); - workload.set_read_retry_delay( - MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods)); + //workload.set_read_retry_enabled(true); + //workload.set_read_retry_delay( + //MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods)); workload.Setup(); workload.Start(); SleepFor(MonoDelta::FromSeconds(60)); Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca --- M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/test_workload.cc M src/kudu/integration-tests/test_workload.h 3 files changed, 43 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/7887/2 -- To view, visit http://gerrit.cloudera.org:8080/7887 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot
[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader
Alexey Serbin has uploaded a new change for review. http://gerrit.cloudera.org:8080/7887 Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader .. WIP [raft_consensus-itest] fix flake in TestSlowLeader Under rare conditions, a reader thread of the test workload of the RaftConsensusITest.TestSlowLeader test might read data from a lagging follower replica, while the writer threads just switched to a newly elected leader replica. When that happened, the test failed with a stack trace like the following: F0829 01:47:51.052803 1605 test_workload.cc:230] Check failed: \ row_count >= expected_row_count (219550 vs. 249850) *** Check failure stack trace: *** @ 0x95e83d google::LogMessage::Fail() \ at thirdparty/src/glog-0.3.5/src/logging.cc:1488 @ 0x9606fd google::LogMessage::SendToLog() \ at thirdparty/src/glog-0.3.5/src/logging.cc:1442 @ 0x95e379 google::LogMessage::Flush() \ at thirdparty/src/glog-0.3.5/src/logging.cc:1312 @ 0x96119f google::LogMessageFatal::~LogMessageFatal() \ at thirdparty/src/glog-0.3.5/src/logging.cc:2024 @ 0x955809 kudu::TestWorkload::ReadThread() \ at tr1/shared_ptr.h:340 @ 0x7fbdbc108a40 (unknown) at ??:0 @ 0x7fbdbd550184 start_thread at ??:0 @ 0x7fbdbbb7637d clone at ??:0 @ (nil) (unknown) This patch addresses the issue. Basically, it works around the RYW consistency issue which is seen here because of the absense of the leader leases mechanism. To test the modifications, I run the test 2K times multiple times, none of those failed (RELEASE build): http://dist-test.cloudera.org//job?job_id=aserbin.1504051374.17423 To run the test without the work-around, I applied the patch below and get 1 out of 2K failed: http://dist-test.cloudera.org//job?job_id=aserbin.1504053764.1127 WIP: because I'm not sure whether we want this workaround now or we would better wait for the leader leases to be implemented. --- a/src/kudu/integration-tests/raft_consensus-itest.cc +++ b/src/kudu/integration-tests/raft_consensus-itest.cc @@ -2571,7 +2571,7 @@ TEST_F(RaftConsensusITest, TestSlowLeader) { if (!AllowSlowTests()) return; static const int kHbIntervalMs = 32; - static const int kMaxMissedHbPeriods = 3; + static const int kMaxMissedHbPeriods = 1; const vector tserver_flags = { Substitute("--raft_heartbeat_interval_ms=$0", kHbIntervalMs), Substitute("--leader_failure_max_missed_heartbeat_periods=$0", @@ -2586,9 +2586,9 @@ TEST_F(RaftConsensusITest, TestSlowLeader) { TestWorkload workload(cluster_.get()); workload.set_table_name(kTableId); workload.set_num_read_threads(2); - workload.set_read_retry_enabled(true); - workload.set_read_retry_delay( - MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods)); + //workload.set_read_retry_enabled(true); + //workload.set_read_retry_delay( + //MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods)); workload.Setup(); workload.Start(); SleepFor(MonoDelta::FromSeconds(60)); Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca --- M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/test_workload.cc M src/kudu/integration-tests/test_workload.h 3 files changed, 43 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/7887/1 -- To view, visit http://gerrit.cloudera.org:8080/7887 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin