[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow (again)
Todd Lipcon has posted comments on this change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow (again) .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6943 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow (again)
Todd Lipcon has submitted this change and it was merged. Change subject: Fix flaky test TestRecoverFromOpIdOverflow (again) .. Fix flaky test TestRecoverFromOpIdOverflow (again) The previous attempt to fix this in commit f0580499dc50e8a47ff6251301cdc15b9b79edcb had a flaw, but this test really does fix the primary source of the flakiness. What appears to have happened in the previous attempt is the dist-test passed and then I made a couple additional tweaks before committing it which actually broke it again. The only "real" code change (the aforementioned fix) is on lines L367-L371, however while I was in this test I also "modernized" it a bit by making it inherit from ExternalMiniClusterITestBase which resulted in a net-negative line count in this patch. I ran the current version of this patch on dist-test in DEBUG mode with 8 cpu stress threads, and 199/200 passed (there is a nearly 50% failure rate with 8 stress threads without this fix). The one that failed actually timed out (with no logs, so I have no idea what went wrong) but it is likely some unrelated (infrastructure?) issue. This was the dist-test job: http://dist-test.cloudera.org/job?job_id=mpercy.1495321732.3266 Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 Reviewed-on: http://gerrit.cloudera.org:8080/6943 Tested-by: Kudu Jenkins Reviewed-by: Todd Lipcon --- M src/kudu/integration-tests/ts_recovery-itest.cc 1 file changed, 23 insertions(+), 48 deletions(-) Approvals: Todd Lipcon: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/6943 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow (again)
Mike Percy has uploaded a new patch set (#3). Change subject: Fix flaky test TestRecoverFromOpIdOverflow (again) .. Fix flaky test TestRecoverFromOpIdOverflow (again) The previous attempt to fix this in commit f0580499dc50e8a47ff6251301cdc15b9b79edcb had a flaw, but this test really does fix the primary source of the flakiness. What appears to have happened in the previous attempt is the dist-test passed and then I made a couple additional tweaks before committing it which actually broke it again. The only "real" code change (the aforementioned fix) is on lines L367-L371, however while I was in this test I also "modernized" it a bit by making it inherit from ExternalMiniClusterITestBase which resulted in a net-negative line count in this patch. I ran the current version of this patch on dist-test in DEBUG mode with 8 cpu stress threads, and 199/200 passed (there is a nearly 50% failure rate with 8 stress threads without this fix). The one that failed actually timed out (with no logs, so I have no idea what went wrong) but it is likely some unrelated (infrastructure?) issue. This was the dist-test job: http://dist-test.cloudera.org/job?job_id=mpercy.1495321732.3266 Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 --- M src/kudu/integration-tests/ts_recovery-itest.cc 1 file changed, 23 insertions(+), 48 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/43/6943/3 -- To view, visit http://gerrit.cloudera.org:8080/6943 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow (again)
Mike Percy has posted comments on this change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow (again) .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/6943/2/src/kudu/integration-tests/ts_recovery-itest.cc File src/kudu/integration-tests/ts_recovery-itest.cc: PS2, Line 75: const int kOneReplica = 1; > Don't think this adds much value. If the goal was to make the parameter to Done -- To view, visit http://gerrit.cloudera.org:8080/6943 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow (again)
Todd Lipcon has posted comments on this change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow (again) .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/6943/2/src/kudu/integration-tests/ts_recovery-itest.cc File src/kudu/integration-tests/ts_recovery-itest.cc: PS2, Line 75: const int kOneReplica = 1; Don't think this adds much value. If the goal was to make the parameter to StartCluster more obvious, I think it's better to just do: StartCluster(extra_ts_flags, {}, /*replicas=*/ 1); or something of that nature -- To view, visit http://gerrit.cloudera.org:8080/6943 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow (again)
Mike Percy has posted comments on this change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow (again) .. Patch Set 2: Verified+1 Overriding failure due to known flaky -- To view, visit http://gerrit.cloudera.org:8080/6943 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy Gerrit-HasComments: No
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow (again)
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6943 to look at the new patch set (#2). Change subject: Fix flaky test TestRecoverFromOpIdOverflow (again) .. Fix flaky test TestRecoverFromOpIdOverflow (again) The previous attempt to fix this in commit f0580499dc50e8a47ff6251301cdc15b9b79edcb had a flaw, but this test really does fix the primary source of the flakiness. What appears to have happened in the previous attempt is the dist-test passed and then I made a couple additional tweaks before committing it which actually broke it again. The only "real" code change (the aforementioned fix) is on lines L367-L371, however while I was in this test I also "modernized" it a bit by making it inherit from ExternalMiniClusterITestBase which resulted in a net-negative line count in this patch. I ran the current version of this patch on dist-test in DEBUG mode with 8 cpu stress threads, and 199/200 passed (there is a nearly 50% failure rate with 8 stress threads without this fix). The one that failed actually timed out (with no logs, so I have no idea what went wrong) but it is likely some unrelated (infrastructure?) issue. This was the dist-test job: http://dist-test.cloudera.org/job?job_id=mpercy.1495321732.3266 Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 --- M src/kudu/integration-tests/ts_recovery-itest.cc 1 file changed, 30 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/43/6943/2 -- To view, visit http://gerrit.cloudera.org:8080/6943 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow (again)
Hello David Ribeiro Alves, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/6943 to review the following change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow (again) .. Fix flaky test TestRecoverFromOpIdOverflow (again) The previous attempt to fix this in commit f0580499dc50e8a47ff6251301cdc15b9b79edcb had a flaw, but this test really does fix the primary source of the flakiness. What appears to have happened in the previous attempt is the dist-test passed and then I made a couple additional tweaks before committing it which actually broke it again. The only "real" code change (the aforementioned fix) is on lines L367-L371, however while I was in this test I also "modernized" it a bit by making it inherit from ExternalMiniClusterITestBase which resulted in a net-negative line count in this patch. I ran the current version of this patch on dist-test in DEBUG mode with 8 cpu stress threads, and 199/200 passed (there is a nearly 50% failure rate with 8 stress threads without this fix). The one that failed actually timed out (with no logs, so I have no idea what went wrong) but it is likely some unrelated (infrastructure?) issue. This was the dist-test job: http://dist-test.cloudera.org/job?job_id=mpercy.1495321732.3266 Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 --- M src/kudu/integration-tests/ts_recovery-itest.cc 1 file changed, 29 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/43/6943/1 -- To view, visit http://gerrit.cloudera.org:8080/6943 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow
David Ribeiro Alves has submitted this change and it was merged. Change subject: Fix flaky test TestRecoverFromOpIdOverflow .. Fix flaky test TestRecoverFromOpIdOverflow This test is flaky because we race against the COMMIT message for the first NO_OP in the WAL being written. It is currently hard to know when the actual COMMIT message is written to the WAL so we use a workaround to delete the first log segment before restarting the EMC in this test. If the COMMIT doesn't get written in time then the tablet bootstrap process doesn't prune that entry at startup time and it ends up in the new log prior to the much higher-numbered log entries. The flaky test failure was due to an out of order log index being detected, and looked like the following error in the log: F0504 21:28:21.128690 13908 raft_consensus_state.cc:502] Check failed: _s.ok() Bad status: Corruption: New operation's index does not follow the previous op's index. Current: 2147483648.2147483648. Previous: 1.1 *** Check failure stack trace: *** @ 0x7fe0adef915d google::LogMessage::Fail() at ??:0 @ 0x7fe0adefb05d google::LogMessage::SendToLog() at ??:0 @ 0x7fe0adef8c99 google::LogMessage::Flush() at ??:0 @ 0x7fe0adefbaff google::LogMessageFatal::~LogMessageFatal() at ??:0 @ 0x7fe0b57a8018 kudu::consensus::PendingRounds::AdvanceCommittedIndex() at ??:0 @ 0x7fe0b57848a5 kudu::consensus::RaftConsensus::NotifyCommitIndex() at ??:0 @ 0x7fe0b5749ccf kudu::consensus::PeerMessageQueue::NotifyObserversOfCommitIndexChangeTask() at ??:0 @ 0x7fe0b575bdc1 kudu::internal::RunnableAdapter<>::Run() at ??:0 Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 Reviewed-on: http://gerrit.cloudera.org:8080/6808 Tested-by: Kudu Jenkins Reviewed-by: David Ribeiro Alves --- M src/kudu/integration-tests/ts_recovery-itest.cc 1 file changed, 28 insertions(+), 0 deletions(-) Approvals: David Ribeiro Alves: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/6808 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow
David Ribeiro Alves has posted comments on this change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6808 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-HasComments: No
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow
Mike Percy has uploaded a new patch set (#3). Change subject: Fix flaky test TestRecoverFromOpIdOverflow .. Fix flaky test TestRecoverFromOpIdOverflow This test is flaky because we race against the COMMIT message for the first NO_OP in the WAL being written. It is currently hard to know when the actual COMMIT message is written to the WAL so we use a workaround to delete the first log segment before restarting the EMC in this test. If the COMMIT doesn't get written in time then the tablet bootstrap process doesn't prune that entry at startup time and it ends up in the new log prior to the much higher-numbered log entries. The flaky test failure was due to an out of order log index being detected, and looked like the following error in the log: F0504 21:28:21.128690 13908 raft_consensus_state.cc:502] Check failed: _s.ok() Bad status: Corruption: New operation's index does not follow the previous op's index. Current: 2147483648.2147483648. Previous: 1.1 *** Check failure stack trace: *** @ 0x7fe0adef915d google::LogMessage::Fail() at ??:0 @ 0x7fe0adefb05d google::LogMessage::SendToLog() at ??:0 @ 0x7fe0adef8c99 google::LogMessage::Flush() at ??:0 @ 0x7fe0adefbaff google::LogMessageFatal::~LogMessageFatal() at ??:0 @ 0x7fe0b57a8018 kudu::consensus::PendingRounds::AdvanceCommittedIndex() at ??:0 @ 0x7fe0b57848a5 kudu::consensus::RaftConsensus::NotifyCommitIndex() at ??:0 @ 0x7fe0b5749ccf kudu::consensus::PeerMessageQueue::NotifyObserversOfCommitIndexChangeTask() at ??:0 @ 0x7fe0b575bdc1 kudu::internal::RunnableAdapter<>::Run() at ??:0 Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 --- M src/kudu/integration-tests/ts_recovery-itest.cc 1 file changed, 28 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/08/6808/3 -- To view, visit http://gerrit.cloudera.org:8080/6808 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow
Mike Percy has posted comments on this change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/6808/2/src/kudu/integration-tests/ts_recovery-itest.cc File src/kudu/integration-tests/ts_recovery-itest.cc: PS2, Line 370: if it doesn't also contain the COMMIT message for 1.1 : // yet then it will trigger a CHECK complaining about non-sequential OpIds : // in the WAL at tablet bootstrap time. > maybe this would be better stated as: "If the commit message for this NO_OP Done PS2, Line 373: string wal_dir = fs_manager->GetTabletWalDir(tablet_id); : vector wal_children; : ASSERT_OK(fs_manager->env()->GetChildren(wal_dir, &wal_children)); : // Skip '.', '..', and index files. : unordered_set wal_segments; : for (const auto& filename : wal_children) { : if (HasPrefixString(filename, FsManager::kWalFileNamePrefix)) { : wal_segments.insert(filename); : } : } > I don't think you need this. you're only checking for GT num files and the This is collecting the wal_segments which is used on L387 so I need the loop. The assertion is making sure we have at least 2 wal segments so that we know it's safe to delete one. I'll change it to ASSERT_GE(..., 2) to make it clearer. PS2, Line 383: Files in dir > I guess my confusion stemmed from the fact that "Files in dir:" can be inte Done -- To view, visit http://gerrit.cloudera.org:8080/6808 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy Gerrit-HasComments: Yes
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow
David Ribeiro Alves has posted comments on this change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow .. Patch Set 2: (3 comments) http://gerrit.cloudera.org:8080/#/c/6808/2/src/kudu/integration-tests/ts_recovery-itest.cc File src/kudu/integration-tests/ts_recovery-itest.cc: PS2, Line 370: if it doesn't also contain the COMMIT message for 1.1 : // yet then it will trigger a CHECK complaining about non-sequential OpIds : // in the WAL at tablet bootstrap time. maybe this would be better stated as: "If the commit message for this NO_OP (OpId 1.1) was not written to disk yet, then it might get written _after_ the ops with the overflowed ids above, triggering a CHECK about non sequential OpIds. If we remove the first segment then the tablet will just assume that commit messages for all replicates in previous segments have already been written, thus avoiding the check. PS2, Line 373: string wal_dir = fs_manager->GetTabletWalDir(tablet_id); : vector wal_children; : ASSERT_OK(fs_manager->env()->GetChildren(wal_dir, &wal_children)); : // Skip '.', '..', and index files. : unordered_set wal_segments; : for (const auto& filename : wal_children) { : if (HasPrefixString(filename, FsManager::kWalFileNamePrefix)) { : wal_segments.insert(filename); : } : } I don't think you need this. you're only checking for GT num files and the ASSERT_OK on LOC 387 already returns an error if the file doesn't exist PS2, Line 383: Files in dir I guess my confusion stemmed from the fact that "Files in dir:" can be interpreted as "Files in this directory" which I did, and then I found odd that you were printing the files instead of the directory. Maybe rephrase a bit? -- To view, visit http://gerrit.cloudera.org:8080/6808 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy Gerrit-HasComments: Yes
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow
Mike Percy has posted comments on this change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow .. Patch Set 2: Verified+1 Overriding Jenkins failure due to KUDU-1736. -- To view, visit http://gerrit.cloudera.org:8080/6808 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy Gerrit-HasComments: No
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6808 to look at the new patch set (#2). Change subject: Fix flaky test TestRecoverFromOpIdOverflow .. Fix flaky test TestRecoverFromOpIdOverflow This test is flaky because we race against the COMMIT message for the first NO_OP in the WAL being written. It is currently hard to know when the actual COMMIT message is written to the WAL so we use a workaround to delete the first log segment before restarting the EMC in this test. If the COMMIT doesn't get written in time then the tablet bootstrap process doesn't prune that entry at startup time and it ends up in the new log prior to the much higher-numbered log entries. The flaky test failure was due to an out of order log index being detected, and looked like the following error in the log: F0504 21:28:21.128690 13908 raft_consensus_state.cc:502] Check failed: _s.ok() Bad status: Corruption: New operation's index does not follow the previous op's index. Current: 2147483648.2147483648. Previous: 1.1 *** Check failure stack trace: *** @ 0x7fe0adef915d google::LogMessage::Fail() at ??:0 @ 0x7fe0adefb05d google::LogMessage::SendToLog() at ??:0 @ 0x7fe0adef8c99 google::LogMessage::Flush() at ??:0 @ 0x7fe0adefbaff google::LogMessageFatal::~LogMessageFatal() at ??:0 @ 0x7fe0b57a8018 kudu::consensus::PendingRounds::AdvanceCommittedIndex() at ??:0 @ 0x7fe0b57848a5 kudu::consensus::RaftConsensus::NotifyCommitIndex() at ??:0 @ 0x7fe0b5749ccf kudu::consensus::PeerMessageQueue::NotifyObserversOfCommitIndexChangeTask() at ??:0 @ 0x7fe0b575bdc1 kudu::internal::RunnableAdapter<>::Run() at ??:0 Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 --- M src/kudu/integration-tests/ts_recovery-itest.cc 1 file changed, 24 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/08/6808/2 -- To view, visit http://gerrit.cloudera.org:8080/6808 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow
Mike Percy has posted comments on this change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/6808/1//COMMIT_MSG Commit Message: PS1, Line 9: This test is flaky because we race against the COMMIT message for the : first NO_OP in the WAL being written. It is currently hard to know when : the actual COMMIT message is written to the WAL so we use a workaround : to delete the first log segment before restarting the EMC in this test. > can you explain a little better how we'd get an overflow here? The flakiness is not caused by an overflow, it's caused by an out-of-order log index. I'll update the commit message to clarify that. http://gerrit.cloudera.org:8080/#/c/6808/1/src/kudu/integration-tests/ts_recovery-itest.cc File src/kudu/integration-tests/ts_recovery-itest.cc: PS1, Line 368: // Before restarting the tablet server, delete the initial log segment from : // disk (the original leader election NO_OP) if it exists since it will : // contain OpId 1.1; if it doesn't also contain the COMMIT message for 1.1 : // yet then it will trigger a CHECK complaining about non-sequential OpIds : // in the WAL at tablet bootstrap time. > If I understand what you're doing it that you remove the first segment to m I am just removing the first segment because it contains OpId 1.1 for the initial leader election which is followed in the WAL by something I inject which looks like OpId 2147483648.2147483648 The failure looked like this: F0504 21:28:21.128690 13908 raft_consensus_state.cc:502] Check failed: _s.ok() Bad status: Corruption: New operation's index does not follow the previous op's index. Current: 2147483648.2147483648. Previous: 1.1 *** Check failure stack trace: *** @ 0x7fe0adef915d google::LogMessage::Fail() at ??:0 @ 0x7fe0adefb05d google::LogMessage::SendToLog() at ??:0 @ 0x7fe0adef8c99 google::LogMessage::Flush() at ??:0 @ 0x7fe0adefbaff google::LogMessageFatal::~LogMessageFatal() at ??:0 @ 0x7fe0b57a8018 kudu::consensus::PendingRounds::AdvanceCommittedIndex() at ??:0 @ 0x7fe0b57848a5 kudu::consensus::RaftConsensus::NotifyCommitIndex() at ??:0 @ 0x7fe0b5749ccf kudu::consensus::PeerMessageQueue::NotifyObserversOfCommitIndexChangeTask() at ??:0 PS1, Line 383: wal_children > you mean "wal_dir" right? No, that just prints all the files in the wal dir for debugging purposes. We can get the wal dir from the log when the EMC starts up. -- To view, visit http://gerrit.cloudera.org:8080/6808 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-HasComments: Yes
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow
David Ribeiro Alves has posted comments on this change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/6808/1//COMMIT_MSG Commit Message: PS1, Line 9: This test is flaky because we race against the COMMIT message for the : first NO_OP in the WAL being written. It is currently hard to know when : the actual COMMIT message is written to the WAL so we use a workaround : to delete the first log segment before restarting the EMC in this test. can you explain a little better how we'd get an overflow here? http://gerrit.cloudera.org:8080/#/c/6808/1/src/kudu/integration-tests/ts_recovery-itest.cc File src/kudu/integration-tests/ts_recovery-itest.cc: PS1, Line 368: // Before restarting the tablet server, delete the initial log segment from : // disk (the original leader election NO_OP) if it exists since it will : // contain OpId 1.1; if it doesn't also contain the COMMIT message for 1.1 : // yet then it will trigger a CHECK complaining about non-sequential OpIds : // in the WAL at tablet bootstrap time. If I understand what you're doing it that you remove the first segment to make sure the old op isn't committed on restart. Is that right? Can this happen in the wild? PS1, Line 383: wal_children you mean "wal_dir" right? -- To view, visit http://gerrit.cloudera.org:8080/6808 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-HasComments: Yes
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow
Hello David Ribeiro Alves, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/6808 to review the following change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow .. Fix flaky test TestRecoverFromOpIdOverflow This test is flaky because we race against the COMMIT message for the first NO_OP in the WAL being written. It is currently hard to know when the actual COMMIT message is written to the WAL so we use a workaround to delete the first log segment before restarting the EMC in this test. Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 --- M src/kudu/integration-tests/ts_recovery-itest.cc 1 file changed, 24 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/08/6808/1 -- To view, visit http://gerrit.cloudera.org:8080/6808 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ib382819307da04bb76d68d2c015dc0edd9f60267 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy Gerrit-Reviewer: David Ribeiro Alves