Adar Dembo has posted comments on this change. Change subject: [tools] Implement a manual leader_step_down for a tablet ......................................................................
Patch Set 2: (7 comments) http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/kudu-admin-test.cc File src/kudu/tools/kudu-admin-test.cc: PS1, Line 204: } > Sure, could you tell me what's the relationship between this being a main t The CHECK_* macros come from glog, and they all do the same thing regardless of the context in which they're called: 1. Evaluate an expression. 2. If evaluation fails, log something. 3. Dump a stack trace. 4. Call abort() to exit the process. The DCHECK_* macros (also from glog) are debug-only variants that are otherwise equivalent, but compiled out in non-debug builds. The ASSERT_* macros come from gtest, so they're specific to how gtest works. They too evaluate an expression, but on failure, they do some gtest stuff to capture the failure, end the run of the current test, and move to the next test. They have some foibles, and one of them is that ASSERT failures in the thread that isn't the main process thread don't seem to get noticed by gtest. Maybe there's something we need to do for that to happen; if there is, I don't know what it is. Oh, and EXPECT_* is like ASSERT_* except that the test keeps running. So ASSERT_* will only ever reveal the "first" failure, while EXPECT_* will reveal all of them. So to summarize: - Use CHECK_*/DCHECK_* macros in production code. Use DCHECK_* if you expect to get sufficient test coverage of the callsite, CHECK_* otherwise. - Use ASSERT_*/EXPECT_* macros in test-only code, unless that code is expected to run on a separate test thread, in which case use CHECK_* too. Default to EXPECT_*, unless the operation in question must succeed for the rest of the test to function properly, in which case use ASSERT_*. Hope this helps. Line 218: ASSERT_OK(Subprocess::Call({ > In reality yes, but given that we are operating in a deterministic test env Unfortunately a test environment with leader failure detection isn't deterministic. The test can run on a machine with arbitrary load (perhaps from other tests), in which case one of the tserver processes can pause for arbitrary periods of time, causing a new leader to be elected. This is one of the leading causes of (low-grade) test flakiness for us. PS1, Line 230: CHECK_OK(GetInfoFromConsensus(tservers, tablet_id_, : &new_leader, &new_term)); > I tried this with an unsuccessful result, later realized AssertEventually() Couple things: 1. I don't understand why AssertEventually() doesn't work here. AFAICT the goal of this snippet is "tell the leader to step down, wait for an election to happen, then assert that the new term is higher than the one before". If you wrap L229-L233 in AssertEventually(), doesn't that mean you don't need the 3 second sleep (which could be flaky to begin with, since a cluster running with ASAN/TSAN could take a lot longer to run an election)? I don't see what the issue is; can you help me understand? Look at some of the existing usages of AssertEventually() if it's not clear. 2. I'm not seeing the bug in the code you pasted. What is it? Line 264: s = GetInfoFromConsensus(tservers, tablet_id_, > Done The actual check (not in the CHECK sense of the word, but the comparison and assertion) is still here, though. http://gerrit.cloudera.org:8080/#/c/4533/2/src/kudu/tools/kudu-admin-test.cc File src/kudu/tools/kudu-admin-test.cc: PS2, Line 251: CHECK_OK Still got some leftover CHECK_* statements that should be converted to their ASSERT_* equivalents. http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/tool_action_tablet.cc File src/kudu/tools/tool_action_tablet.cc: PS1, Line 194: > Done, although not sure if we need a toggle given that the routine anyways On second thought, the ResolveAddresses output isn't even used, is it? So let's drop it altogether. Looks like address resolution happens within BuildProxy() so that extra call was redundant from day one (in ConfigChange() as well). PS1, Line 274: : > Nit: reformat as "Force the tablet's leader replica (if present) to step do You missed these comments down here. -- To view, visit http://gerrit.cloudera.org:8080/4533 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat <din...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org> Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes