Dinesh Bhat has posted comments on this change. Change subject: [tools] Implement a manual leader_step_down for a tablet ......................................................................
Patch Set 1: (13 comments) http://gerrit.cloudera.org:8080/#/c/4533/1//COMMIT_MSG Commit Message: PS1, Line 12: > Nit: got an extra space here. Done PS1, Line 18: tablet > You mean 'table' here, right? yeah, thanks for catching. http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/kudu-admin-test.cc File src/kudu/tools/kudu-admin-test.cc: PS1, Line 178: Status s = itest::GetConsensusState(ts, : tablet_id, : consensus::CONSENSUS_CONFIG_COMMITTED, : MonoDelta::FromSeconds(10), &cstate); : if (!s.ok()) { : return s; : } > Just use RETURN_NOT_OK(...). Done Line 198: int num_retries = 0; > Why is this defined way up here instead of closer to where it's used? Done, was thinking of keeping the *retries* variables together initially, but I agree about moving it down just before the use. Also just made it one variable now. PS1, Line 204: CHECK_EQ > Use ASSERT_* instead of CHECK_* since all of these are on the main test thr Sure, could you tell me what's the relationship between this being a main thread and these macros here ? I remember asking this to you earlier in one of earlier rev comments but forgot to follow up in-person. I am just curious if there are debug-only versions of these macros ? Line 218: CHECK_EQ(master_reported_leader, current_leader); > The leader can change between the GetInfoFromConsensus calls and L213, righ In reality yes, but given that we are operating in a deterministic test environment where the server failures are manually triggered by code, I thought this check is safe to have. However, I am not entirely sure what this check is buying us, I just happen to notice that there are more than one way we could find the leader replica, and I thought it would be good to compare them when we know that they don't change on the fly during the test. I can remove this if you think this might be a room for any flakiness. PS1, Line 230: // Wait for re-election to happen again. : // 3 seconds picked to accommodate consensus heartbeat timeout(1.5s) > Use AssertEventually() to make this more robust, and to minimize the amount I tried this with an unsuccessful result, later realized AssertEventually() kinda doesn't seem to fit this scenario here; Reason being: a) We really want to check the consensus state after the heartbeat timeout as opposed to executing f() and then sleeping; AssertEventually() aggressively(every ms) keeps executing the f() until the timeout hits. b) I won't be able to use assert from inside the lambda argument when we know that Consensus state is still under flux. Side note: I found one minor bug in AssertEventually() here which I can fix in another patch: int attempts = 0; int sleep_ms = (attempts < 10) ? (1 << attempts) : 1000; SleepFor(sleep_ms); Perhaps we can devise another routine which sleeps for N secs and then executes the function and returns the status ? Line 264: CHECK_EQ(master_reported_leader, ""); > Let's not check this. It's a common calling convention that if a function f Done PS1, Line 271: CHECK_EQ(current_leader, ""); : CHECK_EQ(current_term, 0); > Likewise here, which means you needn't initialize current_term. Done http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/tool_action_tablet.cc File src/kudu/tools/tool_action_tablet.cc: PS1, Line 194: Status s = GetTabletLeader(client, tablet_id, &leader_uuid, &leader_hp); > Consider consolidating from here to ResolveAddresses in a helper function, Done, although not sure if we need a toggle given that the routine anyways returns the Status and caller can decide if NotFound is FATAL or not ? PS1, Line 198: else if (!s.ok()) { : return s; : } > To pile on with Tidy Bot, just do RETURN_NOT_OK(s) after the not found chec Done PS1, Line 209: ServerStatusPB status; : RETURN_NOT_OK(GetServerStatus(leader_hp.host(), leader_hp.port(), &status)); > What's this for? I don't see status used anywhere. I kinda misunderstood the combination of RETURN_NOT_OK(GetServerStatus... thinking that this will return failure if leader is not up; My goal here was to check if leader is up & running so that we need not send the RPC if the server is already down. It looks like this doesn't buy us anything since RPC will fail anyways if the leader is down. Hence removed. PS1, Line 224: return Status::RemoteError(resp.ShortDebugString()); > Why not return StatusFromPB(resp.error().status()) like ConfigChange? Sounds good, just didn't know about this routine, thanks. -- To view, visit http://gerrit.cloudera.org:8080/4533 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat <din...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org> Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes