Dinesh Bhat has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 1:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/4533/1//COMMIT_MSG
Commit Message:

PS1, Line 12:   
> Nit: got an extra space here.
Done


PS1, Line 18: tablet
> You mean 'table' here, right?
yeah, thanks for catching.


http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS1, Line 178:     Status s = itest::GetConsensusState(ts,
             :                                         tablet_id,
             :                                         
consensus::CONSENSUS_CONFIG_COMMITTED,
             :                                         
MonoDelta::FromSeconds(10), &cstate);
             :     if (!s.ok()) {
             :       return s;
             :     }
> Just use RETURN_NOT_OK(...).
Done


Line 198:   int num_retries = 0;
> Why is this defined way up here instead of closer to where it's used?
Done, was thinking of keeping the *retries* variables together initially, but I 
agree about moving it down just before the use. Also just made it one variable 
now.


PS1, Line 204: CHECK_EQ
> Use ASSERT_* instead of CHECK_* since all of these are on the main test thr
Sure, could you tell me what's the relationship between this being a main 
thread and these macros here ? I remember asking this to you earlier in one of 
earlier rev comments but forgot to follow up in-person. 
I am just curious if there are debug-only versions of these macros ?


Line 218:   CHECK_EQ(master_reported_leader, current_leader);
> The leader can change between the GetInfoFromConsensus calls and L213, righ
In reality yes, but given that we are operating in a deterministic test 
environment where the server failures are manually triggered by code, I thought 
this check is safe to have. However, I am not entirely sure what this check is 
buying us, I just happen to notice that there are more than one way we could 
find the leader replica, and I thought it would be good to compare them when we 
know that they don't change on the fly during the test. I can remove this if 
you think this might be a room for any flakiness.


PS1, Line 230:     // Wait for re-election to happen again.
             :     // 3 seconds picked to accommodate consensus heartbeat 
timeout(1.5s)
> Use AssertEventually() to make this more robust, and to minimize the amount
I tried this with an unsuccessful result, later realized AssertEventually() 
kinda doesn't seem to fit this scenario here; Reason being: a) We really want 
to check the consensus state after the heartbeat timeout as opposed to 
executing f() and then sleeping;  AssertEventually() aggressively(every ms) 
keeps executing the f() until the timeout hits. b) I won't be able to use 
assert from inside the lambda argument when we know that Consensus state is 
still under flux.
Side note: I found one minor bug in AssertEventually() here which I can fix in 
another patch:
    int attempts = 0;
    int sleep_ms = (attempts < 10) ? (1 << attempts) : 1000;
    SleepFor(sleep_ms);

Perhaps we can devise another routine which sleeps for N secs and then executes 
the function and returns the status ?


Line 264:   CHECK_EQ(master_reported_leader, "");
> Let's not check this. It's a common calling convention that if a function f
Done


PS1, Line 271:   CHECK_EQ(current_leader, "");
             :   CHECK_EQ(current_term, 0);
> Likewise here, which means you needn't initialize current_term.
Done


http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS1, Line 194:   Status s = GetTabletLeader(client, tablet_id, &leader_uuid, 
&leader_hp);
> Consider consolidating from here to ResolveAddresses in a helper function, 
Done, although not sure if we need a toggle given that the routine anyways 
returns the Status and caller can decide if NotFound is FATAL or not ?


PS1, Line 198: else if (!s.ok()) {
             :     return s;
             :   }
> To pile on with Tidy Bot, just do RETURN_NOT_OK(s) after the not found chec
Done


PS1, Line 209:   ServerStatusPB status;
             :   RETURN_NOT_OK(GetServerStatus(leader_hp.host(), 
leader_hp.port(), &status));
> What's this for? I don't see status used anywhere.
I kinda misunderstood the combination of RETURN_NOT_OK(GetServerStatus... 
thinking that this will return failure if leader is not up; My goal here was to 
check if leader is up & running so that we need not send the RPC if the server 
is already down. It looks like this doesn't buy us anything since RPC will fail 
anyways if the leader is down. Hence removed.


PS1, Line 224:  return Status::RemoteError(resp.ShortDebugString());
> Why not return StatusFromPB(resp.error().status()) like ConfigChange?
Sounds good, just didn't know about this routine, thanks.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: Yes

Reply via email to