Dinesh Bhat has posted comments on this change.

Change subject: KUDU-1330: Add a tool to unsafely recover from loss of majority 
replicas
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6066/1/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS1, Line 1556: RaftConsensus::ReplaceConfig
> I'll let mike address the pending/active config concerns of this method, si
TFTR David, just a heads-up that the latest patch doesn't reflect all the 
suggestions made above; the rationale for each is listed below.

1) I agree with adding checks that consensus has been stuck for a while, 
instead of letting the tool operate immediately without allowing automatic 
recovery a chance. Is there a metric we should rely on, or should that check 
be a function of some peers in the config having been unresponsive for a 
while? The latter case falls under node eviction after a certain timeout, so 
we could wait at least follower_unavailable_considered_failed_sec.

2) I am having second thoughts about making the API accept a list of removed 
uuids, because we want to be able to enforce a config of {A, B} on node A 
despite a committed config of {A, C, D} once C and D become dead nodes. This 
also means the tool needs to take the host and port endpoint of B as input, 
and there are other commands to fetch that info. The current uuid presence 
check is overly restrictive because it only checks whether the passed uuid is 
in the config, and it accepts only one uuid at the moment. I still need to add 
a unit test for the checks I have introduced in this API, i.e., simulating 
user errors and verifying that they fail as expected.
Having the API accept the list of uuids that make up the new config seems 
like a better idea. This requires either a new RPC that takes a RaftConfigPB 
directly in the request, or having ChangeConfigRequestPB take a list of 
RaftPeerPB so the API can construct the new config from that list. The usual 
AddServer/RemoveServer paths would expect the size of that list to be 1. Let 
me know your thoughts.
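
To make the second option concrete, here is a rough sketch of what the 
extended request might look like. This is illustrative only — the field names 
and numbers (new_config_peers, etc.) are assumptions, not Kudu's actual 
consensus.proto; only ChangeConfigRequestPB, RaftPeerPB, and RaftConfigPB are 
the existing message names referenced above.

```proto
// Hypothetical sketch, not the real consensus.proto definition.
message ChangeConfigRequestPB {
  optional bytes tablet_id = 1;

  // Existing single-peer field: AddServer/RemoveServer would continue to
  // populate exactly this one peer.
  optional RaftPeerPB server = 2;

  // Proposed: the complete peer list for the forced new config. The
  // receiving replica would build the RaftConfigPB from this list, so the
  // caller never has to name the evicted peers explicitly.
  repeated RaftPeerPB new_config_peers = 3;
}
```

With this shape, the unsafe-recovery tool would send {A, B} in 
new_config_peers, and the presence check reduces to validating that the 
receiving node's own uuid is in the list.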

3) I refactored a bit, grabbing the consensus state and consensus queue state 
as one snapshot before building the new config and the consensus request for 
Update(). The API currently assumes the dead nodes aren't interfering, so 
this snapshot of the state becomes relevant when we weaken those assumptions 
in future revisions.


-- 
To view, visit http://gerrit.cloudera.org:8080/6066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I908d8c981df74d56dbd034e72001d379fb314700
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes
