Hello David Ribeiro Alves, Mike Percy, I'd like you to do a code review. Please visit
http://gerrit.cloudera.org:8080/6066 to review the following change.

Change subject: WIP KUDU-1330: Add a tool to unsafely recover from loss of majority replicas
......................................................................

WIP KUDU-1330: Add a tool to unsafely recover from loss of majority replicas

This patch adds an API that allows an unsafe config change via an external recovery tool, 'kudu remote_replica replace_config'. The tool lets us replace a 3-node config on a tablet server with a 1-node config. This is particularly useful when 2 out of 3 replicas are down and we want to bring the tablet back to an operational state. We can use the tool to force a new config on the surviving node, supplying all the details of the new config from the tool. As a result of the forced config change, automatic leader election kicks in via the Raft mechanisms, and the master triggers re-replication to bring the replica count back up to a 3-node config.

The lone survivor talking to the tool tends to become the leader of the new config in the majority of use cases because:
a) The API/tool acts as a fake leader, mimicking the actual leader the node had voted for, and replicates the new config with a higher term and a bumped-up op_index. This ensures that the other 2 nodes added later respect the term emitted by this node and accept it as leader.
b) The assumption is that the dead nodes are not coming back with a higher term, hence leadership is retained.

The ReplaceConfig() API also adds a way to abort a pending config change, because a pending config gets in the way of the recovery tool trying to replicate/commit the new config on the surviving tablet server. Only one pending config change is allowed at a time for a given tablet, hence aborting the pending config change seems the safest bet.

This patch is the first in a series for unsafe config changes, and assumes that the dead servers are not coming back while the new config change is taking effect.
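
For reviewers who want the intended behaviour in one place, here is a toy model of the steps described in the commit message above: abort any pending config change, bump the term and op index, and install the new config on the survivor. It is only a sketch under stated assumptions, not the code in this patch. `ReplicaState`, `has_majority` and `replace_config` are hypothetical names invented for illustration, and the model commits the new config immediately, which the real consensus code does not do in one step.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class ReplicaState:
    """Toy stand-in for one tablet replica's consensus state (not a Kudu type)."""
    uuid: str
    current_term: int
    last_op_index: int
    committed_config: List[str]
    pending_config: Optional[List[str]] = None


def has_majority(config: List[str], live: Set[str]) -> bool:
    """Return True if a strict majority of the config's peers are alive."""
    alive = sum(1 for peer in config if peer in live)
    return alive > len(config) // 2


def replace_config(replica: ReplicaState, new_peers: List[str]) -> None:
    """Force new_peers onto the surviving replica (hypothetical helper)."""
    if replica.uuid not in new_peers:
        raise ValueError("the surviving replica must be in the new config")
    # Only one pending config change is allowed per tablet, so abort it first.
    replica.pending_config = None
    # Act like a leader with a higher term and a bumped-up op index, so that
    # peers added later accept this node's term.
    replica.current_term += 1
    replica.last_op_index += 1
    # Simplification: treat the forced config as committed right away.
    replica.committed_config = list(new_peers)


if __name__ == "__main__":
    survivor = ReplicaState(
        uuid="A",
        current_term=4,
        last_op_index=10,
        committed_config=["A", "B", "C"],
        pending_config=["A", "B", "C", "D"],
    )
    live = {"A"}
    print(has_majority(survivor.committed_config, live))    # False
    replace_config(survivor, ["A"])
    print(survivor.current_term, survivor.last_op_index)    # 5 11
    print(has_majority(survivor.committed_config, live))    # True
```

The majority check shows why the tablet is stuck with 1 of 3 replicas alive and why a 1-node config lets the survivor make progress again until re-replication restores the replica count.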

TODO:
0) Accept more replica_uuids from the command line to support adding multiple peers to the new config.
1) Add a test case for when 1 leader is alive.
2) Add a test case for when the node has a pending config change, covering the cases when the node is a leader or a follower.
3) Test with a 5-replica config, forcing the old {ABCDE} to the new {AB} on A.

Change-Id: I908d8c981df74d56dbd034e72001d379fb314700
---
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/tablet_service.cc
11 files changed, 310 insertions(+), 86 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/66/6066/1
--
To view, visit http://gerrit.cloudera.org:8080/6066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I908d8c981df74d56dbd034e72001d379fb314700
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>