Hello David Ribeiro Alves, Mike Percy,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/6066

to review the following change.

Change subject: WIP KUDU-1330: Add a tool to unsafely recover from loss of 
majority replicas
......................................................................

WIP KUDU-1330: Add a tool to unsafely recover from loss of majority replicas

This patch adds an API to allow unsafe config change via an external
recovery tool 'kudu remote_replica replace_config'.

This tool lets us replace a 3-node config on a tablet server with a
1-node config. This is perticularly useful when we have 2 out of 3
replicas down and we want to bring the tablet back to operational state.
We can use this tool to force a new config on the surviving node providing
all the details of the new config from the tool. As a result
of the forced config change, the automatic leader election kicks in via
raft mechanisms and the re-replication is triggered from master to bring
the replica count back upto 3-node config.
The lonely survivor talking to the tool tends to become the leader in
new config in majority of the use cases because:
a) The API/tool acts as a fake leader mimicking the actual leader the node
   had voted for and replicate the new config with a higher term and
   bumped up op_index. This ensures that other 2 nodes added later on respect
   the term emitted by this node and accept this node as leader.
a) Assumption is that, the dead nodes are not coming back with a higher term,
   hence leadership is retained.

Also the ReplaceConfig() API adds a way to abort a pending config change
because pending config comes in the way of recovery tool trying to
replicate/commit the new config on the surviving tablet server. There is only
one pending config change allowed at a time for a given tablet, hence
aborting the pending config change seems safest bet.

This patch is a first in series for unsafe config changes, and assumes that
the dead servers are not coming back while the new config change is taking
effect.

TODO:
0) Accept more replica_uuids from the command line to make support multiple
   peers to be added in the new config.
1) Add a test case when 1 leader is alive.
2) Add a test case for when the node has a pending config change,
   covering the cases when the node is a leader or a follower.
3) Test with a 5-replica config forcing the old {ABCDE} to new {AB} on A.

Change-Id: I908d8c981df74d56dbd034e72001d379fb314700
---
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/tablet_service.cc
11 files changed, 310 insertions(+), 86 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/66/6066/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I908d8c981df74d56dbd034e72001d379fb314700
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <din...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>

Reply via email to