Dinesh Bhat has posted comments on this change. Change subject: KUDU-1330: Add a tool to unsafely recover from loss of majority replicas ......................................................................
Patch Set 16: (34 comments) http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/consensus/consensus.proto File src/kudu/consensus/consensus.proto: Line 494: // The raft config sent to destination server. > Add: Only the 'permanent_uuid' of each peer in the config is required (addr Done http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: PS16, Line 553: : > nit: add space after colon Done PS16, Line 554: : > and here Done PS16, Line 777: : > nit: space after colon here and elsewhere in this log message Done http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: PS16, Line 1167: : > space after colon here and in the rest of this commit message Done PS16, Line 1330: > nit: Add a period before this space. Done PS16, Line 1594: Node > Replica Done PS16, Line 1594: so new config will be appended. > but the new config will be unsafely changed anyway. Done PS16, Line 1595: : > ": " Done Line 1596: << state_->GetPendingConfigUnlocked().DebugString(); > Add: It may not be a good idea to display new config here especially when we know that what the tool passed in has just the list of uuids with default values for RaftConfigPB fields. new_config is going to be built later on looking at committed_config and we are displaying that at L1686, so its redundant too here. PS16, Line 1596: ebugString > use SecureShortDebugString() instead Done PS16, Line 1605: and that node is a VOTER in the config > remove this Done PS16, Line 1606: This is to prevent the API allowing an invalid uuid into the new config. > Consider changing this to: This allows a manual recovery tool to only have Done PS16, Line 1616: for tablet $1 > for tablet $1. Committed config: $2 Done PS16, Line 1630: we don't want to enforce unsafe config that way > Let's replace this part of the comment with: "it is rare and a replica with Done PS16, Line 1633: committed > new ok I will go with this, but the reason I was apprehensive of using new/committed is, VOTER/NON_VOTER field is read from committed config and user doesn't have a say in that. Hence the error message quoting "not a VOTER in new config" may be misleading. PS16, Line 1635: tablet $1 > tablet $1. Rejected config: $2 Done PS16, Line 1639: preceding_opid.index() + 1 > Define replicate_opid_index up here and use it here as well as down below. Done PS16, Line 1660: leader_term > let's rename this variable 'new_term' Done PS16, Line 1680: DVLOG(1) << "Consensus request: " > VLOG_WITH_PREFIX(3) << "UnsafeChangeConfig: Generated consensus request: " Done Line 1681: DVLOG(1) << "Replicate Msg: " << SecureShortDebugString(*replicate); > Remove this line since it's redundant with the line above Done PS16, Line 1684: , > nit: add space after comma Done PS16, Line 1685: DebugString() > SecureShortDebugString() here and below Done PS16, Line 1685: : > space after colon, not before Done PS16, Line 1686: : > here also: space after colon, not before Done http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/consensus/raft_consensus_state.cc File src/kudu/consensus/raft_consensus_state.cc: PS16, Line 241: ( > nit: Remove extra pair of parentheses here Done PS16, Line 245: << Substitute("New committed config must equal pending config, but does not. " : "Pending config: $0, committed config: $1", : SecureShortDebugString(pending_config), : SecureShortDebugString(config_to_commit)); > nit: it seems like the indentation is a little messed up on these 4 lines n Done http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/tools/tool_action_remote_replica.cc File src/kudu/tools/tool_action_remote_replica.cc: Line 45: #include "kudu/tools/tool_test_util.h" > Is this used? It looks like a test utility library? yeah, actually I was initially using GetKuduCtlAbsolutePath("/bin/kudu") instead of random "kudu-tools" as caller id. I updated the code at L381 now. PS16, Line 366: LOG(WARNING) << "NOTE: the new config may be replicated asynchronously " : << "to other peers and it may take a while to bring the tablet " : << "back to majority replica count depending upon how much time " : << "the new peer replicas take to catch up with the tablet server " : << "enforcing the new config."; > I think we should remove this message. I think it is confusing because it t My concern about removing this message is that, user wouldn't know about these operations happening in the background and think that the cluster is not back online even after 10 minutes. This is where a warning message like this may help so that he can check the logs to see whats going on behind the scene. Anyways I removed it for now, and we can add it back later if necessary. PS16, Line 378: MonoDelta::FromSeconds(30) > MonoDelta::FromMilliseconds(FLAGS_timeout_ms) Done http://gerrit.cloudera.org:8080/#/c/6066/16/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: PS16, Line 885: DVLOG(1) > how about: VLOG(2) Done PS16, Line 909: DVLOG(1) > how about: VLOG(2) Done PS16, Line 910: ChangeConfig > UnsafeChangeConfig Done PS16, Line 923: s > Status s Done -- To view, visit http://gerrit.cloudera.org:8080/6066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I908d8c981df74d56dbd034e72001d379fb314700 Gerrit-PatchSet: 16 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat <din...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org> Gerrit-Reviewer: Dinesh Bhat <din...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes