Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16936 )
Change subject: [master] KUDU-2181 Raft ChangeConfig request to remove a master ...................................................................... Patch Set 1: (16 comments) http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc File src/kudu/master/dynamic_multi_master-test.cc: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@346 PS1, Line 346: if (master != HostPort()) { : *req.mutable_rpc_addr() = HostPortToPB(master); : } What does it mean for this to be called with an unset 'master' field? Would be good to add a comment. Ah, made my way down to the test that uses this. Would it make sense to instead provide a garbage HostPort value from HostPort::ParseString? http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@474 PS1, Line 474: TransferMasterLeadershipAndVerify nit: "verify" is a bit overloaded, especially when in a file that can misconstrue it with using the ClusterVerifier. How about calling this TransferMasterLeadership and calling TransferMasterLeadership TransferMasterLeadershipAsync? Or better yet, have them be the same method, but with an optional bool parameter that tells the method to wait for success. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@787 PS1, Line 787: // When an ExternalMiniCluster is restarted after removal of a master then one of the : // remaining masters can get reassigned to the same wal dir which was previously assigned : // to the removed master. This causes problems during verification, so we always try to : // remove the last master in terms of index for test purposes. Does this problem go away if, rather than creating a new mini cluster, we just restart the old one after the call to RemoveMaster()? http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@805 PS1, Line 805: // We need at least 1 operation for the leader in current term before initiating ChangeConfig. I thought this happens by replicating a no-op when first becoming leader. Or is this referring to something else? http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@814 PS1, Line 814: ASSERT_EVENTUALLY([&] { : VerifyDeadMasterInSpecifiedState(master_to_remove_uuid, consensus::HealthReportPB::UNKNOWN); : }); It seems possible we'd trigger a test failure if, by luck, the gaps in ASSERT_EVENTUALLY missed this state entirely. It'd be more robust to just assert we get to FAILED, given it's a terminal state. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@834 PS1, Line 834: ASSERT_FALSE(ContainsKey(actual_master_hps, master_to_remove)); nit: while I appreciate the dedicated checking, especially when it comes to longer tests like these, we can probably do without assertions that are easily verified by reading a handful of lines of nearby, relatively trivial code. In this case, this is redundant with the L833 check since we've removed `master_to_remove` from `expected_master_hps`. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@853 PS1, Line 853: cluster_.reset(); Does cluster_->Shutdown(); ASSERT_OK(cluster_->Restart()); not work? I think it'd be much less surprising to further users of the EMC who may want to dynamically update the masters. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@930 PS1, Line 930: /* Omitting "--master_support_change_config" */ nit: maybe just set it to false now so we don't have to update this test when we enable it by default? http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@995 PS1, Line 995: It's possible there is a leadership change between querying for leader master and : // sending the add master request to non-leader master and hence using ASSERT_EVENTUALLY. Is this not a concern in other tests? Would setting --leader_failure_max_missed_heartbeat_periods to something very high help avoid this case? http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@1009 PS1, Line 1009: ASSERT_OK(cluster_->master_proxy(non_leader_master_idx)->RemoveMaster(req, &resp, &rpc)); What if this succeeds because we happened to elect non_leader_master_idx as leader? Wouldn't the call to VerifyNumMastersAndGetAddresses never succeed? Or wouldn't we be unable to successfully remove another Master? http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.cc File src/kudu/master/master.cc: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.cc@532 PS1, Line 532: Substitute nit: nothing being substituted? Maybe add the leader's UUID? http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master_service.cc File src/kudu/master/master_service.cc: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master_service.cc@277 PS1, Line 277: if (!FLAGS_master_support_change_config) { : rpc->RespondFailure(Status::NotSupported("Removing master is not supported")); : return; : } Hrm, can you remind me why it's insufficient to rely on the SupportsFeature flags? A comment here and above would be nice. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.h File src/kudu/mini-cluster/external_mini_cluster.h: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.h@483 PS1, Line 483: // Removes the master from the ExternalMiniCluster when the master has been removed : // dynamically after bringing up the ExternalMiniCluster. nit: just reading this, a few things aren't immediately clear: - What do I, as a user of this function, need to do to the cluster before calling this? Exactly when is "when the master has been removed"? - After calling this, what differences in behavior should I expect for further clients? http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc File src/kudu/mini-cluster/external_mini_cluster.cc: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc@964 PS1, Line 964: ) { nit: Why not ++it here? Seems more idiomatic. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc@966 PS1, Line 966: m nit: if this is only used once, maybe just inline it as (*it)->bound_rpc_hostport() http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc@972 PS1, Line 972: : r nit: drop the extra line -- To view, visit http://gerrit.cloudera.org:8080/16936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I76c03b8850faef60b65f85184c0a4db7cc6759ee Gerrit-Change-Number: 16936 Gerrit-PatchSet: 1 Gerrit-Owner: Bankim Bhavsar <ban...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Tue, 12 Jan 2021 01:32:10 +0000 Gerrit-HasComments: Yes