Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )
Change subject: WIP KUDU-2780: create thread for auto-rebalancing ...................................................................... Patch Set 5: (31 comments) http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc File src/kudu/master/auto_rebalancer-test.cc: http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@50 PS5, Line 50: FLAGS_auto_rebalancing_stop_flag You probably meant to set the interval here. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@87 PS5, Line 87: : if (!AllowSlowTests()) { : LOG(WARNING) << "test is skipped; set KUDU_ALLOW_SLOW_TESTS=1 to run"; : return; : } nit: you can use SKIP_IF_SLOW_NOT_ALLOWED() for this. Same elsewhere. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@104 PS5, Line 104: SleepFor(MonoDelta::FromSeconds(10)); : : int leader_idx; : ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx)); : for (int i=0; i<kNumMasters; i++) { : if (i==leader_idx) { : ASSERT_NE(0, number_of_loop_iterations(i)); : } : else { : ASSERT_EQ(0, moves_scheduled_this_round(i)); : } : } : Rather than sleeping this long, how about asserting with ASSERT_EVENTUALLY that the condition eventually becomes true. nit: fix spacing http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@117 PS5, Line 117: cluster_->Shutdown(); nit: don't need this since it gets called in TearDown() http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@144 PS5, Line 144: auto iterations = number_of_loop_iterations(new_leader_idx); : ASSERT_LT(0, iterations); Should this be wrapped in ASSERT_EVENTUALLY? What if the thread hasn't started iterating yet? http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@149 PS5, Line 149: TEST_F(AutoRebalancerTest, NoReplicaMovesIfBalanced) { Consider breaking this into smaller tests so the cases are parallelizable. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@200 PS5, Line 200: num_tservers = 2; : num_tablets = 1; It doesn't seem particularly useful to keep defining these separately, and IMO it hurts readability, since I still need to read through the rest of the below setup to understand the big picture. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@224 PS5, Line 224: SleepFor(MonoDelta::FromSeconds(10)); : ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx)); : ASSERT_EQ(moves_scheduled_this_round(leader_idx), 0); I wonder if it's sufficient to just define "no moves" as having the loop iteration count increase, but having no moves scheduled. Then we wouldn't have to sleep so much. WDYT? Also might be worth putting that into some CheckNoMovesScheduled() function. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@263 PS5, Line 263: ASSERT_GE ASSERT_EQ()? http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h File src/kudu/master/auto_rebalancer.h: http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@63 PS5, Line 63: // the catalog manager must be in the process of initializing. nit: Why does this need to be the case? The only invariant I see right now is that Init() cannot have previously succeeded (lest devs feel the wrath of the CHECK failure) http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@71 PS5, Line 71: int number_of_loop_iterations; : int moves_scheduled_this_round; Nit: could you suffix these with _for_test so it's obvious that their access can be mentally ignored when reading code http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@88 PS5, Line 88: Status GetMovesUsingRebalancingAlgo( : const rebalance::ClusterRawInfo& raw_info, : std::vector<rebalance::Rebalancer::ReplicaMove>* replica_moves, : rebalance::RebalancingAlgo* algo, : bool cross_location = false) const; : : Status GetMoves(std::vector<rebalance::Rebalancer::ReplicaMove>* replica_moves, : bool* policy_fixing) const; : : Status GetTabletLeader( : const std::string& tablet_id, : std::string* leader_uuid, : HostPort* leader_hp) const; : : Status ExecuteMoves( : const std::vector<rebalance::Rebalancer::ReplicaMove>& replica_moves, : const bool& policy_fixing); nit: add docs. What do these and their arguments do? http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@122 PS5, Line 122: rebalancer_ Does this _need_ to be heap allocated? http://gerrit.cloudera.org:8080/#/c/14177/3/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/3/src/kudu/master/auto_rebalancer.cc@377 PS3, Line 377: unordered_set<string> tablets_in_move; : for (const auto& move : moves) { : vector<string> tablet_ids; : RETURN_NOT_OK(rebalancer_->FindReplicas(move, raw_info, &tablet_ids)); : if (cross_location) { : // In case of cross-location (a.k.a. inter-location) rebalancing it is : // necessary to make sure the majority of replicas would not end up : // at the same location after the move. If so, remove those tablets : // from the list of candidates. : RETURN_NOT_OK(rebalancer_->FilterCrossLocationTabletCandidates( : cluster_info.locality.location_by_ts_id, tpi, move, &tablet_ids)); : } : // Shuffle the set of the tablet identifiers: that's to achieve even spread : // of moves across tables with the same skew. : std::shuffle(tablet_ids.begin(), tablet_ids.end(), random_device()); : string move_tablet_id; : for (const auto& tablet_id : tablet_ids) { : if (tablets_in_move.find(tablet_id) == tablets_in_move.end()) { : // For now, choose the very first tablet that does not have replicas : // in move. Later on, additional logic might be added to find : // the best candidate. : move_tablet_id = tablet_id; : break; : } : } > Yeah, maybe it'd make sense as a part of the rebalance namespace. 30 lines As a reviewer, my main gripe with this is that in reviewing this patch, it's pretty difficult to distinguish the new code from code that has been used already in existing rebalancer code. Code reuse also hopefully makes further changes to this code easier, since we wouldn't have to worry about updating multiple copy-paste sites. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@200 PS5, Line 200: SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds)); Should we just sleep at the top of the loop so we don't have to repeat this before every continue? Alternatively, check out SCOPED_CLEANUP(), which is helpful to ensure something gets run when leaving scope. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@209 PS5, Line 209: cluster_info_ Nit: since we're blowing these away every iteration, how about just making these local variables and passing them around as const refs? Then it's more obviously correct that eg calls to Get moved are using the correct cluster_info_ computed in this iteration. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@215 PS5, Line 215: s.ToString() << ": could not retrieve cluster info"; nit: consider using Substitute() from gutil/strings/substitute.h here, e.g. Substitute("Could not retrieve cluster info: $0", s.ToString()); same elsewhere Also maybe this should be a warning? http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@221 PS5, Line 221: (s.ok()) nit: drop the extra parens; same elsewhere http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@238 PS5, Line 238: this->I nit: drop this->? http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@245 PS5, Line 245: if (!FLAGS_auto_rebalancing_stop_flag) { We should probably move this check to the top of the loop so we don't go through all of the above steps needlessly. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@274 PS5, Line 274: bool* policy_fixing nit: it wasn't immediately obvious to me from "policy fixing" that this is referring to location awareness. Maybe rename to "should_fix_locations" or "has_location_violations" or something? http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@286 PS5, Line 286: replica_moves Nit: space before http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@336 PS5, Line 336: bool cross_location nit: for readability, consider using an enum like CrossLocations::{YES, NO} http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@334 PS5, Line 334: vector<Rebalancer::ReplicaMove>* replica_moves, : rebalance::RebalancingAlgo* algo, : bool cross_location nit: mind reordering these so the pure out param (replica_moves) comes after the in/out and input params? http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@474 PS5, Line 474: unique_ptr<ConsensusServiceProxy> proxy; Does this have to be heap-allocated? Same elsewhere. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@681 PS5, Line 681: Is nit: sounds a bit awkward. Maybe remove? or "AreCluster.."? http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@687 PS5, Line 687: max_replica_count-min_replica_count nit: add spaces, same below http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@691 PS5, Line 691: const auto min_table_skew = table_skew_info.begin()->first; : const auto m const auto&? http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@709 PS5, Line 709: // One location: table and cluster skew? nit: complete sentences in comments are preferred. Same elsewhere. http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/catalog_manager.cc@5202 PS5, Line 5202: : TSManager* CatalogManager::ts_manager() const { : return master_->ts_manager(); : } : nit: Rather than exposing this from the catalog manager, how about passing it into the constructor of AutoRebalancerTask? http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/rebalance/rebalancer.h File src/kudu/rebalance/rebalancer.h: http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/rebalance/rebalancer.h@147 PS5, Line 147: // Public accessors for fields of Rebalancer's config object. : bool ShouldRunPolicyFixer() const; : bool ShouldRunCrossLocationRebalancing() const; : bool ShouldRunIntraLocationRebalancing() const; Don't need these anymore -- To view, visit http://gerrit.cloudera.org:8080/14177 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4 Gerrit-Change-Number: 14177 Gerrit-PatchSet: 5 Gerrit-Owner: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Hannah Nguyen <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Mon, 23 Sep 2019 05:43:34 +0000 Gerrit-HasComments: Yes
