Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )
Change subject: KUDU-2780: create thread for auto-rebalancing ...................................................................... Patch Set 18: (20 comments) Looking better! Mostly nits from me. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc File src/kudu/master/auto_rebalancer-test.cc: http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@146 PS18, Line 146: // Make sure the leader master has begun the auto-rebalancing thread. : void CheckAutoRebalancerStarted() { : ASSERT_EVENTUALLY([&] { : int leader_idx; : ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx)); : ASSERT_LT(0, NumLoopIterations(leader_idx)); : }); : } : : // Make sure the auto-rebalancing loop has iterated a few times, : // and no moves were scheduled. : void CheckNoMovesScheduled() { : ASSERT_EVENTUALLY([&] { : int leader_idx; : ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx)); : ASSERT_LT(3, NumLoopIterations(leader_idx)); : ASSERT_EQ(0, NumMovesScheduled(leader_idx)); : }); : } : : void CheckSomeMovesScheduled() { : ASSERT_EVENTUALLY([&] { : int leader_idx; : ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx)); : ASSERT_LT(1, NumLoopIterations(leader_idx)); : ASSERT_LT(0, NumMovesScheduled(leader_idx)); : }); : } It isn't important that these calls aren't repeatable is it? At least I don't think so, vs doing something like get initial moves scheduled get initial loop iterations ASSERT EVENTUALLY that the moves scheduled and loop iterations have increased/stayed the same http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@178 PS18, Line 178: nit: spacing http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@184 PS18, Line 184: nit: spacing http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@253 PS18, Line 253: ASSERT_EQ(0, NumLoopIterations(i)); : } nit: spacing http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@269 PS18, Line 269: // Create a cluster that is initially balanced. : // Bring up another tserver, and verify that moves are scheduled, : // since the cluster is no longer balanced. : TEST_F(AutoRebalancerTest, MovesScheduledIfAddTserver) { Do you think it's worth testing that recovering replicas doesn't count towards rebalancing? E.g. if I have 3 replicas on 4 servers and one server dies, I wouldn't expect there to be any rebalancing, even though I'd expect some movement of replicas. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h File src/kudu/master/auto_rebalancer.h: http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@66 PS18, Line 66: // nit: remove the extra line http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@83 PS18, Line 83: friend class CatalogManager; nit: Is this still needed? http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@120 PS18, Line 120: RPC timeout in seconds nit: do you think it's worth elaborating on what happens when RPCs time out? Is it harmful if this value is set too high? Or too low? http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@168 PS18, Line 168: number_of_loop_iterations_for_test_ = 0; : moves_scheduled_this_round_for_test_ = 0; nit: put these in the member initializer list? http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@201 PS18, Line 201: SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds)); Is it safe to sleep here because we both don't have the lock and aren't the leader? Or will this lock us out of becoming leader? Maybe add a comment if this sleeping is indeed benign. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@249 PS18, Line 249: : s = ExecuteMoves(replica_moves, has_location_violations); : WARN_NOT_OK(s, Substitute("could not schedule auto-rebalancing replica moves: $0", : s.ToString())); nit: don't need to save the local variable; WARN_NOT_OK prints the Status. Same below. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@324 PS18, Line 324: location_raw_info, &algo, CrossLocations::NO, &rep_moves)); nit: add a couple spaces http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@339 PS18, Line 339: max_moves -= replica_moves->size(); We should probably exit early here if this is non-positive, lest we burn a bunch of CPU on rebuilding the cluster info below. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@396 PS18, Line 396: leader nit: this is only used once; maybe inline it? http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@439 PS18, Line 439: if (!ts_manager_->LookupTSByUUID(leader_uuid, &desc)) { : return Status::NotFound("Could not find leader replica's tserver"); : } Would it be appropriate to also determine whether dst_ts_uuid is available to host a new tablet? http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@447 PS18, Line 447: dst_ts_uuid nit: maybe DCHECK this isn't empty? Or predicate on dst_ts_uuid being empty rather than on has_location_violations? That'd make it easier to follow since we wouldn't need to maintain has_location_violations, if it's documented that !dst_ts_uuid.empty() is synonymous with fixing a location violation. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@451 PS18, Line 451: HostPortToPB nit: spacing http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@452 PS18, Line 452: else { : RETURN_NOT_OK(hp.ParseString(leader_uuid, leader_hp.port())); : vector<Sockaddr> resolved; : RETURN_NOT_OK(hp.ResolveAddresses(&resolved)); : proxy.reset(new ConsensusServiceProxy(messenger_, resolved[0], hp.host())); : } : We get the leader proxy unconditionally, so maybe pull it out of the conditional block? Also should we LookupTS the leader here too? http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/catalog_manager.cc@313 PS18, Line 313: runtime This probably shouldn't be labeled "runtime" given the initialization of the thread happens when we start up. I think "advanced" and "experimental" might be appropriate. http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/rebalance/rebalancer.cc File src/kudu/rebalance/rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/rebalance/rebalancer.cc@528 PS18, Line 528: move_tablet_id nit: can std::move(move_tablet_id) -- To view, visit http://gerrit.cloudera.org:8080/14177 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4 Gerrit-Change-Number: 14177 Gerrit-PatchSet: 18 Gerrit-Owner: Hannah Nguyen <hannah.ngu...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Hannah Nguyen <hannah.ngu...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Tue, 03 Mar 2020 21:47:06 +0000 Gerrit-HasComments: Yes