Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14177 )

Change subject: KUDU-2780: create thread for auto-rebalancing
......................................................................


Patch Set 18:

(20 comments)

Looking better! Mostly nits from me.

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc
File src/kudu/master/auto_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@146
PS18, Line 146:   // Make sure the leader master has begun the auto-rebalancing 
thread.
              :   void CheckAutoRebalancerStarted() {
              :     ASSERT_EVENTUALLY([&] {
              :       int leader_idx;
              :       ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
              :       ASSERT_LT(0, NumLoopIterations(leader_idx));
              :     });
              :   }
              :
              :   // Make sure the auto-rebalancing loop has iterated a few 
times,
              :   // and no moves were scheduled.
              :   void CheckNoMovesScheduled() {
              :     ASSERT_EVENTUALLY([&] {
              :       int leader_idx;
              :       ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
              :       ASSERT_LT(3, NumLoopIterations(leader_idx));
              :       ASSERT_EQ(0, NumMovesScheduled(leader_idx));
              :     });
              :   }
              :
              :   void CheckSomeMovesScheduled() {
              :     ASSERT_EVENTUALLY([&] {
              :       int leader_idx;
              :       ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
              :       ASSERT_LT(1, NumLoopIterations(leader_idx));
              :       ASSERT_LT(0, NumMovesScheduled(leader_idx));
              :     });
              :   }
It isn't important that these calls aren't repeatable is it? At least I don't 
think so, vs doing something like

 get initial moves scheduled
 get initial loop iterations
 ASSERT EVENTUALLY that the moves scheduled and loop iterations have 
increased/stayed the same


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@178
PS18, Line 178:
nit: spacing


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@184
PS18, Line 184:
nit: spacing


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@253
PS18, Line 253:        ASSERT_EQ(0, NumLoopIterations(i));
              :       }
nit: spacing


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer-test.cc@269
PS18, Line 269: // Create a cluster that is initially balanced.
              : // Bring up another tserver, and verify that moves are 
scheduled,
              : // since the cluster is no longer balanced.
              : TEST_F(AutoRebalancerTest, MovesScheduledIfAddTserver) {
Do you think it's worth testing that recovering replicas doesn't count towards 
rebalancing? E.g. if I have 3 replicas on 4 servers and one server dies, I 
wouldn't expect there to be any rebalancing, even though I'd expect some 
movement of replicas.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h
File src/kudu/master/auto_rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@66
PS18, Line 66: //
nit: remove the extra line


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.h@83
PS18, Line 83:   friend class CatalogManager;
nit: Is this still needed?


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@120
PS18, Line 120: RPC timeout in seconds
nit: do you think it's worth elaborating on what happens when RPCs time out? Is 
it harmful if this value is set too high? Or too low?


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@168
PS18, Line 168:   number_of_loop_iterations_for_test_ = 0;
              :   moves_scheduled_this_round_for_test_ = 0;
nit: put these in the member initializer list?


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@201
PS18, Line 201:         
SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds));
Is it safe to sleep here because we both don't have the lock and aren't the 
leader? Or will this lock us out of becoming leader? Maybe add a comment if 
this sleeping is indeed benign.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@249
PS18, Line 249:
              :     s = ExecuteMoves(replica_moves, has_location_violations);
              :     WARN_NOT_OK(s, Substitute("could not schedule 
auto-rebalancing replica moves: $0",
              :                               s.ToString()));
nit: don't need to save the local variable; WARN_NOT_OK prints the Status. Same 
below.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@324
PS18, Line 324:         location_raw_info, &algo, CrossLocations::NO, 
&rep_moves));
nit: add a couple spaces


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@339
PS18, Line 339:   max_moves -= replica_moves->size();
We should probably exit early here if this is non-positive, lest we burn a 
bunch of CPU on rebuilding the cluster info below.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@396
PS18, Line 396: leader
nit: this is only used once; maybe inline it?


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@439
PS18, Line 439:       if (!ts_manager_->LookupTSByUUID(leader_uuid, &desc)) {
              :         return Status::NotFound("Could not find leader 
replica's tserver");
              :       }
Would it be appropriate to also determine whether dst_ts_uuid is available to 
host a new tablet?


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@447
PS18, Line 447: dst_ts_uuid
nit: maybe DCHECK this isn't empty? Or predicate on dst_ts_uuid being empty 
rather than on has_location_violations? That'd make it easier to follow since 
we wouldn't need to maintain has_location_violations, if it's documented that 
!dst_ts_uuid.empty() is synonymous with fixing a location violation.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@451
PS18, Line 451: HostPortToPB
nit: spacing


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/auto_rebalancer.cc@452
PS18, Line 452:  else {
              :       RETURN_NOT_OK(hp.ParseString(leader_uuid, 
leader_hp.port()));
              :       vector<Sockaddr> resolved;
              :       RETURN_NOT_OK(hp.ResolveAddresses(&resolved));
              :       proxy.reset(new ConsensusServiceProxy(messenger_, 
resolved[0], hp.host()));
              :     }
              :
We get the leader proxy unconditionally, so maybe pull it out of the 
conditional block? Also should we LookupTS the leader here too?


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/master/catalog_manager.cc@313
PS18, Line 313: runtime
This probably shouldn't be labeled "runtime" given the initialization of the 
thread happens when we start up.

I think "advanced" and "experimental" might be appropriate.


http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/rebalance/rebalancer.cc
File src/kudu/rebalance/rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/18/src/kudu/rebalance/rebalancer.cc@528
PS18, Line 528: move_tablet_id
nit: can std::move(move_tablet_id)



--
To view, visit http://gerrit.cloudera.org:8080/14177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4
Gerrit-Change-Number: 14177
Gerrit-PatchSet: 18
Gerrit-Owner: Hannah Nguyen <hannah.ngu...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Hannah Nguyen <hannah.ngu...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 03 Mar 2020 21:47:06 +0000
Gerrit-HasComments: Yes

Reply via email to