Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14177 )

Change subject: WIP KUDU-2780: create thread for auto-rebalancing
......................................................................


Patch Set 5:

(31 comments)

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc
File src/kudu/master/auto_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@50
PS5, Line 50: FLAGS_auto_rebalancing_stop_flag
You probably meant to set the interval here.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@87
PS5, Line 87:
            :   if (!AllowSlowTests()) {
            :     LOG(WARNING) << "test is skipped; set KUDU_ALLOW_SLOW_TESTS=1 
to run";
            :     return;
            :   }
nit: you can use SKIP_IF_SLOW_NOT_ALLOWED() for this. Same elsewhere.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@104
PS5, Line 104:   SleepFor(MonoDelta::FromSeconds(10));
             :
             :   int leader_idx;
             :   ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
             :   for (int i=0; i<kNumMasters; i++) {
             :     if (i==leader_idx) {
             :       ASSERT_NE(0, number_of_loop_iterations(i));
             :     }
             :     else {
             :       ASSERT_EQ(0, moves_scheduled_this_round(i));
             :     }
             :   }
             :
Rather than sleeping this long, how about asserting with ASSERT_EVENTUALLY that 
the condition eventually becomes true.

nit: fix spacing


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@117
PS5, Line 117:   cluster_->Shutdown();
nit: don't need this since it gets called in TearDown()


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@144
PS5, Line 144:   auto iterations = number_of_loop_iterations(new_leader_idx);
             :   ASSERT_LT(0, iterations);
Should this be wrapped in ASSERT_EVENTUALLY? What if the thread hasn't started 
iterating yet?


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@149
PS5, Line 149: TEST_F(AutoRebalancerTest, NoReplicaMovesIfBalanced) {
Consider breaking this into smaller tests so the cases are parallelizable.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@200
PS5, Line 200:     num_tservers = 2;
             :     num_tablets = 1;
It doesn't seem particularly useful to keep defining these separately, and IMO 
it hurts readability, since I still need to read through the rest of the below 
setup to understand the big picture.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@224
PS5, Line 224: SleepFor(MonoDelta::FromSeconds(10));
             :     ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
             :     ASSERT_EQ(moves_scheduled_this_round(leader_idx), 0);
I wonder if it's sufficient to just define "no moves" as having the loop 
iteration count increase, but having no moves scheduled. Then we wouldn't have 
to sleep so much. WDYT?

Also might be worth putting that into some CheckNoMovesScheduled() function.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@263
PS5, Line 263: ASSERT_GE
ASSERT_EQ()?


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h
File src/kudu/master/auto_rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@63
PS5, Line 63:   // the catalog manager must be in the process of initializing.
nit: Why does this need to be the case? The only invariant I see right now is 
that Init() cannot have previously succeeded (lest devs feel the wrath of the 
CHECK failure)


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@71
PS5, Line 71: int number_of_loop_iterations;
            :   int moves_scheduled_this_round;
Nit: could you suffix these with _for_test so it's obvious that their access 
can be mentally ignored when reading code


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@88
PS5, Line 88:   Status GetMovesUsingRebalancingAlgo(
            :       const rebalance::ClusterRawInfo& raw_info,
            :       std::vector<rebalance::Rebalancer::ReplicaMove>* 
replica_moves,
            :       rebalance::RebalancingAlgo* algo,
            :       bool cross_location = false) const;
            :
            :   Status 
GetMoves(std::vector<rebalance::Rebalancer::ReplicaMove>* replica_moves,
            :                   bool* policy_fixing) const;
            :
            :   Status GetTabletLeader(
            :       const std::string& tablet_id,
            :       std::string* leader_uuid,
            :       HostPort* leader_hp) const;
            :
            :   Status ExecuteMoves(
            :       const std::vector<rebalance::Rebalancer::ReplicaMove>& 
replica_moves,
            :       const bool& policy_fixing);
nit: add docs. What do these and their arguments do?


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@122
PS5, Line 122: rebalancer_
Does this _need_ to be heap allocated?


http://gerrit.cloudera.org:8080/#/c/14177/3/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/3/src/kudu/master/auto_rebalancer.cc@377
PS3, Line 377:   unordered_set<string> tablets_in_move;
             :   for (const auto& move : moves) {
             :     vector<string> tablet_ids;
             :     RETURN_NOT_OK(rebalancer_->FindReplicas(move, raw_info, 
&tablet_ids));
             :     if (cross_location) {
             :       // In case of cross-location (a.k.a. inter-location) 
rebalancing it is
             :       // necessary to make sure the majority of replicas would 
not end up
             :       // at the same location after the move. If so, remove 
those tablets
             :       // from the list of candidates.
             :       
RETURN_NOT_OK(rebalancer_->FilterCrossLocationTabletCandidates(
             :           cluster_info.locality.location_by_ts_id, tpi, move, 
&tablet_ids));
             :     }
             :     // Shuffle the set of the tablet identifiers: that's to 
achieve even spread
             :     // of moves across tables with the same skew.
             :     std::shuffle(tablet_ids.begin(), tablet_ids.end(), 
random_device());
             :     string move_tablet_id;
             :     for (const auto& tablet_id : tablet_ids) {
             :       if (tablets_in_move.find(tablet_id) == 
tablets_in_move.end()) {
             :         // For now, choose the very first tablet that does not 
have replicas
             :         // in move. Later on, additional logic might be added to 
find
             :         // the best candidate.
             :         move_tablet_id = tablet_id;
             :         break;
             :       }
             :     }
> Yeah, maybe it'd make sense as a part of the rebalance namespace. 30 lines
As a reviewer, my main gripe with this is that in reviewing this patch, it's 
pretty difficult to distinguish the new code from code that has been used 
already in existing rebalancer code.

Code reuse also hopefully makes further changes to this code easier, since we 
wouldn't have to worry about updating multiple copy-paste sites.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@200
PS5, Line 200:         
SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds));
Should we just sleep at the top of the loop so we don't have to repeat this 
before every continue?

Alternatively, check out SCOPED_CLEANUP(), which is helpful to ensure something 
gets run when leaving scope.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@209
PS5, Line 209: cluster_info_
Nit: since we're blowing these away every iteration, how about just making 
these local variables and passing them around as const refs? Then it's more 
obviously correct that eg calls to Get moved are using the correct 
cluster_info_ computed in this iteration.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@215
PS5, Line 215: s.ToString() << ": could not retrieve cluster info";
nit: consider using Substitute() from gutil/strings/substitute.h here, e.g. 
Substitute("Could not retrieve cluster info: $0", s.ToString()); same elsewhere

Also maybe this should be a warning?


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@221
PS5, Line 221: (s.ok())
nit: drop the extra parens; same elsewhere


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@238
PS5, Line 238: this->I
nit: drop this->?


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@245
PS5, Line 245:     if (!FLAGS_auto_rebalancing_stop_flag) {
We should probably move this check to the top of the loop so we don't go 
through all of the above steps needlessly.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@274
PS5, Line 274: bool* policy_fixing
nit: it wasn't immediately obvious to me from "policy fixing" that this is 
referring to location awareness. Maybe rename to "should_fix_locations" or 
"has_location_violations" or something?


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@286
PS5, Line 286: replica_moves
Nit: space before


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@336
PS5, Line 336: bool cross_location
nit: for readability, consider using an enum like CrossLocations::{YES, NO}


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@334
PS5, Line 334: vector<Rebalancer::ReplicaMove>* replica_moves,
             :   rebalance::RebalancingAlgo* algo,
             :   bool cross_location
nit: mind reordering these so the pure out param (replica_moves) comes after 
the in/out and input params?


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@474
PS5, Line 474:       unique_ptr<ConsensusServiceProxy> proxy;
Does this have to be heap-allocated? Same elsewhere.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@681
PS5, Line 681: Is
nit: sounds a bit awkward. Maybe remove? or "AreCluster.."?


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@687
PS5, Line 687: max_replica_count-min_replica_count
nit: add spaces, same below


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@691
PS5, Line 691: const auto min_table_skew = table_skew_info.begin()->first;
             :   const auto m
const auto&?


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@709
PS5, Line 709:   // One location: table and cluster skew?
nit: complete sentences in comments are preferred. Same elsewhere.


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/catalog_manager.cc@5202
PS5, Line 5202:
              : TSManager* CatalogManager::ts_manager() const {
              :   return master_->ts_manager();
              : }
              :
nit: Rather than exposing this from the catalog manager, how about passing it 
into the constructor of AutoRebalancerTask?


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/rebalance/rebalancer.h
File src/kudu/rebalance/rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/rebalance/rebalancer.h@147
PS5, Line 147:   // Public accessors for fields of Rebalancer's config object.
             :   bool ShouldRunPolicyFixer() const;
             :   bool ShouldRunCrossLocationRebalancing() const;
             :   bool ShouldRunIntraLocationRebalancing()  const;
Don't need these anymore



--
To view, visit http://gerrit.cloudera.org:8080/14177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4
Gerrit-Change-Number: 14177
Gerrit-PatchSet: 5
Gerrit-Owner: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hannah Nguyen <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 23 Sep 2019 05:43:34 +0000
Gerrit-HasComments: Yes

Reply via email to