Hannah Nguyen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14177 )

Change subject: KUDU-2780: create thread for auto-rebalancing
......................................................................


Patch Set 17:

(26 comments)

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer-test.cc
File src/kudu/master/auto_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer-test.cc@81
PS17, Line 81: &
> Having it written as it's now, it hints on the expectation of the number of
Done. Removed the similar reference below in AssignLocationsWithSkew() as well.


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer-test.cc@97
PS17, Line 97: %
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer-test.cc@359
PS17, Line 359: 1000
> Given the exponential back-off behavior of ASSERT_EVENTUALLY under CheckSom
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h
File src/kudu/master/auto_rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@117
PS17, Line 117:   Status GetMoves(
> nit: can you also note how this can fail?
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@132
PS17, Line 132: Marks replicas that
> nit: marks them with what?
Done. Updated comment.


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@137
PS17, Line 137:   Status ExecuteMoves(
> nit: can you also note how this can fail?
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@139
PS17, Line 139: const bool& has_location_violations
> nit: booleans are trivially copyable so we don't need to pass them by const
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@178
PS17, Line 178: mutable
> Is 'mutable' is really needed here?
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@188
PS17, Line 188: wake_up_cv_
> It wasn't clear to me that this is only used to signal that the rebalancer
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@199
PS17, Line 199:     // Check if shutdown was signaled.
              :     {
              :       lock_guard<Mutex> l(lock_);
              :       if (closing_) return;
              :     }
              :
> nit: maybe this should be
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@207
PS17, Line 207:       CatalogManager::ScopedLeaderSharedLock 
l(catalog_manager_);
              :       if (!l.first_failed_status().ok()) {
              :         moves_scheduled_this_round_for_test_ = 0;
              :         continue;
              :       }
> Should we at least sleep here? Otherwise this is a tight loop for followers
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@267
PS17, Line 267: wake_up_cv_.WaitFor(
              :           
MonoDelta::FromSeconds(FLAGS_auto_rebalancing_wait_for_replica_moves_seconds));
              :         if (closing_) {
              :           return;
              :         }
> This can be:
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@320
PS17, Line 320:   // If no placement policy violations are found, do load 
rebalancing.
              :   if (!replica_moves->empty()) {
              :     return Status::OK();
              :   }
> nit: move this into the policy fixer block?
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@326
PS17, Line 326:   *has_location_violations = false;
> nit: update a local variable and swap at the end?
This bool isn't updated anywhere else; should I move the assignment to the end?


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@332
PS17, Line 332: /*cross_location=*/
> nit: don't need this now that the enum is self-documenting
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@343
PS17, Line 343: replica_moves
> nit: update a local variable and swap at the end?
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@388
PS17, Line 388: replica_moves
> nit: update a local variable and swap at the end?
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@450
PS17, Line 450:  Mark the specified destination tserver,
> nit: what mark are we making?
Done. Changed the comment.


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@526
PS17, Line 526:     RETURN_NOT_OK(catalog_manager_->GetAllTables(&table_infos));
> Do we need to RETURN_NOT_OK(leaderlock.first_failed_status())?
BuildClusterRawInfo() shouldn't be called if the master isn't the leader, so I 
didn't think RETURN_NOT_OK(leaderlock.first_failed_status()) was necessary here?

The test NextLeaderResumesAutoRebalancing() should verify that the new leader 
master picks up auto-rebalancing.


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@545
PS17, Line 545:       TabletMetadataLock l(tablet.get(), LockMode::READ);
> It'd be much less error-prone and much easier to read if this were called `
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@570
PS17, Line 570:       // Consensus state information is the same for all 
replicas of this tablet.
> Is this true? Or is it a benign approximation? My guess is that it's an app
Hmm, the latter is true (for auto-rebalancing, not sure for the rebalancer 
tool), other than counting the number of voter replicas down at L618.
If unused, should this function then not bother to populate the consensus_state 
field of the ReplicaSummary for each replica below? (L612)


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@700
PS17, Line 700: hange 'moves_complete' to false
> No longer applies?
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@707
PS17, Line 707:
> nit: remove the extra space
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@715
PS17, Line 715: // TODO: Retrieve consensus state information from the 
CatalogManager instead.
> warning: missing username/bug in TODO [google-readability-todo]
Done


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@748
PS17, Line 748: orig_leader_uuid
> Why is this change?
This was a bug in earlier iterations! I realized after reading through the 
warning log of tests that replica moves weren't being checked correctly.
The RPC to request consensus state should have been made to the tserver with 
the leader replica (L739 gets the proxy using the tserver with the leader 
replica). I updated the error message in L737 to clarify this.


http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@775
PS17, Line 775: *completion_status = Status::Incomplete(Substitute(
              :         "tablet $0, TS $1 -> TS $2 move failed, target replica 
disappeared",
              :         tablet_uuid, from_ts_uuid, to_ts_uuid));
> This isn't used if we return Status::OK(). Why not just return the error?
Done



--
To view, visit http://gerrit.cloudera.org:8080/14177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4
Gerrit-Change-Number: 14177
Gerrit-PatchSet: 17
Gerrit-Owner: Hannah Nguyen <hannah.ngu...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Hannah Nguyen <hannah.ngu...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 25 Feb 2020 07:19:30 +0000
Gerrit-HasComments: Yes

Reply via email to