Hannah Nguyen has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )
Change subject: KUDU-2780: create thread for auto-rebalancing ...................................................................... Patch Set 17: (26 comments) http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer-test.cc File src/kudu/master/auto_rebalancer-test.cc: http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer-test.cc@81 PS17, Line 81: & > Having it written as it's now, it hints on the expectation of the number of Done. Removed the similar reference below in AssignLocationsWithSkew() as well. http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer-test.cc@97 PS17, Line 97: % > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer-test.cc@359 PS17, Line 359: 1000 > Given the exponential back-off behavior of ASSERT_EVENTUALLY under CheckSom Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h File src/kudu/master/auto_rebalancer.h: http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@117 PS17, Line 117: Status GetMoves( > nit: can you also note how this can fail? Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@132 PS17, Line 132: Marks replicas that > nit: marks them with what? Done. Updated comment. http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@137 PS17, Line 137: Status ExecuteMoves( > nit: can you also note how this can fail? Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@139 PS17, Line 139: const bool& has_location_violations > nit: booleans are trivially copyable so we don't need to pass them by const Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@178 PS17, Line 178: mutable > Is 'mutable' is really needed here? Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.h@188 PS17, Line 188: wake_up_cv_ > It wasn't clear to me that this is only used to signal that the rebalancer Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc File src/kudu/master/auto_rebalancer.cc: http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@199 PS17, Line 199: // Check if shutdown was signaled. : { : lock_guard<Mutex> l(lock_); : if (closing_) return; : } : > nit: maybe this should be Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@207 PS17, Line 207: CatalogManager::ScopedLeaderSharedLock l(catalog_manager_); : if (!l.first_failed_status().ok()) { : moves_scheduled_this_round_for_test_ = 0; : continue; : } > Should we at least sleep here? Otherwise this is a tight loop for followers Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@267 PS17, Line 267: wake_up_cv_.WaitFor( : MonoDelta::FromSeconds(FLAGS_auto_rebalancing_wait_for_replica_moves_seconds)); : if (closing_) { : return; : } > This can be: Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@320 PS17, Line 320: // If no placement policy violations are found, do load rebalancing. : if (!replica_moves->empty()) { : return Status::OK(); : } > nit: move this into the policy fixer block? Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@326 PS17, Line 326: *has_location_violations = false; > nit: update a local variable and swap at the end? This bool isn't updated anywhere else; should I move the assignment to the end? http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@332 PS17, Line 332: /*cross_location=*/ > nit: don't need this now that the enum is self-documenting Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@343 PS17, Line 343: replica_moves > nit: update a local variable and swap at the end? Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@388 PS17, Line 388: replica_moves > nit: update a local variable and swap at the end? Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@450 PS17, Line 450: Mark the specified destination tserver, > nit: what mark are we making? Done. Changed the comment. http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@526 PS17, Line 526: RETURN_NOT_OK(catalog_manager_->GetAllTables(&table_infos)); > Do we need to RETURN_NOT_OK(leaderlock.first_failed_status())? BuildClusterRawInfo() shouldn't be called if the master isn't the leader, so I didn't think RETURN_NOT_OK(leaderlock.first_failed_status()) was necessary here? The test NextLeaderResumesAutoRebalancing() should verify that the new leader master picks up auto-rebalancing. http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@545 PS17, Line 545: TabletMetadataLock l(tablet.get(), LockMode::READ); > It'd be much less error-prone and much easier to read if this were called ` Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@570 PS17, Line 570: // Consensus state information is the same for all replicas of this tablet. > Is this true? Or is it a benign approximation? My guess is that it's an app Hmm, the latter is true (for auto-rebalancing, not sure for the rebalancer tool), other than counting the number of voter replicas down at L618. If unused, should this function then not bother to populate the consensus_state field of the ReplicaSummary for each replica below? (L612) http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@700 PS17, Line 700: hange 'moves_complete' to false > No longer applies? Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@707 PS17, Line 707: > nit: remove the extra space Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@715 PS17, Line 715: // TODO: Retrieve consensus state information from the CatalogManager instead. > warning: missing username/bug in TODO [google-readability-todo] Done http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@748 PS17, Line 748: orig_leader_uuid > Why is this change? This was a bug in earlier iterations! I realized after reading through the warning log of tests that replica moves weren't being checked correctly. The RPC to request consensus state should have been made to the tserver with the leader replica (L739 gets the proxy using the tserver with the leader replica). I updated the error message in L737 to clarify this. http://gerrit.cloudera.org:8080/#/c/14177/17/src/kudu/master/auto_rebalancer.cc@775 PS17, Line 775: *completion_status = Status::Incomplete(Substitute( : "tablet $0, TS $1 -> TS $2 move failed, target replica disappeared", : tablet_uuid, from_ts_uuid, to_ts_uuid)); > This isn't used if we return Status::OK(). Why not just return the error? Done -- To view, visit http://gerrit.cloudera.org:8080/14177 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4 Gerrit-Change-Number: 14177 Gerrit-PatchSet: 17 Gerrit-Owner: Hannah Nguyen <hannah.ngu...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Hannah Nguyen <hannah.ngu...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Tue, 25 Feb 2020 07:19:30 +0000 Gerrit-HasComments: Yes