Duo Zhang created HBASE-30093:
---------------------------------
Summary: LoadBalancer related tests timed out
Key: HBASE-30093
URL: https://issues.apache.org/jira/browse/HBASE-30093
Project: HBase
Issue Type: Sub-task
Components: Balancer, test
Reporter: Duo Zhang
Sonnet 4.5(4.6?) summary
TestStochasticLoadBalancerRegionReplicaSameHosts
Root cause
StochasticLoadBalancer.balanceTable() called cluster.setStopRequestedAt(startTime
+ maxRunningTime) at the very beginning of the method, before building costs
(initCosts), the first computeCost, needsBalance, and other setup.
So the entire maxRunningTime budget (e.g. 250 ms in StochasticBalancerTestBase)
covered initialization + search, not just the stochastic walk.
On a slow or busy CI host, setup alone could take ~250 ms or more. Then when
the walk started, isStopRequested() was already true (or became true after a
single rejected move). The loop could exit with step still 0, no accepted
moves, and balanceCluster returning null even though region replicas were
still colocated on the same host, so
TestStochasticLoadBalancerRegionReplicaSameHosts failed. Locally, setup is
usually only tens of ms, so the bug rarely shows up.
Fix
Call setStopRequestedAt only after initialization is done, immediately before
the stochastic for loop, using a new timestamp (searchStartTime +
maxRunningTime). That way maxRunningTime applies to the search phase only,
which matches the intended meaning of the limit and avoids “no search time
left” on loaded agents.
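The effect of moving the deadline can be sketched with a small, deterministic
model. This is not the HBase code: DeadlineSketch, FakeClock, and run() are
hypothetical names, the setup phase is reduced to a single clock advance, and
the stop check is simplified to a comparison at the top of the loop. Only the
placement of the deadline (startTime + maxRunningTime versus searchStartTime +
maxRunningTime) follows the description above.

```java
import java.util.function.LongSupplier;

/** Hypothetical, simplified model of the deadline placement; not the actual balancer code. */
final class DeadlineSketch {

  /** Manually advanced clock, so the outcome does not depend on host speed. */
  static final class FakeClock implements LongSupplier {
    private long nowMs;

    @Override
    public long getAsLong() {
      return nowMs;
    }

    void advance(long ms) {
      nowMs += ms;
    }
  }

  /**
   * Returns how many search steps ran before the deadline or step limit was hit.
   *
   * @param deadlineBeforeSetup true models the old placement, false models the fix
   */
  static int run(boolean deadlineBeforeSetup, long setupMs, long stepMs,
      long maxRunningTimeMs, int maxSteps) {
    FakeClock clock = new FakeClock();
    long startTime = clock.getAsLong();
    // Old placement: the budget starts counting before any setup work.
    long stopRequestedAt = startTime + maxRunningTimeMs;

    // Stands in for initCosts, the first computeCost, needsBalance, and other setup.
    clock.advance(setupMs);

    if (!deadlineBeforeSetup) {
      // New placement: the budget starts counting right before the search loop.
      long searchStartTime = clock.getAsLong();
      stopRequestedAt = searchStartTime + maxRunningTimeMs;
    }

    int steps = 0;
    while (steps < maxSteps && clock.getAsLong() < stopRequestedAt) {
      clock.advance(stepMs); // one candidate move, accepted or rejected
      steps++;
    }
    return steps;
  }

  public static void main(String[] args) {
    // Slow host: 300 ms of setup against a 250 ms budget.
    System.out.println("old placement: " + run(true, 300, 1, 250, 1000) + " steps");
    System.out.println("new placement: " + run(false, 300, 1, 250, 1000) + " steps");
  }
}
```

In this model, a setup phase longer than maxRunningTime leaves the old
placement with no search steps at all, while the new placement always gets the
full budget for the walk regardless of how long setup took.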