Duo Zhang created HBASE-30093:
---------------------------------

             Summary: LoadBalancer related tests timed out
                 Key: HBASE-30093
                 URL: https://issues.apache.org/jira/browse/HBASE-30093
             Project: HBase
          Issue Type: Sub-task
          Components: Balancer, test
            Reporter: Duo Zhang


Sonnet 4.5(4.6?) summary

TestStochasticLoadBalancerRegionReplicaSameHosts

Root cause
StochasticLoadBalancer.balanceTable() called 
cluster.setStopRequestedAt(startTime + maxRunningTime) at the very beginning of 
the method, before building costs (initCosts), the first computeCost, 
needsBalance, and other setup.

So the entire maxRunningTime budget (e.g. 250 ms in StochasticBalancerTestBase) 
covered initialization as well as the search, not just the stochastic walk.

On a slow or busy CI host, setup alone could take ~250 ms or more. By the time 
the walk started, isStopRequested() was already true (or became true after a 
single rejected move). The loop could exit with step still 0 and no accepted 
moves, and balanceCluster returned null even though region replicas were 
still colocated on the same host, so 
TestStochasticLoadBalancerRegionReplicaSameHosts failed. Locally, setup usually 
takes only tens of ms, so the bug rarely shows up.

Fix
Call setStopRequestedAt only after initialization is done, immediately before 
the stochastic for loop, using a new timestamp (searchStartTime + 
maxRunningTime). That way maxRunningTime applies to the search phase only, 
which matches the intended meaning of the limit and avoids “no search time 
left” on loaded agents.
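
The effect of moving the deadline can be illustrated with a small standalone 
sketch. This is not the HBase code: DeadlineSketch, FakeClock, run() and all 
parameters are hypothetical names made up for illustration, and the 
strictly-greater-than deadline check is an assumption. A fake clock stands in 
for wall-clock time so the result does not depend on how fast the host is.

```java
import java.util.function.LongSupplier;

public class DeadlineSketch {

    /** Hypothetical fake clock: time advances only when advance() is called. */
    static final class FakeClock implements LongSupplier {
        private long nowMs;
        void advance(long ms) { nowMs += ms; }
        @Override public long getAsLong() { return nowMs; }
    }

    /**
     * Simulates one balancing run and returns the number of search steps taken.
     *
     * @param deadlineAfterSetup false = deadline computed at method start (old
     *                           behaviour), true = computed just before the loop
     * @param setupMs            simulated cost of initialization
     * @param maxRunningTimeMs   time budget
     * @param stepMs             simulated cost of one search step
     * @param maxSteps           upper bound on steps
     */
    static int run(boolean deadlineAfterSetup, long setupMs, long maxRunningTimeMs,
                   long stepMs, int maxSteps) {
        FakeClock clock = new FakeClock();

        // Old behaviour: the deadline is fixed before any setup work happens.
        long stopRequestedAt = clock.getAsLong() + maxRunningTimeMs;

        // Stands in for initCosts, the first computeCost, needsBalance, etc.
        clock.advance(setupMs);

        if (deadlineAfterSetup) {
            // Fixed behaviour: restart the budget from the search start time.
            long searchStartTime = clock.getAsLong();
            stopRequestedAt = searchStartTime + maxRunningTimeMs;
        }

        int steps = 0;
        while (steps < maxSteps) {
            if (clock.getAsLong() > stopRequestedAt) {
                break; // assumed equivalent of isStopRequested() returning true
            }
            clock.advance(stepMs); // one candidate move evaluated
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        // Setup (300 ms) exceeds the 250 ms budget, as on a loaded CI agent.
        System.out.println("deadline before setup: " + run(false, 300, 250, 10, 1000));
        System.out.println("deadline after setup:  " + run(true, 300, 250, 10, 1000));
    }
}
```

With a simulated 300 ms setup and a 250 ms budget, the first variant exits the 
loop before taking a single step, while the second still gets its full search 
budget. With a short setup (say 20 ms) both variants take steps, which matches 
the observation that the failure rarely reproduces locally.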



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
