Let me try out the below in a unit test. Good on you Lars, S On Wed, Nov 9, 2016 at 2:46 AM, Lars George <lars.geo...@gmail.com> wrote:
> Wait, I am getting more confused (why must this be so hard to read?), > as in HMaster.balance() we have this > > long cutoffTime = System.currentTimeMillis() + maximumBalanceTime; > int rpCount = 0; // number of RegionPlans balanced so far > long totalRegPlanExecTime = 0; > if (plans != null && !plans.isEmpty()) { > for (RegionPlan plan: plans) { > LOG.info("balance " + plan); > long balStartTime = System.currentTimeMillis(); > //TODO: bulk assign > this.assignmentManager.balance(plan); > totalRegPlanExecTime += System.currentTimeMillis()-balStartTime; > rpCount++; > if (rpCount < plans.size() && > // if performing next balance exceeds cutoff time, exit the loop > (System.currentTimeMillis() + (totalRegPlanExecTime / > rpCount)) > cutoffTime) { > //TODO: After balance, there should not be a cutoff time > (keeping it as a security net for now) > LOG.debug("No more balancing till next balance run; > maximumBalanceTime=" + > maximumBalanceTime); > break; > } > } > } > > So the cut off time is T1 (assuming this is now when the balance() > call was made by the chore) plus the extra 5mins. rpCount is 0 and the > accrued time it took for plans to execute is also 0. Let's also assume > that we have many plans to execute, to trigger the checks. > > We record the time the assignmentManager.balance(plan) call took, and > add this to the accrued time counter, while increasing the plan > counter. We loop now until we executed all plans _or_ (and that is the > part I am after), until the current time plus "the accrued plan > execution time divided by the their count" exceeds the cut off time > computed above. > > For starters, why do we add the "rpCount < plans.size()" check at all > into that condition? What does that actually help or do? Just to skip > the very last check when the loop end and timeout coincide? > > Next, what is that division doing in the time check? It computes the > average it seems (total time divided by the count) so say if 5 plans > ran, and each took 4 mins, then the time we check "Tcurr + ( 5 * 4 > mins) / 5 > T1 + 5 mins", or "T1 + 20 mins + 4 mins > T1 + 5 mins". So > this already triggered way earlier and won't get to this. I am at a > loss to explain the logic behind this as I do not know what we usually > see in terms of number of plans and their execution times. > > But even assuming say 4 plans at 1 min each, we get "T1 + 4 mins + 1 > mins > T1 + 5 mins" - so is adding the average to skip the last one on > the assumption it would exceed the total balance time (the cut off)? > Seems as such. > > It then runs say 4 mins, sleeps 1 min, and then starts again due to > the chore firing. In the end, we continuously rearrange the cluster, > no? > > > > On Wed, Nov 9, 2016 at 11:14 AM, Lars George <lars.geo...@gmail.com> > wrote: > > Hi, > > > > In HMaster.java we have this code (added, mostly, in HBASE-3422): > > > > private int getBalancerCutoffTime() { > > int balancerCutoffTime = > > getConfiguration().getInt("hbase.balancer.max.balancing", -1); > > if (balancerCutoffTime == -1) { > > // No time period set so create one > > int balancerPeriod = > > getConfiguration().getInt("hbase.balancer.period", 300000); > > balancerCutoffTime = balancerPeriod; > > // If nonsense period, set it to balancerPeriod > > if (balancerCutoffTime <= 0) balancerCutoffTime = balancerPeriod; > > } > > return balancerCutoffTime; > > } > > > > Looking at these lines in particular, it seems that HBASE-8119 has > > added the unconditional line, make the conditional obsolete: > > > > balancerCutoffTime = balancerPeriod; > > // If nonsense period, set it to balancerPeriod > > if (balancerCutoffTime <= 0) balancerCutoffTime = balancerPeriod; > > > > Shall I just create a JIRA, patch and remove the conditional assignment? > > > > And since the cut off time is not set, it defaults to the same 5mins > > as the balancer period resolves to, making this basically a continuous > > process, as the chore is calling balance every 5 mins, and it then > > runs for 5 mins before stopping. How does this address the issue > > reported in HBASE-3422? > > > > Best, > > Lars >