[ 
https://issues.apache.org/jira/browse/HBASE-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21260:
------------------------------
        Fix Version/s: 2.2.0
                       3.0.0
    Affects Version/s:     (was: 2.1.0)
                           (was: 2.0.0)
               Status: Patch Available  (was: Open)

> The whole balancer plans might be aborted if there are more than one plans to 
> move a same region 
> -------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21260
>                 URL: https://issues.apache.org/jira/browse/HBASE-21260
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer, master
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>             Fix For: 3.0.0, 2.2.0
>
>         Attachments: HBASE-21260.branch-2.001.patch, 
> HBASE-21260.branch-2.002.patch
>
>
> In SimpleLoadBalancer, plans are first generated from the average number of 
> regions per server for a table. Each server is randomly assigned either 
> floor(average) or ceiling(average) regions (if the average is not an 
> integer). Afterwards, however, the balanceOverall method might generate new 
> plans for some regions of the table to balance server loads across the whole 
> cluster. As a result, a single call of balance can produce more than one 
> plan to move the same region. 
> Currently, branch-2 uses async procedures to execute balancer plans, so 
> concurrent moves of the same region cause the balance method to fail. Once 
> one plan encounters an exception, none of the subsequent plans are executed.
> We have encountered this problem in practice; the logs are as follows:
> {color:#205081}2018-09-26,12:12:38,224 INFO 
> [master/c4-hadoop-tst-ct15:52900.Chore.1] 
> org.apache.hadoop.hbase.master.HMaster: Balancer plans size is 3757, the 
> balance interval is 79 ms, and the max number regions in transition is 25
> 2018-09-26,12:12:38,224 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] 
> org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740, 
> source=c4-hadoop-tst-st99.bj,52900,1537522783781, 
> destination=c4-hadoop-tst-st28.bj,52900,1537520009497
> 2018-09-26,12:12:38,325 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] 
> org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740, 
> source=c4-hadoop-tst-st99.bj,52900,1537522783781, 
> destination=c4-hadoop-tst-st29.bj,52900,1537522784188
> 2018-09-26,12:12:38,325 INFO [PEWorker-16] 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: 
> pid=119197, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, REOPEN/MOVE 
> checking lock on 1588230740
> 2018-09-26,12:12:38,325 ERROR [master/c4-hadoop-tst-ct15:52900.Chore.1] 
> org.apache.hadoop.hbase.master.balancer.BalancerChore: Failed to balance.
> org.apache.hadoop.hbase.HBaseIOException: rit=OPEN, 
> location=c4-hadoop-tst-st99.bj,52900,1537522783781, table=hbase:meta, 
> region=1588230740 is currently in transition
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.preTransitCheck(AssignmentManager.java:536)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:592)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:609)
>         at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1707)
>         at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1622)
>         at 
> org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:49)
>         at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:186)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at 
> org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745){color}
> This is a serious problem because it often occurs when new RSs start or old 
> RSs fail over. What's more, there is no effective way to bring the balance 
> of the cluster back to normal.
> The solution to this problem may be simple, though. We can catch exceptions 
> when executing a plan and just skip that plan, so that a failed plan does 
> not affect the later plans in the list. Subsequent calls of balance can pick 
> up the failed and skipped plans.
>  
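The proposed fix can be sketched roughly as follows. This is a minimal, self-contained illustration and not the attached patch: the Plan class, the Mover interface, the executePlans method, and the shortened server names are all hypothetical stand-ins, and the real code paths in HMaster and AssignmentManager may look quite different. It only shows the idea of catching the exception per plan so that one failed move does not abort the remaining plans.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SkipFailedPlans {
  /** Hypothetical stand-in for a balancer plan: move one region to a destination. */
  static final class Plan {
    final String region;
    final String destination;

    Plan(String region, String destination) {
      this.region = region;
      this.destination = destination;
    }
  }

  /** Hypothetical stand-in for the call that submits an async move. */
  interface Mover {
    void moveAsync(Plan plan) throws IOException;
  }

  /**
   * Tries every plan in order. A plan that throws is recorded and skipped,
   * so the loop continues with the later plans instead of aborting.
   * Returns the skipped plans; a later balance run could regenerate them.
   */
  static List<Plan> executePlans(List<Plan> plans, Mover mover) {
    List<Plan> skipped = new ArrayList<>();
    for (Plan plan : plans) {
      try {
        mover.moveAsync(plan);
      } catch (IOException e) {
        skipped.add(plan);
      }
    }
    return skipped;
  }

  public static void main(String[] args) {
    // Simulated "region in transition" check: a second move of the same region fails.
    Set<String> inTransition = new HashSet<>();
    Mover mover = plan -> {
      if (!inTransition.add(plan.region)) {
        throw new IOException("region=" + plan.region + " is currently in transition");
      }
    };

    List<Plan> plans = Arrays.asList(
        new Plan("1588230740", "st28"),
        new Plan("1588230740", "st29"), // duplicate plan for the same region
        new Plan("r2", "st28"));

    List<Plan> skipped = executePlans(plans, mover);
    System.out.println("skipped=" + skipped.size() + ", submitted=" + inTransition.size());
  }
}
```

In this simulation the duplicate plan for region 1588230740 is skipped while the plan for the unrelated region r2 is still submitted, which is the behavior the description asks for.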



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
