[jira] [Comment Edited] (HBASE-21260) The whole balancer plans might be aborted if there are more than one plans to move a same region

Xiaolin Ha (JIRA) Wed, 10 Oct 2018 00:56:59 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644599#comment-16644599
 ]


Xiaolin Ha edited comment on HBASE-21260 at 10/10/18 7:55 AM:
--------------------------------------------------------------

[~xucang] rpCount means region plan count, not 'rpcCount'. Increasing rpCount 
regardless of whether a plan succeeds does not make the calculation less 
accurate than originally intended, because rpCount is only used for throttling 
and for keeping the total balance time within maxBlancingTime.
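
To illustrate the point about rpCount, here is a minimal sketch, not the actual 
HMaster code. The class name, method name, and the simulated clock are 
hypothetical, and it assumes the throttle paces plan number rpCount to no 
earlier than start + rpCount * interval. Under that assumption the counter only 
drives pacing and the cutoff check, so it advances for every plan handled, 
whether or not the move succeeded.

```java
public class ThrottleSketch {
  /**
   * Simulates how many plans are handled before the time budget runs out.
   * Times are in milliseconds relative to the start of the balance run.
   * ASSUMPTION: plan number rpCount is paced to start + rpCount * intervalMs.
   */
  static int plansSubmitted(int planCount, long intervalMs, long maxBalancingTimeMs) {
    int rpCount = 0;
    long now = 0; // simulated clock, hypothetical stand-in for wall-clock time
    for (int i = 0; i < planCount; i++) {
      // The plan is handed off here; success or failure does not matter for pacing.
      rpCount++;
      // Throttle: wait until the slot reserved for the next plan.
      now = Math.max(now, rpCount * intervalMs);
      // Stop early if plans remain but the time budget is already exceeded.
      if (rpCount < planCount && now > maxBalancingTimeMs) {
        break;
      }
    }
    return rpCount;
  }

  public static void main(String[] args) {
    System.out.println(plansSubmitted(10, 100, 350));
  }
}
```

Counting only successful plans would change the pacing of later plans but not 
the meaning of the limit, which is why the counter can ignore the outcome.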


was (Author: xiaolin ha):
rpCount means region plan count, not 'rpcCount'. Increasing rpCount regardless 
of whether a plan succeeds does not make the calculation less accurate than 
originally intended, because rpCount is only used for throttling and for 
keeping the total balance time within maxBlancingTime.

> The whole balancer plans might be aborted if there are more than one plans to 
> move a same region 
> -------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21260
>                 URL: https://issues.apache.org/jira/browse/HBASE-21260
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer, master
>    Affects Versions: 2.1.0, 2.0.0
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>         Attachments: HBASE-21260.branch-2.001.patch, 
> HBASE-21260.branch-2.002.patch
>
>
> In SimpleLoadBalancer, plans are first generated from the average number of 
> regions per server for a table. Each server is randomly assigned either 
> floor(average) or ceiling(average) regions (if the average is not an 
> integer). Afterwards, however, the balanceOverall method may generate new 
> plans for some regions of the table to balance server load across the whole 
> cluster. As a result, a single call of balance can contain more than one 
> plan to move the same region. 
> Currently, branch-2 uses async procedures to execute balancer plans, and 
> concurrent moves of the same region cause the balance method to fail. Once 
> one plan hits an exception, none of the subsequent plans are executed.
> We have encountered this problem in practice; the logs are as follows:
> {color:#205081}2018-09-26,12:12:38,224 INFO 
> [master/c4-hadoop-tst-ct15:52900.Chore.1] 
> org.apache.hadoop.hbase.master.HMaster: Balancer plans size is 3757, the 
> balance interval is 79 ms, and the max number regions in transition is 25
> 2018-09-26,12:12:38,224 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] 
> org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740, 
> source=c4-hadoop-tst-st99.bj,52900,1537522783781, 
> destination=c4-hadoop-tst-st28.bj,52900,1537520009497
> 2018-09-26,12:12:38,325 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] 
> org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740, 
> source=c4-hadoop-tst-st99.bj,52900,1537522783781, 
> destination=c4-hadoop-tst-st29.bj,52900,1537522784188
> 2018-09-26,12:12:38,325 INFO [PEWorker-16] 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: 
> pid=119197, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, REOPEN/MOVE 
> checking lock on 1588230740
> 2018-09-26,12:12:38,325 ERROR [master/c4-hadoop-tst-ct15:52900.Chore.1] 
> org.apache.hadoop.hbase.master.balancer.BalancerChore: Failed to balance.
> org.apache.hadoop.hbase.HBaseIOException: rit=OPEN, 
> location=c4-hadoop-tst-st99.bj,52900,1537522783781, table=hbase:meta, 
> region=1588230740 is currently in transition
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.preTransitCheck(AssignmentManager.java:536)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:592)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:609)
>         at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1707)
>         at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1622)
>         at 
> org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:49)
>         at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:186)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at 
> org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745){color}
> This is a serious problem because it often occurs when new RSs start or old 
> RSs fail over. What is more, there is no effective way to bring the balance 
> of the cluster back to normal.
> But the solution to this problem may be simple. We can catch exceptions when 
> executing a plan and then just skip that plan, so that a failed plan does 
> not affect the later plans in the list. Subsequent calls of balance can pick 
> up the failed and skipped plans.
>  
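
The fix proposed in the description above can be sketched as follows. This is 
an illustration only, not the attached patch: Plan, Mover, and executeAll are 
hypothetical stand-ins, and the mover simply rejects a second move of a region 
that already has a move in flight, mimicking the "is currently in transition" 
failure in the log.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SkipFailedPlans {
  /** Hypothetical stand-in for a region plan: a region and its destination server. */
  static class Plan {
    final String region;
    final String destination;
    Plan(String region, String destination) {
      this.region = region;
      this.destination = destination;
    }
  }

  /** Hypothetical mover that refuses to move a region already in transition. */
  static class Mover {
    final Set<String> inTransition = new HashSet<>();
    void moveAsync(Plan plan) throws IOException {
      if (!inTransition.add(plan.region)) {
        throw new IOException(plan.region + " is currently in transition");
      }
    }
  }

  /** Submits every plan; a failing plan is recorded and skipped instead of aborting the run. */
  static List<Plan> executeAll(List<Plan> plans, Mover mover) {
    List<Plan> skipped = new ArrayList<>();
    for (Plan plan : plans) {
      try {
        mover.moveAsync(plan);
      } catch (IOException e) {
        // Skip this plan only; a later balance run can pick it up again.
        skipped.add(plan);
      }
    }
    return skipped;
  }

  public static void main(String[] args) {
    List<Plan> plans = Arrays.asList(
        new Plan("1588230740", "st28"),
        new Plan("1588230740", "st29"), // second plan for the same region
        new Plan("regionB", "st30"));
    List<Plan> skipped = executeAll(plans, new Mover());
    System.out.println(skipped.size() + " plan(s) skipped");
  }
}
```

With the try/catch inside the loop, the duplicate plan for the same region is 
skipped and the plan for the other region is still submitted. With the 
exception propagating out of the loop, that last plan would never run.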



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
