[ https://issues.apache.org/jira/browse/HBASE-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644599#comment-16644599 ]
Xiaolin Ha edited comment on HBASE-21260 at 10/10/18 7:55 AM:
--------------------------------------------------------------

[~xucang] rpCount means region plan count, not 'rpcCount'. Increasing rpCount regardless of whether a plan succeeds does not make the calculation any less accurate than originally intended, because rpCount is only used for throttling and for keeping the balance run within maxBalancingTime.


was (Author: xiaolin ha):
rpCount means region plan count, not 'rpcCount'. Increasing rpCount regardless of whether a plan succeeds does not make the calculation any less accurate than originally intended, because rpCount is only used for throttling and for keeping the balance run within maxBalancingTime.

> The whole balancer plans might be aborted if there are more than one plans to move a same region
> -------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21260
>                 URL: https://issues.apache.org/jira/browse/HBASE-21260
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer, master
>    Affects Versions: 2.1.0, 2.0.0
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>         Attachments: HBASE-21260.branch-2.001.patch, HBASE-21260.branch-2.002.patch
>
>
> In SimpleLoadBalancer, plans are first generated from the average number of regions per server for a table. Each server is randomly assigned either floor(average) or ceiling(average) regions (if the average is not an integer). Afterwards, however, the balanceOverall method might generate new plans for some regions of the table in order to balance server loads across the whole cluster. As a result, a single call of balance can contain more than one plan to move the same region.
> Currently, branch-2 uses async procedures to execute balancer plans, and concurrent moves of the same region cause the balance method to fail. Once one plan hits an exception, none of the later plans are executed.
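
The failure mode described above can be illustrated with a small Java sketch. This is not HBase code: the class `AbortingBalanceSketch` and its `Plan`, `moveAsync`, and `balance` members are hypothetical stand-ins, and the "in transition" check is reduced to a set lookup. It only assumes what the description states: a second move of a region that is already moving is rejected, and the loop has no per-plan error handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical stand-ins, not the HBase classes.
class AbortingBalanceSketch {
    /** A plan to move one region to a destination server. */
    static final class Plan {
        final String region;
        final String destination;

        Plan(String region, String destination) {
            this.region = region;
            this.destination = destination;
        }
    }

    /** Regions whose move has been submitted and is still in flight. */
    private final Set<String> inTransition = new HashSet<>();

    /** Simulates an async move: a region that is already moving is rejected. */
    void moveAsync(Plan plan) throws Exception {
        if (!inTransition.add(plan.region)) {
            throw new Exception("region=" + plan.region + " is currently in transition");
        }
    }

    /** Runs the plans in order with one try/catch around the whole loop. */
    List<String> balance(List<Plan> plans) {
        List<String> submitted = new ArrayList<>();
        try {
            for (Plan plan : plans) {
                moveAsync(plan);
                submitted.add(plan.region);
            }
        } catch (Exception e) {
            // The first failure ends the run; the remaining plans are never tried.
            System.out.println("Failed to balance: " + e.getMessage());
        }
        return submitted;
    }

    public static void main(String[] args) {
        List<Plan> plans = Arrays.asList(
            new Plan("r1", "serverA"),
            new Plan("r1", "serverB"),  // second plan for the same region
            new Plan("r2", "serverC"));
        System.out.println(new AbortingBalanceSketch().balance(plans));
    }
}
```

In this sketch the duplicate plan for `r1` is rejected, so the unrelated plan for `r2` is never submitted.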
> We have encountered this problem in practice; the logs are as follows:
> {color:#205081}2018-09-26,12:12:38,224 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] org.apache.hadoop.hbase.master.HMaster: Balancer plans size is 3757, the balance interval is 79 ms, and the max number regions in transition is 25
> 2018-09-26,12:12:38,224 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740, source=c4-hadoop-tst-st99.bj,52900,1537522783781, destination=c4-hadoop-tst-st28.bj,52900,1537520009497
> 2018-09-26,12:12:38,325 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740, source=c4-hadoop-tst-st99.bj,52900,1537522783781, destination=c4-hadoop-tst-st29.bj,52900,1537522784188
> 2018-09-26,12:12:38,325 INFO [PEWorker-16] org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=119197, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=hbase:meta, region=1588230740, REOPEN/MOVE checking lock on 1588230740
> 2018-09-26,12:12:38,325 ERROR [master/c4-hadoop-tst-ct15:52900.Chore.1] org.apache.hadoop.hbase.master.balancer.BalancerChore: Failed to balance.
> org.apache.hadoop.hbase.HBaseIOException: rit=OPEN, location=c4-hadoop-tst-st99.bj,52900,1537522783781, table=hbase:meta, region=1588230740 is currently in transition
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.preTransitCheck(AssignmentManager.java:536)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:592)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:609)
>     at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1707)
>     at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1622)
>     at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:49)
>     at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:186)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745){color}
> This is a serious problem because it often occurs when new RSs start or old RSs fail over. What's more, there is no effective way to bring the balance of the cluster back to normal.
> But the solution may be simple. We can catch exceptions when executing a plan and then just skip that plan, so that a failed plan does not affect the later plans in the list.
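
A minimal Java sketch of the catch-and-skip idea, assuming a simplified model rather than the actual patch: `SkippingBalanceSketch`, `Plan`, `moveAsync`, `balance`, and the `skipped` list are all hypothetical names, and throttling is left out. The counter is named `rpCount` after the region plan count discussed in the comment, and it is incremented whether or not a plan succeeds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical stand-ins, not the HBase classes.
class SkippingBalanceSketch {
    /** A plan to move one region to a destination server. */
    static final class Plan {
        final String region;
        final String destination;

        Plan(String region, String destination) {
            this.region = region;
            this.destination = destination;
        }
    }

    /** Regions whose move has been submitted and is still in flight. */
    private final Set<String> inTransition = new HashSet<>();

    /** Plans that failed in this run and were skipped. */
    final List<Plan> skipped = new ArrayList<>();

    /** Simulates an async move: a region that is already moving is rejected. */
    void moveAsync(Plan plan) throws Exception {
        if (!inTransition.add(plan.region)) {
            throw new Exception("region=" + plan.region + " is currently in transition");
        }
    }

    /** Runs every plan; a failing plan is recorded and skipped, not fatal. */
    int balance(List<Plan> plans) {
        int rpCount = 0;
        for (Plan plan : plans) {
            try {
                moveAsync(plan);
            } catch (Exception e) {
                skipped.add(plan);
                System.out.println("Skipping plan: " + e.getMessage());
            }
            // Counts plans processed, whether or not the move was accepted.
            rpCount++;
        }
        return rpCount;
    }

    public static void main(String[] args) {
        SkippingBalanceSketch sketch = new SkippingBalanceSketch();
        int rpCount = sketch.balance(Arrays.asList(
            new Plan("r1", "serverA"),
            new Plan("r1", "serverB"),  // second plan for the same region
            new Plan("r2", "serverC")));
        System.out.println("processed=" + rpCount + ", skipped=" + sketch.skipped.size());
    }
}
```

Here all three plans are processed, only the duplicate plan for `r1` is skipped, and the plan for `r2` is still submitted.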
> New calls of balance can pick up the failed and skipped plans.
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)