[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554582#comment-16554582 ]
Josh Elser commented on HBASE-20919:
------------------------------------

{quote}Delete old patch.
{quote}
You can leave the old patches attached, [~HB-CY]! We like seeing how your patches evolve :)
{quote}That is a beautiful diagram
{quote}
+1 on this. You have made a very nice diagram to explain this complicated setup!
{code:java}
@@ -156,6 +158,7 @@ public class RSGroupBasedLoadBalancer implements RSGroupableBalancer {
   @Override
   public Map<ServerName, List<RegionInfo>> roundRobinAssignment(
       List<RegionInfo> regions, List<ServerName> servers) throws HBaseIOException {
+    checkAndWaitInitialization();{code}
What about failing fast here, and having the caller decide how to handle the retry logic? AssignmentManager should already have logic to do this. I am worried about whether this eventually converges. Specifically, RSGroupLoadBalancer doesn't get initialized until after hbase:meta gets assigned, but hbase:meta can't be assigned until the RSGroupLoadBalancer is initialized, so we soft-lock. Have I missed something, [~HB-CY]?

This is hard because, while I don't disagree with Stack's comment about StochasticLB to RSGroupLB, the Master using the LoadBalancer before it was initialized is bad.

If you're seeing this on 2.0.1, then HBASE-20708 isn't related yet. Do you have more logs you can share?

> meta region can't be re-onlined when restarting cluster if opening rsgroup
> --------------------------------------------------------------------------
>
> Key: HBASE-20919
> URL: https://issues.apache.org/jira/browse/HBASE-20919
> Project: HBase
> Issue Type: Bug
> Components: Balancer, master, rsgroup
> Affects Versions: 2.0.1
> Reporter: chenyang
> Priority: Major
> Attachments: HBASE-20919-branch-2.0-01.patch, bug2.png, hbase-hbase-master-bjpg-rs4730.yz02.log.test
>
>
> If you enable rsgroup, hbase-site.xml contains the configuration below.
> {code:java}
> <property>
>   <name>hbase.coprocessor.master.classes</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value>
> </property>
> <property>
>   <name>hbase.master.loadbalancer.class</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value>
> </property>
> {code}
> Suppose you shut down the whole HBase cluster this way:
> # first shut down the region servers one by one
> # shut down the master
> Then you restart the whole cluster this way:
> # start the master
> # start a region server
> The hbase:meta region cannot be brought back online and the rsgroup cannot be initialized successfully.
> master logs:
> {code:java}
> 2018-07-12 18:27:08,775 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come online
> 2018-07-12 18:27:08,876 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=bjpg-rs4732.yz02,60020,1531388712053, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249)
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
>     at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729)
>     at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>     at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> {code}
> The logs show that the hbase:meta region is not online and rsgroup keeps retrying to initialize.
>
> But why is the hbase:meta region not online?
> The info-level logs and jstack did not have enough information, so I added some debug logs to a test build of the source. I then checked the master's logs and the region server's logs, and found that the meta region assign procedure, which holds the meta region lock, never completed and never released the lock, so the recoverMetaProcedure could not be executed.
>
> Why did the first procedure not complete and not release the meta region lock?
> In the test logs, I found that when the assignmentManager assigned the region, it needed to call the rsgroup balancer, which had not been completely initialized, so it threw an NPE. As a result, the procedure never completed and never released the lock.
> {code:java}
> java.lang.NullPointerException
>     at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262)
>     at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693)
> {code}
> !bug2.png!
> As shown in the figure bug2.png listed in the attachments, when we shut down the last region server, the master submits a ServerCrashProcedure.
> In that procedure, it tries to reassign the hbase:meta region, but at that moment there is no online region server, so the procedure cannot run to completion.
> Then we shut down the master, and the ServerCrashProcedure and its sub-procedures are stored in the procedureStore.
>
> When we restart the master, it first blocks waiting to become the active master. After becoming the active master, it starts the procedureExecutor.
> The procedureExecutor starts reading procedures from the procedureStore, and the previous ServerCrashProcedure submits an assign-region task to the assignmentManager's queue. The processQueue thread and the active-master thread both block waiting for online region servers. When we start a region server, the active-master thread does some operations and initializes the rsgroup balancer. At the same time, the processQueue thread starts to call the balancer. If the processQueue thread runs faster than the active-master thread, the processQueue thread throws an NPE. As a result, the procedure never completes and never releases the hbase:meta region lock.
>
> My first solution was to initialize the balancer before calling startServiceThreads in finishActiveMasterInitialization() of HMaster. But this may have some side effects for the master.
> Based on Stack's suggestion, I have re-submitted a new patch that waits for the rsgroup balancer to be initialized before calling the balance methods.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
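
The fail-fast alternative raised in the comment (reject the call while the balancer is uninitialized and let the caller own the retry loop) could look roughly like the sketch below. This is a toy model only: `SketchBalancer`, `NotInitializedException`, and `retry` are hypothetical names invented for illustration, not HBase classes or the patch's actual code, and regions and servers are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public class FailFastBalancerSketch {

  /** Hypothetical exception type for this sketch; not part of HBase. */
  static class NotInitializedException extends RuntimeException {
    NotInitializedException(String msg) {
      super(msg);
    }
  }

  /** Toy stand-in for a balancer; NOT the real RSGroupBasedLoadBalancer. */
  static class SketchBalancer {
    private volatile boolean initialized = false;

    void initialize() {
      initialized = true;
    }

    /** Fails fast instead of blocking (or hitting an NPE) when called too early. */
    Map<String, List<String>> roundRobinAssignment(List<String> regions, List<String> servers) {
      if (!initialized) {
        throw new NotInitializedException("balancer not initialized yet");
      }
      if (servers.isEmpty()) {
        throw new IllegalArgumentException("no servers to assign to");
      }
      Map<String, List<String>> plan = new HashMap<>();
      for (int i = 0; i < regions.size(); i++) {
        String server = servers.get(i % servers.size());
        plan.computeIfAbsent(server, s -> new ArrayList<>()).add(regions.get(i));
      }
      return plan;
    }
  }

  /** Caller-side retry: the caller, not the balancer, decides how long to keep trying. */
  static <T> T retry(Supplier<T> call, int maxAttempts, long sleepMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    NotInitializedException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.get();
      } catch (NotInitializedException e) {
        last = e;
        try {
          Thread.sleep(sleepMillis);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while retrying", ie);
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) {
    SketchBalancer balancer = new SketchBalancer();
    // Simulate the "active master" thread finishing initialization a little later.
    Thread init = new Thread(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      balancer.initialize();
    });
    init.start();
    // Simulate the assignment thread, which may run before initialization is done.
    Map<String, List<String>> plan = retry(
        () -> balancer.roundRobinAssignment(List.of("r1", "r2", "r3"), List.of("s1", "s2")),
        40, 25L);
    System.out.println(plan);
  }
}
```

In this arrangement the assignment thread never blocks inside the balancer, so a caller that also has to make progress on other work (for example, assigning hbase:meta) stays free to do so between attempts. Whether AssignmentManager's existing retry handling covers this case is the open question in the comment, not something this sketch settles.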