[ 
https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555316#comment-16555316
 ] 

chenyang commented on HBASE-20919:
----------------------------------

Submitted HBASE-20919-branch-2.0-02.patch, which implements fail-fast behavior and adds logging.

Testing steps (the cluster has one master and one rs):

1: start master without 02.patch

2: start rs without 02.patch 

    now, the cluster works fine

3: stop rs

4: stop master

5: restart master

6: restart rs

    Now the hbase:meta region cannot be assigned successfully.

7: stop rs

8: stop master

9: apply 02.patch to branch-2.0, recompile the hbase-rsgroup module and replace 
hbase-rsgroup-2.0.2-SNAPSHOT.jar with the new version, which includes 02.patch

10: restart master

11: restart rs

    Now the hbase:meta region can be assigned successfully and the cluster works fine.

 

Logs:

hbase-hbase-master-bjpg-rs4729.yz02.log.no_02patch covers steps 1 to 8.

In this log file, you can see that RSGroupInfoManagerImpl$RSGroupStartupWorker kept 
checking whether the meta region was online, and the check failed every time. 

 
{code:java}
2018-07-25 12:15:15,064 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4729.yz02,16000,1532491114549] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=bjpg-rs4736.yz02,16020,1532490935452, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on bjpg-rs4736.yz02,16020,1532491949108
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3246)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3223)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729)
at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
{code}
hbase-hbase-master-bjpg-rs4729.yz02.log.with_02patch covers steps 9 to 10.

In this log file, you can see that the hbase:meta region was eventually assigned 
successfully after several failed attempts. 
{code:java}
2018-07-25 14:27:12,356 WARN [master/bjpg-rs4729:16000] rsgroup.RSGroupBasedLoadBalancer: RSGroupBasedLoadBalancer has not been initialized
org.apache.hadoop.hbase.HBaseIOException: RSGroupBasedLoadBalancer has not been initialized
at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.checkInitializedState(RSGroupBasedLoadBalancer.java:480)
at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:161)
{code}
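
For readers who want the idea without opening the patch, here is a minimal sketch of the fail-fast guard that the trace above points at. It is not the code from 02.patch: GuardedBalancer, FailFastDemo and retryUntilAssigned are hypothetical names, a plain java.io.IOException stands in for HBaseIOException, and the "rs1" target is made up. It only illustrates that a call made before initialization raises a checked exception the caller can retry, rather than an NPE.

```java
import java.io.IOException;

/** Hypothetical stand-in for a balancer whose state is set up asynchronously. */
class GuardedBalancer {
    // volatile so a caller on another thread sees the flag once it is set
    private volatile boolean initialized = false;

    void initialize() {
        initialized = true;
    }

    /** Fails fast with a checked exception instead of touching unset state. */
    String assign(String region) throws IOException {
        checkInitializedState();
        return region + " -> rs1"; // "rs1" is a made-up target server
    }

    private void checkInitializedState() throws IOException {
        if (!initialized) {
            throw new IOException("balancer has not been initialized");
        }
    }
}

final class FailFastDemo {
    /**
     * Retries the assignment until it succeeds and returns the number of
     * failed attempts. The balancer is initialized once the failure count
     * reaches readyAfterFailures, which simulates the master finishing its
     * setup while the caller is retrying.
     */
    static int retryUntilAssigned(GuardedBalancer balancer, String region,
                                  int readyAfterFailures) {
        int failures = 0;
        while (true) {
            try {
                balancer.assign(region);
                return failures;
            } catch (IOException e) {
                failures++;
                if (failures >= readyAfterFailures) {
                    balancer.initialize();
                }
            }
        }
    }

    public static void main(String[] args) {
        int failures = retryUntilAssigned(new GuardedBalancer(), "hbase:meta", 3);
        System.out.println("assigned after " + failures + " failed attempts");
    }
}
```

Because the early call fails with an exception the caller can handle, the assignment can be retried once initialization finishes, which matches the "failed a few times, then succeeded" pattern in the with_02patch log.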

> meta region can't be re-onlined when restarting cluster if opening rsgroup
> --------------------------------------------------------------------------
>
>                 Key: HBASE-20919
>                 URL: https://issues.apache.org/jira/browse/HBASE-20919
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer, master, rsgroup
>    Affects Versions: 2.0.1
>            Reporter: chenyang
>            Priority: Major
>         Attachments: HBASE-20919-branch-2.0-01.patch, bug2.png, 
> hbase-hbase-master-bjpg-rs4730.yz02.log.test
>
>
> If you enable rsgroup, hbase-site.xml contains the configuration below.
> {code:java}
> <property>
>   <name>hbase.coprocessor.master.classes</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value>
> </property>
> <property>
>   <name>hbase.master.loadbalancer.class</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value>
> </property>
> {code}
> And you shut down the whole HBase cluster in this order:
>  # first shut down the region servers one by one
>  # shut down the master
> Then you restart the whole cluster in this order:
>  # start the master
>  # start the region server
> The hbase:meta region cannot come back online and the rsgroup cannot be 
> initialized successfully.
>  Master logs:
> {code:java}
> 2018-07-12 18:27:08,775 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come online
> 2018-07-12 18:27:08,876 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=bjpg-rs4732.yz02,60020,1531388712053, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928
> at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249)
> at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
> at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729)
> at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> {code}
> The logs show that the hbase:meta region is not online and rsgroup keeps 
> retrying to initialize.
>   
>  But why is the hbase:meta region not online?
>  The info-level logs and jstack did not contain enough information, so I added 
> some debug logs to the source code for testing. I then checked the master's 
> and region server's logs and found that the meta region assign procedure, 
> which holds the meta region lock, never completed and never released the 
> lock, so the recoverMetaProcedure could not be executed. 
>   
>  Why did the first procedure not complete and not release the meta region lock?
>  In the test logs, I found that when assignmentManager assigns the region, it 
> needs to call the rsgroup balancer, which has not been completely 
> initialized, so an NPE is thrown. As a result, the procedure never completes 
> and never releases the lock.
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262)
> at 
> org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162)
> at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864)
> at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809)
> at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113)
> at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693)
> {code}
> !bug2.png!
> As shown in bug2.png (listed in the attachments), when we shut down 
> the last region server, the master submits a ServerCrashProcedure. The 
> procedure tries to reassign the hbase:meta region, but at that moment there 
> is no online region server, so the procedure cannot complete. 
> Then we shut down the master, and the ServerCrashProcedure and its 
> subprocedures are stored in the procedureStore.
>   
>  When we restart the master, it first blocks waiting to become the active 
> master. After becoming the active master, it starts the procedureExecutor. 
> The procedureExecutor starts reading procedures from the procedureStore, and 
> the previously stored ServerCrashProcedure submits an assign-region task to 
> assignmentManager's queue. The processQueue thread and the active-master 
> thread then block waiting for online region servers. When we start a region 
> server, the active-master thread performs some operations and initializes the 
> rsgroup balancer. At the same time, the processQueue thread starts calling 
> the balancer. If the processQueue thread runs faster than the active-master 
> thread, the processQueue thread throws an NPE. As a result, the procedure 
> never completes and never releases the hbase:meta region lock.
>   
>   My first solution was to initialize the balancer before calling 
> startServiceThreads in finishActiveMasterInitialization() of HMaster, but 
> this may have some side effects on the master.
>   Based on stack's suggestion, I re-submitted a new patch which waits for the 
> rsgroup balancer to be initialized before calling the balance methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
