[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552030#comment-16552030 ]

Ted Yu commented on HBASE-20919:
--------------------------------

Instead of a png file, can you attach a patch to this JIRA?

> meta region can't be re-onlined when restarting cluster if opening rsgroup
> --------------------------------------------------------------------------
>
>                 Key: HBASE-20919
>                 URL: https://issues.apache.org/jira/browse/HBASE-20919
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer, master, rsgroup
>    Affects Versions: 2.0.1
>            Reporter: chenyang
>            Priority: Major
>         Attachments: bug2.png, fix-bugs.png, hbase-hbase-master-bjpg-rs4730.yz02.log.test
>
>
> If you enable rsgroup, hbase-site.xml contains the configuration below.
> {code:java}
> <property>
>   <name>hbase.coprocessor.master.classes</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value>
> </property>
> <property>
>   <name>hbase.master.loadbalancer.class</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value>
> </property>
> {code}
> And you shut down the whole HBase cluster this way:
> # first shut down the region servers one by one
> # shut down the master
> Then you restart the whole cluster this way:
> # start the master
> # start the region servers
> The hbase:meta region cannot be brought back online and the rsgroup cannot be initialized successfully.
> master logs:
> {code:java}
> 2018-07-12 18:27:08,775 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come online
> 2018-07-12 18:27:08,876 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=bjpg-rs4732.yz02,60020,1531388712053, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249)
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
>   at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729)
>   at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
> {code}
> The logs show that the hbase:meta region is not online and rsgroup keeps retrying to initialize.
>
> But why is the hbase:meta region not online?
> The info-level logs and jstack did not have enough information, so I added some debug logs to the test source code. Then I checked the master's logs and the region server's logs, and found that the meta region assign procedure, which holds the meta region lock, never completed and never released the lock, so the recoverMetaProcedure could not be executed.
>
> Why did the first procedure not complete and not release the meta region lock?
> In the test logs, I found that when the assignmentManager assigned the region, it needed to call the rsgroup balancer, which had not been completely initialized, so it threw an NPE. As a result, the procedure never completed and never released the lock.
> {code:java}
> java.lang.NullPointerException
>   at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262)
>   at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162)
>   at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864)
>   at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809)
>   at org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113)
>   at org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693)
> {code}
> !bug2.png!
> As shown in the figure bug2.png listed in the attachments, when we shut down the last region server, the master submits a ServerCrashProcedure. The procedure tries to reassign the hbase:meta region, but at that moment there is no online region server, so the procedure cannot complete.
> Then we shut down the master, the ServerCrashProcedure and its subProcedures
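The chain of events in the report (a step fails with an NPE while its procedure holds the meta region lock, so a later recovery procedure can never run) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyRegionLock`, `StuckProcedureSketch`, and `runStep` are hypothetical names invented for illustration, and HBase's real procedure framework and locking are considerably more involved than this.

```java
// Toy model of the failure mode described in the report.
// None of these types exist in HBase; they are hypothetical.
class ToyRegionLock {
    private String owner; // null means the lock is free

    // Returns true if the caller now holds the lock.
    boolean tryAcquire(String procedure) {
        if (owner != null) {
            return false;
        }
        owner = procedure;
        return true;
    }

    // Releases the lock only if the caller is the current owner.
    void release(String procedure) {
        if (procedure.equals(owner)) {
            owner = null;
        }
    }

    String owner() {
        return owner;
    }
}

class StuckProcedureSketch {
    // Runs one step under the lock. When releaseOnFailure is false, an
    // exception thrown by the step leaves the lock held.
    static void runStep(ToyRegionLock lock, String procedure,
                        Runnable step, boolean releaseOnFailure) {
        if (!lock.tryAcquire(procedure)) {
            throw new IllegalStateException(procedure + " blocked by " + lock.owner());
        }
        try {
            step.run();
            lock.release(procedure);
        } catch (RuntimeException e) {
            if (releaseOnFailure) {
                lock.release(procedure);
            }
            // Swallowed so the demo can continue; real code would log or rethrow.
        }
    }

    public static void main(String[] args) {
        ToyRegionLock metaLock = new ToyRegionLock();
        Runnable failingAssign = () -> {
            throw new NullPointerException("balancer not initialized");
        };

        runStep(metaLock, "AssignProcedure", failingAssign, false);
        System.out.println("lock owner after failed assign: " + metaLock.owner());
        System.out.println("recovery can acquire: "
            + metaLock.tryAcquire("RecoverMetaProcedure"));
    }
}
```

With `releaseOnFailure` set to `false`, the lock stays owned by the failed step and the second acquire attempt is refused, which mirrors the stuck state in the report. Passing `true` frees the lock on failure, though whether releasing is the right recovery in the real system is not something this sketch can answer.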
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1655#comment-1655 ]

chenyang commented on HBASE-20919:
----------------------------------

Today (20180723), I will test again and attach a patch.
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552326#comment-16552326 ]

chenyang commented on HBASE-20919:
----------------------------------

Attached HBASE-20919.branch-2.0.1.0001.patch.
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552565#comment-16552565 ]

Hadoop QA commented on HBASE-20919:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 3s{color} | {color:red} HBASE-20919 does not apply to 0.1. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/0.7.0/precommit-patchnames for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HBASE-20919 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12932640/HBASE-20919.branch-2.0.1.0001.patch |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/13738/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |


This message was automatically generated.
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553301#comment-16553301 ]

stack commented on HBASE-20919:
-------------------------------

That is a beautiful diagram @chenyang

What is the difference between the default balancer and the rsgroup balancer that makes it NPE, sir? The default balancer is fine if it is called before it is initialized completely. Should we change the rsgroup balancer so it is the same?

The startup changes in hbase-2.1.0 with the commit of:

HBASE-20708 Remove the usage of RecoverMetaProcedure in master startup

Do you think the problem exists in 2.1+?

Patch seems fine. I'm just wary of changing startup order. It is an overly-complex process. I'm a little worried we'll bring on a new, different type of issue.

Thank you.
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553743#comment-16553743 ]

chenyang commented on HBASE-20919:
----------------------------------

Hi, Hadoop QA.

When I git clone git://git.apache.org/hbase.git, I get the errors below:

_remote: Counting objects: 640127, done.
remote: Compressing objects: 100% (132871/132871), done.
fatal: The remote end hung up unexpectedly5.15 MiB | 203.00 KiB/s
fatal: early EOF
fatal: index-pack failed_

I googled and configured my git:

_user.name=chenyang
user.email=cheny...@kuaishou.com
http.postbuffer=1148576000
core.compression=0
core.packedgitlimit=512m
core.packedgitwindowsize=512m
pack.deltacachesize=2047m
pack.packsizelimit=2047m
pack.windowmemory=2047m_

But it did not work, so I downloaded 2.0.2 hbase-2.0.1-src.tar.gz and created the patch based on my own branch-master. I think that is the reason why HBASE-20919 does not apply to 0.1.

Sorry, this is my first time submitting an issue and a patch. I will fix the git problem and re-create the patch on the right branch.
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553754#comment-16553754 ] chenyang commented on HBASE-20919: -- hi, stack. it a google suggestion. the rsgroup balancer constructs initializes rsGroupInfoManager and internalBalancer when initialized, and the default balancerStochasticLoadBalancer) does nothing. so rsgroup balancer will throw NPE if it is called before calling it`s initialize() method. I will test 2.1.0 and submit a new patch based on RSGroupBasedLoadBalancer to avoid side effects. Thank you > meta region can't be re-onlined when restarting cluster if opening rsgroup > -- > > Key: HBASE-20919 > URL: https://issues.apache.org/jira/browse/HBASE-20919 > Project: HBase > Issue Type: Bug > Components: Balancer, master, rsgroup >Affects Versions: 2.0.1 >Reporter: chenyang >Priority: Major > Attachments: HBASE-20919.branch-2.0.1.0001.patch, bug2.png, > hbase-hbase-master-bjpg-rs4730.yz02.log.test > > > if you open rsgroup, hbase-site.xml contains below configuration. > {code:java} > > hbase.coprocessor.master.classes > org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint > > > hbase.master.loadbalancer.class > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer > > {code} > And you shut down the whole HBase cluster in the way: > # first shut down region server one by one > # shut down master > Then you restart whole cluster in the way: > # start master > # start regionserver > The hbase:meta region can not be re-online and the rsgroup can not be > initialized successfully. 
> master logs: > {code:java} > 2018-07-12 18:27:08,775 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > rsgroup.RSGro > upInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come > online > 2018-07-12 18:27:08,876 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > zookeeper.Met > aTableLocator: Failed verification of hbase:meta,,1 at > address=bjpg-rs4732.yz02,60020,1531388712053, > exception=org.apache.hadoop.hbase.NotServingRegionExcepti > on: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} > The logs show that hbase:meta region is not online and rsgroup keeps retrying > to initialize. > > but why the hbase:meta region is not online? > The info-level logs and jstack had not enough infomation, so I added some > debug logs in test-source-code. Then i checked the master`s logs and region > server`s logs, and found the meta region assign procedure which hold the meta > region lock not completed and not released the lock forever, so the > recoverMetaProcedure could not be executed. 
> > Why the first procedure not completed and not released meta region lock? > In the test logs, i found when assignmentManager assigned the region, it > need to call the rsgroup balancer which have not been initialized > completely, so throw NPE. As a result, the procedure not completed and not > released the lock forever. > {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262) > at > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693) > {code} > !bug2.png! > As shown in the figure named bug2.png l
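The failure mode described in this message, a balancer whose collaborators are only created in initialize() and which is called before that, can be reproduced in miniature. The sketch below is a hypothetical stand-in, not the real RSGroupBasedLoadBalancer: the class name, the map field, and groupFor() are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a balancer whose collaborators are created in initialize().
// It is not the real RSGroupBasedLoadBalancer.
class LazyGroupBalancerSketch {
  // Stays null until initialize() runs, like a manager that is built late in startup.
  private Map<String, String> groupOfTable;

  void initialize() {
    groupOfTable = new HashMap<>();
    groupOfTable.put("hbase:meta", "default");
  }

  boolean isInitialized() {
    return groupOfTable != null;
  }

  // Dereferences the field without a guard, so an early call throws NullPointerException.
  String groupFor(String table) {
    return groupOfTable.getOrDefault(table, "default");
  }
}
```

Calling groupFor() before initialize() throws a NullPointerException; if the caller does not handle it, whatever work the caller was doing is left unfinished, which matches the stuck assign procedure reported above.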
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554338#comment-16554338 ] chenyang commented on HBASE-20919: -- Following Hadoop QA's and stack's suggestions, I recreated the patch and retested the case on branch-2.0. I deleted the old patch and submitted a new one named branch-2.0.patch, created with dev-support/submit-patch.py. > meta region can't be re-onlined when restarting cluster if opening rsgroup > -- > > Key: HBASE-20919 > URL: https://issues.apache.org/jira/browse/HBASE-20919 > Project: HBase > Issue Type: Bug > Components: Balancer, master, rsgroup >Affects Versions: 2.0.1 >Reporter: chenyang >Priority: Major > Attachments: branch-2.0.patch, bug2.png, > hbase-hbase-master-bjpg-rs4730.yz02.log.test > > > if you open rsgroup, hbase-site.xml contains below configuration. > {code:java} > <property> > <name>hbase.coprocessor.master.classes</name> > <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value> > </property> > <property> > <name>hbase.master.loadbalancer.class</name> > <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value> > </property> > {code} > And you shut down the whole HBase cluster in the way: > # first shut down region server one by one > # shut down master > Then you restart whole cluster in the way: > # start master > # start regionserver > The hbase:meta region can not be re-online and the rsgroup can not be > initialized successfully. 
> master logs: > {code:java} > 2018-07-12 18:27:08,775 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > rsgroup.RSGro > upInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come > online > 2018-07-12 18:27:08,876 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > zookeeper.Met > aTableLocator: Failed verification of hbase:meta,,1 at > address=bjpg-rs4732.yz02,60020,1531388712053, > exception=org.apache.hadoop.hbase.NotServingRegionExcepti > on: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} > The logs show that hbase:meta region is not online and rsgroup keeps retrying > to initialize. > > but why the hbase:meta region is not online? > The info-level logs and jstack had not enough infomation, so I added some > debug logs in test-source-code. Then i checked the master`s logs and region > server`s logs, and found the meta region assign procedure which hold the meta > region lock not completed and not released the lock forever, so the > recoverMetaProcedure could not be executed. 
> > Why the first procedure not completed and not released meta region lock? > In the test logs, i found when assignmentManager assigned the region, it > need to call the rsgroup balancer which have not been initialized > completely, so throw NPE. As a result, the procedure not completed and not > released the lock forever. > {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262) > at > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693) > {code} > !bug2.png! > As shown in the figure named bug2.png listed in attachments, when we shutdown > the last region server, the master submit a ServerCrashProcedure. In the > procedure, it will reassign hbase:meta region, but at that moment, there is > no online region se
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554344#comment-16554344 ] Hadoop QA commented on HBASE-20919: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 4s{color} | {color:red} HBASE-20919 does not apply to master. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/0.7.0/precommit-patchnames for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | HBASE-20919 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12932897/branch-2.0.patch | | Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/13762/console | | Powered by | Apache Yetus 0.7.0 http://yetus.apache.org | This message was automatically generated. > meta region can't be re-onlined when restarting cluster if opening rsgroup > -- > > Key: HBASE-20919 > URL: https://issues.apache.org/jira/browse/HBASE-20919 > Project: HBase > Issue Type: Bug > Components: Balancer, master, rsgroup >Affects Versions: 2.0.1 >Reporter: chenyang >Priority: Major > Attachments: branch-2.0.patch, bug2.png, > hbase-hbase-master-bjpg-rs4730.yz02.log.test > > > if you open rsgroup, hbase-site.xml contains below configuration. 
> {code:java} > > hbase.coprocessor.master.classes > org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint > > > hbase.master.loadbalancer.class > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer > > {code} > And you shut down the whole HBase cluster in the way: > # first shut down region server one by one > # shut down master > Then you restart whole cluster in the way: > # start master > # start regionserver > The hbase:meta region can not be re-online and the rsgroup can not be > initialized successfully. > master logs: > {code:java} > 2018-07-12 18:27:08,775 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > rsgroup.RSGro > upInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come > online > 2018-07-12 18:27:08,876 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > zookeeper.Met > aTableLocator: Failed verification of hbase:meta,,1 at > address=bjpg-rs4732.yz02,60020,1531388712053, > exception=org.apache.hadoop.hbase.NotServingRegionExcepti > on: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} > The logs show that hbase:meta region is not online and rsgroup keeps retrying > to initialize. > > but why the hbase:meta region is not online? > The info-level logs and jstack had not enough infomation, so I added some > debug logs in test-source-code. Then i checked the master`s logs and region > server`s logs, and found the meta region assign procedure which hold the meta > region lock not completed and not released the lock forever, so the > recoverMetaProcedure could not be executed. > > Why the first procedure not completed and not released meta region lock? > In the test logs, i found when assignmentManager assigned the region, it > need to call the rsgroup balancer which have not been initialized > completely, so throw NPE. As a result, the procedure not completed and not > released the lock forever. > {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262) > at > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554419#comment-16554419 ] chenyang commented on HBASE-20919: -- Renamed branch-2.0.patch to HBASE-20919-branch-2.0-01.patch. > meta region can't be re-onlined when restarting cluster if opening rsgroup > -- > > Key: HBASE-20919 > URL: https://issues.apache.org/jira/browse/HBASE-20919 > Project: HBase > Issue Type: Bug > Components: Balancer, master, rsgroup >Affects Versions: 2.0.1 >Reporter: chenyang >Priority: Major > Attachments: HBASE-20919-branch-2.0-01.patch, bug2.png, > hbase-hbase-master-bjpg-rs4730.yz02.log.test > > > if you open rsgroup, hbase-site.xml contains below configuration. > {code:java} > <property> > <name>hbase.coprocessor.master.classes</name> > <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value> > </property> > <property> > <name>hbase.master.loadbalancer.class</name> > <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value> > </property> > {code} > And you shut down the whole HBase cluster in the way: > # first shut down region server one by one > # shut down master > Then you restart whole cluster in the way: > # start master > # start regionserver > The hbase:meta region can not be re-online and the rsgroup can not be > initialized successfully. 
> master logs: > {code:java} > 2018-07-12 18:27:08,775 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > rsgroup.RSGro > upInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come > online > 2018-07-12 18:27:08,876 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > zookeeper.Met > aTableLocator: Failed verification of hbase:meta,,1 at > address=bjpg-rs4732.yz02,60020,1531388712053, > exception=org.apache.hadoop.hbase.NotServingRegionExcepti > on: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} > The logs show that hbase:meta region is not online and rsgroup keeps retrying > to initialize. > > but why the hbase:meta region is not online? > The info-level logs and jstack had not enough infomation, so I added some > debug logs in test-source-code. Then i checked the master`s logs and region > server`s logs, and found the meta region assign procedure which hold the meta > region lock not completed and not released the lock forever, so the > recoverMetaProcedure could not be executed. 
> > Why the first procedure not completed and not released meta region lock? > In the test logs, i found when assignmentManager assigned the region, it > need to call the rsgroup balancer which have not been initialized > completely, so throw NPE. As a result, the procedure not completed and not > released the lock forever. > {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262) > at > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693) > {code} > !bug2.png! > As shown in the figure named bug2.png listed in attachments, when we shutdown > the last region server, the master submit a ServerCrashProcedure. In the > procedure, it will reassign hbase:meta region, but at that moment, there is > no online region server, so the procedure can not be executed completely. > Then we shut down master, the ServerCrashProcedure and it`a subProcedures
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554436#comment-16554436 ] Hadoop QA commented on HBASE-20919: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} branch-2.0 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 55s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 17s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 38s{color} | {color:green} branch-2.0 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s{color} | {color:green} branch-2.0 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 13s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 12m 56s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 35s{color} | {color:green} hbase-rsgroup in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 10s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 37m 7s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 | | JIRA Issue | HBASE-20919 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12932901/HBASE-20919-branch-2.0-01.patch | | Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile | | uname | Linux 80b7d7f92dac 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh | | git revision | branch-2.0 / 5add96868a | | maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC3 | | Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/13763/testReport/ | | Max. process+thread count | 2714 (vs. ulimit of 1) | | modules | C: hbase-rsgroup U: hbase-rsgroup | | Console output | https://builds.apache.org/
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554582#comment-16554582 ] Josh Elser commented on HBASE-20919: {quote}Delete old patch. {quote} You can leave the old patches attached, [~HB-CY]! We like seeing how your patches evolve :) {quote}That is a beautiful diagram {quote} +1 on this. You have made a very nice diagram to explain this complicated setup! {code:java} @@ -156,6 +158,7 @@ public class RSGroupBasedLoadBalancer implements RSGroupableBalancer { @Override public Map<ServerName, List<RegionInfo>> roundRobinAssignment( List<RegionInfo> regions, List<ServerName> servers) throws HBaseIOException { +checkAndWaitInitialization();{code} What about failing fast here, and having the caller decide how to handle the retry logic? AssignmentManager should already have logic to do this. I am worried about this eventually converging. Specifically, RSGroupLoadBalancer doesn't get initialized until after hbase:meta gets assigned, but hbase:meta can't be assigned until the RSGroupLoadBalancer is initialized so we soft-lock. Have I missed something, [~HB-CY]? This is hard because, while I don't disagree with Stack's comment about StochasticLB to RSGroupLB, the Master using the LoadBalancer before it was initialized is bad. If you're seeing this on 2.0.1, then HBASE-20708 isn't related yet. Do you have more logs you can share? > meta region can't be re-onlined when restarting cluster if opening rsgroup > -- > > Key: HBASE-20919 > URL: https://issues.apache.org/jira/browse/HBASE-20919 > Project: HBase > Issue Type: Bug > Components: Balancer, master, rsgroup >Affects Versions: 2.0.1 >Reporter: chenyang >Priority: Major > Attachments: HBASE-20919-branch-2.0-01.patch, bug2.png, > hbase-hbase-master-bjpg-rs4730.yz02.log.test > > > if you open rsgroup, hbase-site.xml contains below configuration. 
> {code:java} > > hbase.coprocessor.master.classes > org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint > > > hbase.master.loadbalancer.class > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer > > {code} > And you shut down the whole HBase cluster in the way: > # first shut down region server one by one > # shut down master > Then you restart whole cluster in the way: > # start master > # start regionserver > The hbase:meta region can not be re-online and the rsgroup can not be > initialized successfully. > master logs: > {code:java} > 2018-07-12 18:27:08,775 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > rsgroup.RSGro > upInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come > online > 2018-07-12 18:27:08,876 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > zookeeper.Met > aTableLocator: Failed verification of hbase:meta,,1 at > address=bjpg-rs4732.yz02,60020,1531388712053, > exception=org.apache.hadoop.hbase.NotServingRegionExcepti > on: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} > The logs show that hbase:meta region is not online and rsgroup keeps retrying > to initialize. > > but why the hbase:meta region is not online? > The info-level logs and jstack had not enough infomation, so I added some > debug logs in test-source-code. Then i checked the master`s logs and region > server`s logs, and found the meta region assign procedure which hold the meta > region lock not completed and not released the lock forever, so the > recoverMetaProcedure could not be executed. > > Why the first procedure not completed and not released meta region lock? > In the test logs, i found when assignmentManager assigned the region, it > need to call the rsgroup balancer which have not been initialized > completely, so throw NPE. As a result, the procedure not completed and not > released the lock
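The fail-fast alternative raised in this comment (throw from the balancer when it is not ready and let the caller own the retry) might look roughly like the following. This is a minimal sketch under stated assumptions: FailFastBalancerSketch and RetryingAssignerSketch are hypothetical names, plain strings stand in for regions and servers, and java.io.IOException stands in for HBaseIOException. It does not show how AssignmentManager actually retries.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical balancer; IOException stands in for HBaseIOException.
class FailFastBalancerSketch {
  private volatile boolean initialized = false;

  void initialize() {
    initialized = true;
  }

  // Returns "region->server" pairs in round-robin order, or fails fast if called too early.
  List<String> roundRobinAssignment(List<String> regions, List<String> servers)
      throws IOException {
    if (!initialized) {
      throw new IOException("balancer is not initialized yet");
    }
    if (servers.isEmpty()) {
      throw new IOException("no servers to assign to");
    }
    List<String> plan = new ArrayList<>();
    for (int i = 0; i < regions.size(); i++) {
      plan.add(regions.get(i) + "->" + servers.get(i % servers.size()));
    }
    return plan;
  }
}

// Hypothetical caller that owns the retry policy instead of the balancer.
class RetryingAssignerSketch {
  static List<String> assign(FailFastBalancerSketch balancer, List<String> regions,
      List<String> servers, int maxAttempts) throws IOException {
    IOException last = new IOException("no attempt was made");
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return balancer.roundRobinAssignment(regions, servers);
      } catch (IOException e) {
        last = e; // a real caller would back off or requeue the work here
      }
    }
    throw last;
  }
}
```

Failing fast keeps the balancer from blocking a thread, but it does not by itself remove the circular dependency described above: if initialization can only finish after hbase:meta is assigned, retries alone will not converge.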
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555706#comment-16555706 ] Ted Yu commented on HBASE-20919: {code} 481 LOG.info("waiting for balancer to be initialized, checkTimes:{}", checkTimes); {code} The log can be at DEBUG level. {code} 485 } catch (InterruptedException e) { {code} Please restore interrupt state in the catch block. > meta region can't be re-onlined when restarting cluster if opening rsgroup > -- > > Key: HBASE-20919 > URL: https://issues.apache.org/jira/browse/HBASE-20919 > Project: HBase > Issue Type: Bug > Components: Balancer, master, rsgroup >Affects Versions: 2.0.1 >Reporter: chenyang >Priority: Major > Attachments: HBASE-20919-branch-2.0-01.patch, > HBASE-20919-branch-2.0-02.patch, bug2.png, > hbase-hbase-master-bjpg-rs4729.yz02.no_02patch.log, > hbase-hbase-master-bjpg-rs4729.yz02.with_02patch.log, > hbase-hbase-master-bjpg-rs4730.yz02.log.test > > > if you open rsgroup, hbase-site.xml contains below configuration. > {code:java} > <property> > <name>hbase.coprocessor.master.classes</name> > <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value> > </property> > <property> > <name>hbase.master.loadbalancer.class</name> > <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value> > </property> > {code} > And you shut down the whole HBase cluster in the way: > # first shut down region server one by one > # shut down master > Then you restart whole cluster in the way: > # start master > # start regionserver > The hbase:meta region can not be re-online and the rsgroup can not be > initialized successfully. 
> master logs: > {code:java} > 2018-07-12 18:27:08,775 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > rsgroup.RSGro > upInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come > online > 2018-07-12 18:27:08,876 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > zookeeper.Met > aTableLocator: Failed verification of hbase:meta,,1 at > address=bjpg-rs4732.yz02,60020,1531388712053, > exception=org.apache.hadoop.hbase.NotServingRegionExcepti > on: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} > The logs show that hbase:meta region is not online and rsgroup keeps retrying > to initialize. > > but why the hbase:meta region is not online? > The info-level logs and jstack had not enough infomation, so I added some > debug logs in test-source-code. Then i checked the master`s logs and region > server`s logs, and found the meta region assign procedure which hold the meta > region lock not completed and not released the lock forever, so the > recoverMetaProcedure could not be executed. 
> > Why the first procedure not completed and not released meta region lock? > In the test logs, i found when assignmentManager assigned the region, it > need to call the rsgroup balancer which have not been initialized > completely, so throw NPE. As a result, the procedure not completed and not > released the lock forever. > {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262) > at > org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113) > at > org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693) > {code} > !bug2.png! > As shown in the figure named bug2.png listed in
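The review request in this comment, restoring the interrupt state in the catch block, follows a common Java idiom. The sketch below shows it in a small polling loop; InitWaiterSketch, waitUntilInitialized, and its parameters are hypothetical names chosen for illustration and are not taken from the patch.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: polls a readiness check and preserves the thread's interrupt state.
class InitWaiterSketch {
  // Returns true once ready reports true; returns false on timeout or interruption.
  static boolean waitUntilInitialized(BooleanSupplier ready, int maxChecks, long sleepMillis) {
    for (int checkTimes = 0; checkTimes < maxChecks; checkTimes++) {
      if (ready.getAsBoolean()) {
        return true;
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        // Thread.sleep clears the interrupt flag when it throws; set it again so
        // code further up the stack can still see that an interrupt was requested.
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return ready.getAsBoolean();
  }
}
```

Swallowing the InterruptedException without this call would hide the interrupt from the caller, which is why the catch block re-asserts the flag before returning.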
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555778#comment-16555778 ] chenyang commented on HBASE-20919: -- Hi, Ted Yu. HBASE-20919-branch-2.0-01.patch is deprecated. HBASE-20919-branch-2.0-02.patch offers a better solution that does not block the current thread. Please review HBASE-20919-branch-2.0-02.patch. Thank you. > meta region can't be re-onlined when restarting cluster if opening rsgroup > -- > > Key: HBASE-20919 > URL: https://issues.apache.org/jira/browse/HBASE-20919 > Project: HBase > Issue Type: Bug > Components: Balancer, master, rsgroup >Affects Versions: 2.0.1 >Reporter: chenyang >Priority: Major > Attachments: HBASE-20919-branch-2.0-01.patch, > HBASE-20919-branch-2.0-02.patch, bug2.png, > hbase-hbase-master-bjpg-rs4729.yz02.no_02patch.log, > hbase-hbase-master-bjpg-rs4729.yz02.with_02patch.log, > hbase-hbase-master-bjpg-rs4730.yz02.log.test > > > if you open rsgroup, hbase-site.xml contains below configuration. > {code:java} > <property> > <name>hbase.coprocessor.master.classes</name> > <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value> > </property> > <property> > <name>hbase.master.loadbalancer.class</name> > <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value> > </property> > {code} > And you shut down the whole HBase cluster in the way: > # first shut down region server one by one > # shut down master > Then you restart whole cluster in the way: > # start master > # start regionserver > The hbase:meta region can not be re-online and the rsgroup can not be > initialized successfully. 
> master logs: > {code:java} > 2018-07-12 18:27:08,775 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > rsgroup.RSGro > upInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come > online > 2018-07-12 18:27:08,876 INFO > [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] > zookeeper.Met > aTableLocator: Failed verification of hbase:meta,,1 at > address=bjpg-rs4732.yz02,60020,1531388712053, > exception=org.apache.hadoop.hbase.NotServingRegionExcepti > on: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3226) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729) > at > org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > {code} > The logs show that hbase:meta region is not online and rsgroup keeps retrying > to initialize. > > but why the hbase:meta region is not online? > The info-level logs and jstack had not enough infomation, so I added some > debug logs in test-source-code. Then i checked the master`s logs and region > server`s logs, and found the meta region assign procedure which hold the meta > region lock not completed and not released the lock forever, so the > recoverMetaProcedure could not be executed. 
>
> Why did the first procedure not complete and not release the meta region lock?
> In the test logs, I found that when the AssignmentManager assigned the region, it needed to call the rsgroup balancer, which had not been completely initialized, so an NPE was thrown. As a result, the procedure never completed and never released the lock.
> {code:java}
> java.lang.NullPointerException
>     at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.generateGroupMaps(RSGroupBasedLoadBalancer.java:262)
>     at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:162)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignmentPlans(AssignmentManager.java:1864)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.processAssignQueue(AssignmentManager.java:1809)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.access$400(AssignmentManager.java:113)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager$2.run(AssignmentManager.java:1693)
> {code}
> !bug2.png!
> As shown in the figure bug2.png listed in the attachments, when we shut down the last region server, the master submits a ServerCrashProcedure. In the procedure, it will reassign the hbase:meta region, but at that moment, there is
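The root cause reported above is an unchecked exception escaping from a balancer that is called before its startup worker has finished. A minimal sketch of that difference, assuming a hypothetical `SketchGroupBalancer` class (a stand-in written for this note, not the HBase `RSGroupBasedLoadBalancer` and not the attached patch): the unguarded lookup fails with a NullPointerException, while the guarded one reports the not-ready state as a checked `IOException` that a caller has to handle.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a group-aware balancer; not the HBase class. */
class SketchGroupBalancer {
  // Stays null until the (simulated) startup worker has loaded the group map.
  private volatile Map<String, List<String>> serversByGroup;

  void initialize(Map<String, List<String>> loaded) {
    this.serversByGroup = new HashMap<String, List<String>>(loaded);
  }

  boolean isInitialized() {
    return serversByGroup != null;
  }

  /** Unguarded: dereferences the null map when called before initialize(). */
  List<String> candidatesUnguarded(String group) {
    List<String> servers = serversByGroup.get(group);
    return servers == null ? new ArrayList<String>() : new ArrayList<String>(servers);
  }

  /** Guarded: turns "not ready yet" into a checked exception the caller must handle. */
  List<String> candidatesGuarded(String group) throws IOException {
    if (!isInitialized()) {
      throw new IOException("balancer has not been initialized");
    }
    return candidatesUnguarded(group);
  }
}

class BalancerGuardDemo {
  public static void main(String[] args) {
    SketchGroupBalancer balancer = new SketchGroupBalancer();
    try {
      balancer.candidatesGuarded("default");
    } catch (IOException e) {
      // The caller can log this and retry later instead of dying on an NPE.
      System.out.println("not ready: " + e.getMessage());
    }
    Map<String, List<String>> groups = new HashMap<String, List<String>>();
    groups.put("default", Arrays.asList("rs1", "rs2"));
    balancer.initialize(groups);
    try {
      System.out.println("candidates: " + balancer.candidatesGuarded("default"));
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

The group name and server names are made up for the example. The point is only that a checked failure forces the calling code to decide what happens to the procedure and its lock.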
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555904#comment-16555904 ]

Ted Yu commented on HBASE-20919:

Checked HBASE-20919-branch-2.0-02.patch which seems fine.
Triggered QA run: https://builds.apache.org/job/PreCommit-HBASE-Build/13789/
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555922#comment-16555922 ]

Ted Yu commented on HBASE-20919:

{code}
2018-07-25 14:27:12,356 WARN [master/bjpg-rs4729:16000] assignment.AssignmentManager: unable to round-robin assignment
org.apache.hadoop.hbase.HBaseIOException: RSGroupBasedLoadBalancer has not been initialized
    at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.checkInitializedState(RSGroupBasedLoadBalancer.java:480)
{code}
Ultimately RSGroupBasedLoadBalancer would be initialized.
Shouldn't the above log be at DEBUG level since there is nothing required from operator ?
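On the log-level question for the "has not been initialized" message: a rough sketch of the retry-and-log-quietly idea, assuming a hypothetical `SketchAssignQueue` class and a `balancerReady` flag invented for this note (neither is HBase API, and this is not the 02 patch). Regions that cannot be placed yet go back on the queue, and the transient condition is logged at `Level.FINE`, the java.util.logging counterpart of DEBUG, because the operator has nothing to do.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.BooleanSupplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical assign queue; illustrates retrying instead of failing for good. */
class SketchAssignQueue {
  private static final Logger LOG = Logger.getLogger(SketchAssignQueue.class.getName());
  private final Deque<String> pending = new ArrayDeque<String>();
  private final BooleanSupplier balancerReady;

  SketchAssignQueue(BooleanSupplier balancerReady) {
    this.balancerReady = balancerReady;
  }

  void submit(String region) {
    pending.addLast(region);
  }

  int pendingCount() {
    return pending.size();
  }

  /** Runs one pass over the queue and returns how many regions were assigned. */
  int processOnce() {
    int assigned = 0;
    int batch = pending.size();
    for (int i = 0; i < batch; i++) {
      String region = pending.pollFirst();
      try {
        assign(region);
        assigned++;
      } catch (IOException e) {
        // Transient: the balancer will be ready later, so log quietly and re-queue.
        LOG.log(Level.FINE, "Balancer not ready, will retry " + region, e);
        pending.addLast(region);
      }
    }
    return assigned;
  }

  private void assign(String region) throws IOException {
    if (!balancerReady.getAsBoolean()) {
      throw new IOException("balancer has not been initialized");
    }
    // A real implementation would pick a server for the region here.
  }
}

class AssignQueueDemo {
  public static void main(String[] args) {
    final boolean[] ready = {false};
    SketchAssignQueue queue = new SketchAssignQueue(() -> ready[0]);
    queue.submit("hbase:meta,,1");
    System.out.println("first pass assigned: " + queue.processOnce());
    ready[0] = true; // simulate the balancer finishing initialization
    System.out.println("second pass assigned: " + queue.processOnce());
  }
}
```

Whether DEBUG or WARN is the better level is the open question in the comment above; the sketch only shows that the choice is independent of the retry itself.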
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556448#comment-16556448 ]

Ted Yu commented on HBASE-20919:

Triggered QA again: https://builds.apache.org/job/PreCommit-HBASE-Build/13798/

> meta region can't be re-onlined when restarting cluster if opening rsgroup
>
> Key: HBASE-20919
> URL: https://issues.apache.org/jira/browse/HBASE-20919
> Project: HBase
> Issue Type: Bug
> Components: Balancer, master, rsgroup
> Affects Versions: 2.0.1
> Reporter: chenyang
> Assignee: ChenYang
> Priority: Major
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556972#comment-16556972 ]

Ted Yu commented on HBASE-20919:

QA bot finds latest attachment but doesn't know how to handle it:
{code}
03:56:24 https://issues.apache.org/jira/secure/attachment/12933015/hbase-hbase-master-bjpg-rs4729.yz02.with_02patch.log -> Downloaded
03:56:25 ERROR: Unsure how to process HBASE-20919.
{code}
In the future, please attach logs first, patch the last.
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558512#comment-16558512 ]

Hadoop QA commented on HBASE-20919:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -0 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || branch-2.0 Compile Tests ||
| +1 | mvninstall | 2m 48s | branch-2.0 passed |
| +1 | compile | 0m 32s | branch-2.0 passed |
| +1 | checkstyle | 0m 11s | branch-2.0 passed |
| +1 | shadedjars | 3m 37s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 0m 30s | branch-2.0 passed |
| +1 | javadoc | 0m 15s | branch-2.0 passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 2m 37s | the patch passed |
| +1 | compile | 0m 31s | the patch passed |
| +1 | javac | 0m 31s | the patch passed |
| +1 | checkstyle | 0m 9s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 44s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 13s | Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. |
| +1 | findbugs | 0m 53s | the patch passed |
| +1 | javadoc | 0m 17s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 5m 11s | hbase-rsgroup in the patch passed. |
| +1 | asflicense | 0m 6s | The patch does not generate ASF License warnings. |
| | | 30m 6s | |

|| Subsystem || Report/Notes ||
| JIRA Issue | HBASE-20919 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933158/HBASE-20919-branch-2.0-02.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux asf927.gq1.ygridcore.net 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | branch-2.0 / c11f0e4 |
| maven | version: Apache Maven 3.0.5 (r01de14724cdef164cd33c7c8c2fe155faf9602da; 2013-02-19 13:51:28+) |
| Default Java | 1.8.0_172 |
| findbugs | v3.1.0-RC3 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/13815/testReport/ |
| modules | C: hbase-rsgroup U: hbase-rsgroup |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/13815/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562005#comment-16562005 ]

Ted Yu commented on HBASE-20919:

+1
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562459#comment-16562459 ]

Josh Elser commented on HBASE-20919:

{quote}Because it need start, stop, and restart whole cluster to test the case, so I don't know how to offer unit tests, do you or anyone have some suggestions?
{quote}
We have the ability to start/stop HBase services via the HBaseTestingUtility. There are lots of examples of this in the codebase already. {{hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestRegionMoveAndAbandon.java}} is one example that I did recently.
{quote}I debug the initialization of rsgroup and test some cases. The initialization process is executed in a independent Thread.
{quote}
I feel like my understanding is wrong here, then. Either the master must be able (in some case) to re-assign hbase:meta w/o consulting RSGroupLoadBalancer, or the RSGroupLoadBalancer must be able to get itself initialized without hbase:meta being available. Given my current understanding, I wouldn't know how this could ever work.

> meta region can't be re-onlined when restarting cluster if opening rsgroup
>
> Key: HBASE-20919
> URL: https://issues.apache.org/jira/browse/HBASE-20919
> Project: HBase
> Issue Type: Bug
> Components: Balancer, master, rsgroup
> Affects Versions: 2.0.1
> Reporter: chenyang
> Assignee: ChenYang
> Priority: Major
> Attachments: HBASE-20919-branch-2.0-01.patch, HBASE-20919-branch-2.0-02.patch, HBASE-20919-branch-2.0-02.patch, bug2.png, hbase-hbase-master-bjpg-rs4729.yz02.no_02patch.log, hbase-hbase-master-bjpg-rs4729.yz02.with_02patch.log, hbase-hbase-master-bjpg-rs4730.yz02.log.test
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562461#comment-16562461 ]

Josh Elser commented on HBASE-20919:

Sorry for the delay on the above responses, but I'd like to get some answers to these questions before I see this committed.
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571609#comment-16571609 ]

Josh Elser commented on HBASE-20919:

[~HB-CY], just making sure you saw my last comment had questions for you in https://issues.apache.org/jira/browse/HBASE-20919?focusedCommentId=16562459&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16562459 :)
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396191#comment-17396191 ]

yongzhi.shao commented on HBASE-20919:

We have the same problem on an HBase 2.0.2 cluster. Now the HMaster cannot start.

From reading the previous comments, I think this patch is not very stable, so I don't want to use it unless it is necessary.

How can I work around this problem in the meantime? Does anybody have an idea?

> meta region can't be re-onlined when restarting cluster if opening rsgroup
>
> Key: HBASE-20919
> URL: https://issues.apache.org/jira/browse/HBASE-20919
> Project: HBase
> Issue Type: Bug
> Components: Balancer, master, rsgroup
> Affects Versions: 2.0.1
> Reporter: chenyang
> Assignee: ChenYang
> Priority: Major
> Labels: balancer
> Attachments: HBASE-20919-branch-2.0-01.patch, HBASE-20919-branch-2.0-02.patch, HBASE-20919-branch-2.0-02.patch, bug2.png, hbase-hbase-master-bjpg-rs4729.yz02.no_02patch.log, hbase-hbase-master-bjpg-rs4729.yz02.with_02patch.log, hbase-hbase-master-bjpg-rs4730.yz02.log.test
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073008#comment-17073008 ]

krish7919 commented on HBASE-20919:

Is this fixed yet?
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073011#comment-17073011 ]

Josh Elser commented on HBASE-20919:

[~krish7919], as you can read, the issue was unresolved. Since we've not heard back from chenyang, I've marked this as incomplete. Someone can reopen this if the problem still exists in current (non-EOL) versions of 2.x.
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555265#comment-16555265 ]

chenyang commented on HBASE-20919:

Hi [~elserj], thanks for your suggestions.

Q: "What about failing fast here, and having the caller decide how to handle the retry logic? AssignmentManager should already have logic to do this."

A: Failing fast is a better solution. AssignmentManager catches HBaseIOException and re-adds the regions to the pending assignment queue. The code below is from the processAssignmentPlans() method:
{code:java}
try {
  acceptPlan(regions, balancer.retainAssignment(retainMap, servers));
} catch (HBaseIOException e) {
  LOG.warn("unable to retain assignment", e);
  addToPendingAssignment(regions, retainMap.keySet());
}
// or
try {
  acceptPlan(regions, balancer.roundRobinAssignment(hris, servers));
} catch (HBaseIOException e) {
  LOG.warn("unable to round-robin assignment", e);
  addToPendingAssignment(regions, hris);
}
{code}
I will submit a new patch which implements failing fast.

Q: "RSGroupLoadBalancer doesn't get initialized until after hbase:meta gets assigned, but hbase:meta can't be assigned until the RSGroupLoadBalancer is initialized so we soft-lock."

A: I debugged the initialization of rsgroup and tested some cases. The initialization process is executed in an independent thread. So far I have not found a soft-lock, but I think it is still a risk.

Q: "This is hard because, while I don't disagree with Stack's comment about StochasticLB to RSGroupLB, the Master using the LoadBalancer before it was initialized is bad"

A: According to my tests, it works to initialize the balancers before calling startServiceThreads, which starts the ProcedureExecutor, in HMaster's finishActiveMasterInitialization method. But I cannot be sure it is OK for other cases; it may need more testing. So I think modifying RSGroupBasedLoadBalancer carries lower risk. I will re-submit the patch which initializes balancers before calling startServiceThreads, for reference only.

Q: "Do you have more logs you can share?"

A: I will provide the whole logs and steps along with the new patch.

Because the case requires starting, stopping, and restarting the whole master, I don't know how to offer unit tests. Do you or anyone have suggestions?

> meta region can't be re-onlined when restarting cluster if opening rsgroup
>
> Key: HBASE-20919
> URL: https://issues.apache.org/jira/browse/HBASE-20919
> Project: HBase
> Issue Type: Bug
> Components: Balancer, master, rsgroup
> Affects Versions: 2.0.1
> Reporter: chenyang
> Priority: Major
> Attachments: HBASE-20919-branch-2.0-01.patch, bug2.png, hbase-hbase-master-bjpg-rs4730.yz02.log.test
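The reasoning behind failing fast in the comment above appears to be that processAssignmentPlans() only catches HBaseIOException, so a NullPointerException thrown by an uninitialized balancer would bypass the re-queue path entirely. The following is a minimal, self-contained sketch of that difference, not the actual patch. Every name in it (FailFastAssignmentSketch, GroupBalancer, BalancerNotReadyException, tryAssign) is a hypothetical stand-in rather than an HBase API, and plain IOException stands in for HBaseIOException.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class FailFastAssignmentSketch {

  /** Hypothetical checked error meaning "balancer is not ready yet". */
  static final class BalancerNotReadyException extends IOException {
    BalancerNotReadyException(String message) {
      super(message);
    }
  }

  /** Hypothetical balancer whose server list stays null until initialize() runs. */
  static final class GroupBalancer {
    private final boolean failFast;
    private List<String> groupServers;

    GroupBalancer(boolean failFast) {
      this.failFast = failFast;
    }

    void initialize(List<String> servers) {
      this.groupServers = new ArrayList<>(servers);
    }

    String assign(String region) throws IOException {
      if (failFast && groupServers == null) {
        // Fail fast with a checked exception the caller already knows how to retry.
        throw new BalancerNotReadyException("balancer has not been initialized");
      }
      // Without the guard, the next line dereferences null and throws an NPE.
      int index = Math.floorMod(region.hashCode(), groupServers.size());
      return groupServers.get(index);
    }
  }

  /** Caller side: only IOException is caught, so only that path re-queues the region. */
  static boolean tryAssign(GroupBalancer balancer, String region, Deque<String> pending) {
    try {
      balancer.assign(region);
      return true;
    } catch (IOException e) {
      pending.add(region); // retry later
      return false;
    }
  }

  public static void main(String[] args) {
    Deque<String> pending = new ArrayDeque<>();

    try {
      tryAssign(new GroupBalancer(false), "hbase:meta", pending);
    } catch (NullPointerException e) {
      // The unchecked exception escaped tryAssign, so nothing was re-queued.
      System.out.println("unguarded: NPE escaped, pending=" + pending.size());
    }

    GroupBalancer guarded = new GroupBalancer(true);
    System.out.println("guarded before init: assigned=" + tryAssign(guarded, "hbase:meta", pending));
    guarded.initialize(Arrays.asList("rs1", "rs2"));
    System.out.println("guarded after init: assigned=" + tryAssign(guarded, pending.poll(), pending));
  }
}
```

In this sketch the guarded balancer turns "not ready" into an ordinary retry, while the unguarded one aborts the caller. Whether the real procedure framework behaves the same way after an NPE is something only the HBase source and the attached logs can confirm.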
[jira] [Commented] (HBASE-20919) meta region can't be re-onlined when restarting cluster if opening rsgroup
[ https://issues.apache.org/jira/browse/HBASE-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555316#comment-16555316 ] chenyang commented on HBASE-20919: -- Submit HBASE-20919-branch-2.0-02.patch which implements fast failing and logs. Testing steps(there are one master and one rs): 1: start master without 02.patch 2: start rs without 02.patch now, the cluster works fine 3: stop rs 4: stop master 5: restart master 6: restart rs now, the hbase:meta region can not be assign successfully. 7: stop rs 8: stop master 9: apply 02.path to branch-2.0, recompile hbase-rsgroup module and replace hbase-rsgroup-2.0.2-SNAPSHOT.jar with new version which includes 02.patch 10: restart master 11: restart rs now, the hbase:meta region can be assign successfully. cluster works fine. Logs: hbase-hbase-master-bjpg-rs4729.yz02.log.no_02patch includes logs across 1 to 8 steps. In the log file, you can see RSGroupInfoManagerImpl$RSGroupStartupWorker kept trying to check wether meta region is online, but failed every time. 
{code:java}
2018-07-25 12:15:15,064 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4729.yz02,16000,1532491114549] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=bjpg-rs4736.yz02,16020,1532490935452, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on bjpg-rs4736.yz02,16020,1532491949108
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3246)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3223)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegionInfo(RSRpcServices.java:1729)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:28286)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
{code}

hbase-hbase-master-bjpg-rs4729.yz02.log.with_02patch covers steps 9 to 10. In this log file, you can see that the hbase:meta region was eventually assigned successfully after failing several times.
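The "fast failing" behavior can be pictured with a small, self-contained sketch. It is not the patch itself: SketchGroupBalancer, FailFastDemo, and assignWithRetry are hypothetical names, and the flag flip that stands in for rsgroup initialization is an assumption made only so the example terminates. It shows the general pattern, in which an uninitialized balancer throws immediately instead of blocking and the caller retries until the balancer is ready.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a load balancer whose group metadata loads asynchronously.
class SketchGroupBalancer {
    private volatile boolean initialized = false;

    void markInitialized() {
        initialized = true;
    }

    // Fail fast: report the uninitialized state right away rather than waiting on it.
    private void checkInitializedState() throws IOException {
        if (!initialized) {
            throw new IOException("balancer has not been initialized");
        }
    }

    // Picks a server for the region, but only once the balancer is initialized.
    String assign(String region, List<String> servers) throws IOException {
        checkInitializedState();
        if (servers.isEmpty()) {
            throw new IOException("no servers available");
        }
        return servers.get(Math.floorMod(region.hashCode(), servers.size()));
    }
}

class FailFastDemo {
    // Retries the assignment and returns how many attempts failed before one succeeded.
    static int assignWithRetry(SketchGroupBalancer balancer, List<String> servers,
                               int readyAfterFailures, int maxAttempts) throws IOException {
        int failures = 0;
        while (true) {
            try {
                balancer.assign("hbase:meta,,1", servers);
                return failures;
            } catch (IOException e) {
                failures++;
                if (failures >= maxAttempts) {
                    throw e;
                }
                // Assumption: simulates background initialization finishing after some retries.
                if (failures >= readyAfterFailures) {
                    balancer.markInitialized();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> servers = Collections.singletonList("rs1,16020,1");
        int failures = assignWithRetry(new SketchGroupBalancer(), servers, 3, 10);
        System.out.println("assigned after " + failures + " failed attempts");
    }
}
```

Running main prints "assigned after 3 failed attempts". The log excerpt that follows shows the actual warning raised while the balancer was still uninitialized.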
{code:java}
2018-07-25 14:27:12,356 WARN [master/bjpg-rs4729:16000] rsgroup.RSGroupBasedLoadBalancer: RSGroupBasedLoadBalancer has not been initialized
org.apache.hadoop.hbase.HBaseIOException: RSGroupBasedLoadBalancer has not been initialized
    at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.checkInitializedState(RSGroupBasedLoadBalancer.java:480)
    at org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer.roundRobinAssignment(RSGroupBasedLoadBalancer.java:161)
{code}

> meta region can't be re-onlined when restarting cluster if opening rsgroup
> --------------------------------------------------------------------------
>
> Key: HBASE-20919
> URL: https://issues.apache.org/jira/browse/HBASE-20919
> Project: HBase
> Issue Type: Bug
> Components: Balancer, master, rsgroup
> Affects Versions: 2.0.1
> Reporter: chenyang
> Priority: Major
> Attachments: HBASE-20919-branch-2.0-01.patch, bug2.png, hbase-hbase-master-bjpg-rs4730.yz02.log.test
>
>
> if you open rsgroup, hbase-site.xml contains below configuration.
> {code:java}
> <property>
>   <name>hbase.coprocessor.master.classes</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint</value>
> </property>
> <property>
>   <name>hbase.master.loadbalancer.class</name>
>   <value>org.apache.hadoop.hbase.rsgroup.RSGroupBasedLoadBalancer</value>
> </property>
> {code}
> And you shut down the whole HBase cluster in the way:
> # first shut down region server one by one
> # shut down master
> Then you restart whole cluster in the way:
> # start master
> # start regionserver
> The hbase:meta region can not be re-online and the rsgroup can not be initialized successfully.
> master logs:
> {code:java}
> 2018-07-12 18:27:08,775 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker: Waiting for catalog tables to come online
> 2018-07-12 18:27:08,876 INFO [org.apache.hadoop.hbase.rsgroup.RSGroupInfoManagerImpl$RSGroupStartupWorker-bjpg-rs4730.yz02,16000,1531389637409] zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=bjpg-rs4732.yz02,60020,1531388712053, exception=org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on bjpg-rs4732.yz02,60020,1531389727928
>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3249)
>     at org.apache.hadoop.hbase.regionserver.HRegionS